Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-02-16 Thread Heiner Kallweit
On 15.02.2019 01:31, Frederic Weisbecker wrote:
> On Thu, Feb 14, 2019 at 11:33:44PM +0100, Heiner Kallweit wrote:
>> On 14.02.2019 22:47, Thomas Gleixner wrote:
>>> On Thu, 14 Feb 2019, Heiner Kallweit wrote:
 On 24.01.2019 20:37, Heiner Kallweit wrote:
>>
> Hi Frederic,
> is the update debug patch still to come?
>>>
>>> Heiner, there is a related fix queued in
>>>
>>>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
>>>
>>> See also:
>>> https://lkml.kernel.org/r/20190128234625.78241-3-...@chromium.org
>>> https://lkml.kernel.org/r/20190128234625.78241-2-...@chromium.org
>>>
>>> Can you please have a look?
>>>
>> Great. And sure, I can test it over the weekend.
> 
> Thanks you, and sorry for the shutdown, I was buried into softirq code lately.
> 
No worries, appreciate your efforts. I tested and wasn't able to reproduce
the issue. Looks good ..

Thanks, Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-02-14 Thread Frederic Weisbecker
On Thu, Feb 14, 2019 at 11:33:44PM +0100, Heiner Kallweit wrote:
> On 14.02.2019 22:47, Thomas Gleixner wrote:
> > On Thu, 14 Feb 2019, Heiner Kallweit wrote:
> >> On 24.01.2019 20:37, Heiner Kallweit wrote:
> 
> >>> Hi Frederic,
> >>> is the update debug patch still to come?
> > 
> > Heiner, there is a related fix queued in
> > 
> >git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
> > 
> > See also:
> > https://lkml.kernel.org/r/20190128234625.78241-3-...@chromium.org
> > https://lkml.kernel.org/r/20190128234625.78241-2-...@chromium.org
> > 
> > Can you please have a look?
> > 
> Great. And sure, I can test it over the weekend.

Thanks you, and sorry for the shutdown, I was buried into softirq code lately.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-02-14 Thread Heiner Kallweit
On 14.02.2019 22:47, Thomas Gleixner wrote:
> On Thu, 14 Feb 2019, Heiner Kallweit wrote:
>> On 24.01.2019 20:37, Heiner Kallweit wrote:

>>> Hi Frederic,
>>> is the update debug patch still to come?
> 
> Heiner, there is a related fix queued in
> 
>git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core
> 
> See also:
> https://lkml.kernel.org/r/20190128234625.78241-3-...@chromium.org
> https://lkml.kernel.org/r/20190128234625.78241-2-...@chromium.org
> 
> Can you please have a look?
> 
Great. And sure, I can test it over the weekend.

> Thanks,
> 
>   tglx
> 
Rgds, Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-02-14 Thread Thomas Gleixner
On Thu, 14 Feb 2019, Heiner Kallweit wrote:
> On 24.01.2019 20:37, Heiner Kallweit wrote:
> >>
> > Hi Frederic,
> > is the update debug patch still to come?

Heiner, there is a related fix queued in

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git irq/core

See also:
https://lkml.kernel.org/r/20190128234625.78241-3-...@chromium.org
https://lkml.kernel.org/r/20190128234625.78241-2-...@chromium.org

Can you please have a look?

Thanks,

tglx


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-02-14 Thread Heiner Kallweit
On 24.01.2019 20:37, Heiner Kallweit wrote:
> On 16.01.2019 07:24, Frederic Weisbecker wrote:
>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>>
>>> # tracer: nop
>>> #
>>> #  _-=> irqs-off
>>> # / _=> need-resched
>>> #| / _---=> hardirq/softirq
>>> #|| / _--=> preempt-depth
>>> #||| / delay
>>> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>>> #  | |   |      | |
>>>   -0 [001] d.h2  1479.099092: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.099098: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.099106: softirq_raise: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.099114: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.099120: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.099121: softirq_entry: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.099134: softirq_exit: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.099135: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.099141: softirq_exit: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.100094: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.100109: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.100116: softirq_exit: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.101091: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.101113: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.101118: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.102094: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.102114: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.102121: softirq_exit: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.103091: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.103097: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.103105: softirq_raise: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.103114: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.103118: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.103119: softirq_entry: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.103131: softirq_exit: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.103132: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.103138: softirq_exit: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.105092: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.105115: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.105119: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.106092: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.106112: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] .Ns2  1479.106144: softirq_exit: vec=9 
>>> [action=RCU]
>>>  cpuhp/1-13[001] d..2  1479.106279: timer_cancel: 
>>> timer=09a25653
>>>   -0 [001] d.h2  1479.106965: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.106969: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.106974: softirq_raise: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.106981: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.106984: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.106985: softirq_entry: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.106994: softirq_exit: vec=7 
>>> [action=SCHED]
>>>   -0 [001] ..s2  1479.106995: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.106999: softirq_exit: vec=9 
>>> [action=RCU]
>>>   -0 [001] d.h2  1479.107996: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.108010: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.108014: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.109009: softirq_raise: vec=1 
>>> [action=TIMER]
>>>   -0 [001] d.h2  1479.109013: softirq_raise: vec=9 
>>> [action=RCU]
>>>   -0 [001] ..s2  1479.109024: softirq_entry: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.109028: softirq_exit: vec=1 
>>> [action=TIMER]
>>>   -0 [001] ..s2  1479.109028: softirq_entry: vec=9 
>>> [action=RCU]
>>>   -0 [001] 

Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-01-24 Thread Heiner Kallweit
On 16.01.2019 07:24, Frederic Weisbecker wrote:
> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>
>> # tracer: nop
>> #
>> #  _-=> irqs-off
>> # / _=> need-resched
>> #| / _---=> hardirq/softirq
>> #|| / _--=> preempt-depth
>> #||| / delay
>> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>> #  | |   |      | |
>>   -0 [001] d.h2  1479.099092: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.099098: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.099106: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099114: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.099120: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.099121: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099134: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099135: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.099141: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.100094: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.100109: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.100116: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.101091: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.101113: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.101118: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.102094: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.102114: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.102121: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.103091: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.103097: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.103105: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103114: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.103118: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.103119: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103131: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103132: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.103138: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.105092: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.105115: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.105119: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.106092: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.106112: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] .Ns2  1479.106144: softirq_exit: vec=9 
>> [action=RCU]
>>  cpuhp/1-13[001] d..2  1479.106279: timer_cancel: 
>> timer=09a25653
>>   -0 [001] d.h2  1479.106965: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.106969: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.106974: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106981: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.106984: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.106985: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106994: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106995: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.106999: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.107996: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.108010: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.108014: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.109009: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.109013: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.109024: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.109028: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.109028: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.109033: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.110013: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 

Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-01-16 Thread Heiner Kallweit
On 16.01.2019 07:24, Frederic Weisbecker wrote:
> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>
>> # tracer: nop
>> #
>> #  _-=> irqs-off
>> # / _=> need-resched
>> #| / _---=> hardirq/softirq
>> #|| / _--=> preempt-depth
>> #||| / delay
>> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
>> #  | |   |      | |
>>   -0 [001] d.h2  1479.099092: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.099098: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.099106: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099114: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.099120: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.099121: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099134: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.099135: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.099141: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.100094: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.100109: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.100116: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.101091: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.101113: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.101118: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.102094: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.102114: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.102121: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.103091: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.103097: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.103105: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103114: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.103118: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.103119: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103131: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.103132: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.103138: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.105092: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.105115: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.105119: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.106092: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.106112: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] .Ns2  1479.106144: softirq_exit: vec=9 
>> [action=RCU]
>>  cpuhp/1-13[001] d..2  1479.106279: timer_cancel: 
>> timer=09a25653
>>   -0 [001] d.h2  1479.106965: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.106969: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.106974: softirq_raise: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106981: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.106984: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.106985: softirq_entry: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106994: softirq_exit: vec=7 
>> [action=SCHED]
>>   -0 [001] ..s2  1479.106995: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.106999: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.107996: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.108010: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.108014: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.109009: softirq_raise: vec=1 
>> [action=TIMER]
>>   -0 [001] d.h2  1479.109013: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.109024: softirq_entry: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.109028: softirq_exit: vec=1 
>> [action=TIMER]
>>   -0 [001] ..s2  1479.109028: softirq_entry: vec=9 
>> [action=RCU]
>>   -0 [001] ..s2  1479.109033: softirq_exit: vec=9 
>> [action=RCU]
>>   -0 [001] d.h2  1479.110013: softirq_raise: vec=9 
>> [action=RCU]
>>   -0 

Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-01-15 Thread Frederic Weisbecker
On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
> 
> # tracer: nop
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |
>   -0 [001] d.h2  1479.099092: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.099098: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.099106: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.099114: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.099120: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.099121: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.099134: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.099135: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.099141: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.100094: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.100109: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.100116: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.101091: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.101113: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.101118: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.102094: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.102114: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.102121: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.103091: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.103097: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.103105: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.103114: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.103118: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.103119: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.103131: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.103132: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.103138: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.105092: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.105115: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.105119: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.106092: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.106112: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] .Ns2  1479.106144: softirq_exit: vec=9 
> [action=RCU]
>  cpuhp/1-13[001] d..2  1479.106279: timer_cancel: 
> timer=09a25653
>   -0 [001] d.h2  1479.106965: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.106969: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.106974: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.106981: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.106984: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.106985: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.106994: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.106995: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.106999: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.107996: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.108010: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.108014: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.109009: softirq_raise: vec=1 
> [action=TIMER]
>   -0 [001] d.h2  1479.109013: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.109024: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.109028: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.109028: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.109033: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.110013: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.110033: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] ..s2  1479.110040: softirq_exit: vec=9 
> [action=RCU]
>   -0 [001] 

Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-01-11 Thread Frederic Weisbecker
On Wed, Jan 09, 2019 at 11:20:50PM +0100, Heiner Kallweit wrote:
> On 28.12.2018 07:39, Heiner Kallweit wrote:
> > On 28.12.2018 07:34, Heiner Kallweit wrote:
> >> On 28.12.2018 02:31, Frederic Weisbecker wrote:
> >>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
> 
> >> [...]
> >>>
> >>> Interesting, the softirq is raised from hardirq but it's not handled in 
> >>> the end of
> >>> the IRQ. Are you running threaded IRQS by any chance? If so I would 
> >>> expect ksoftirqd
> >>> to handle the pending work before we go idle. However I can imagine a 
> >>> small window
> >>> where such an expectation may not be met: if the softirq is raised after 
> >>> the ksoftirqd
> >>> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we 
> >>> disable the CPU
> >>> (CPUHP_TEARDOWN_CPU).
> >>>
> >> I have a network driver (r8169) using NAPI which runs in softirq context 
> >> AFAIK.
> >> For testing purposes I sometimes trigger system suspend via network, so 
> >> there is
> >> network adapter activity when system suspends. Apart from that nothing 
> >> really
> >> exciting:
> >> CPU0   CPU1   CPU2   CPU3
> >>0: 43  0  0  0   IO-APIC2-edge  
> >> timer
> >>1:  4  0  0  0   IO-APIC1-edge  
> >> i8042
> >>8:  0  1  0  0   IO-APIC8-fasteoi   
> >> rtc0
> >>9:  0  0  0  0   IO-APIC9-fasteoi   
> >> acpi
> >>   12:  0  0  0  5   IO-APIC   12-edge  
> >> i8042
> >>  120:  0  0  0  0   PCI-MSI 311296-edge
> >>   PCIe PME
> >>  121:  0  0  0  0   PCI-MSI 315392-edge
> >>   PCIe PME
> >>  122:  0  0  0  0   PCI-MSI 327680-edge
> >>   PCIe PME
> >>  123:  0  0   3328  0   PCI-MSI 294912-edge
> >>   ahci[:00:12.0]
> >>  124:  0133  0  0   PCI-MSI 344064-edge
> >>   xhci_hcd
> >>  125:  0  0 32  0   PCI-MSI 245760-edge
> >>   mei_me
> >>  127:381  0  0  0   PCI-MSI 1572864-edge   
> >>enp3s0
> >>  128:  0  0  0236   PCI-MSI 32768-edge 
> >>  i915
> >>  129:  0374  0  0   PCI-MSI 229376-edge
> >>   snd_hda_intel:card0
> >>
> >>> I don't know if we can afford to ignore a softirq even at this late 
> >>> stage. We should
> >>> probably avoid leaking any. So here is a possible fix, if you don't mind 
> >>> trying:
> >>>
> >> I tested your patch and at least in the first minutes of testing couldn't 
> >> reproduce
> >> the issue any longer. I tested manual system suspend and the following 
> >> script you
> >> sent when we started to analyze the issue.
> >>
> > 
> > Also after some more time the issue didn't occur again. So it seems your 
> > analysis
> > was right and also the approach to fix it. Thanks!
> > Will let you know in case the issue should pop up again under special
> > circumstances.
> > 
> Frederic, so far this fix didn't appear in linux-next, are you going to 
> submit it?

Yep, I'll cook up a proper changelog and let Thomas judge if the change is 
worth.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2019-01-09 Thread Heiner Kallweit
On 28.12.2018 07:39, Heiner Kallweit wrote:
> On 28.12.2018 07:34, Heiner Kallweit wrote:
>> On 28.12.2018 02:31, Frederic Weisbecker wrote:
>>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:

>> [...]
>>>
>>> Interesting, the softirq is raised from hardirq but it's not handled in the 
>>> end of
>>> the IRQ. Are you running threaded IRQS by any chance? If so I would expect 
>>> ksoftirqd
>>> to handle the pending work before we go idle. However I can imagine a small 
>>> window
>>> where such an expectation may not be met: if the softirq is raised after 
>>> the ksoftirqd
>>> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we 
>>> disable the CPU
>>> (CPUHP_TEARDOWN_CPU).
>>>
>> I have a network driver (r8169) using NAPI which runs in softirq context 
>> AFAIK.
>> For testing purposes I sometimes trigger system suspend via network, so 
>> there is
>> network adapter activity when system suspends. Apart from that nothing really
>> exciting:
>> CPU0   CPU1   CPU2   CPU3
>>0: 43  0  0  0   IO-APIC2-edge  
>> timer
>>1:  4  0  0  0   IO-APIC1-edge  
>> i8042
>>8:  0  1  0  0   IO-APIC8-fasteoi   
>> rtc0
>>9:  0  0  0  0   IO-APIC9-fasteoi   
>> acpi
>>   12:  0  0  0  5   IO-APIC   12-edge  
>> i8042
>>  120:  0  0  0  0   PCI-MSI 311296-edge  
>> PCIe PME
>>  121:  0  0  0  0   PCI-MSI 315392-edge  
>> PCIe PME
>>  122:  0  0  0  0   PCI-MSI 327680-edge  
>> PCIe PME
>>  123:  0  0   3328  0   PCI-MSI 294912-edge  
>> ahci[:00:12.0]
>>  124:  0133  0  0   PCI-MSI 344064-edge  
>> xhci_hcd
>>  125:  0  0 32  0   PCI-MSI 245760-edge  
>> mei_me
>>  127:381  0  0  0   PCI-MSI 1572864-edge 
>>  enp3s0
>>  128:  0  0  0236   PCI-MSI 32768-edge  
>> i915
>>  129:  0374  0  0   PCI-MSI 229376-edge  
>> snd_hda_intel:card0
>>
>>> I don't know if we can afford to ignore a softirq even at this late stage. 
>>> We should
>>> probably avoid leaking any. So here is a possible fix, if you don't mind 
>>> trying:
>>>
>> I tested your patch and at least in the first minutes of testing couldn't 
>> reproduce
>> the issue any longer. I tested manual system suspend and the following 
>> script you
>> sent when we started to analyze the issue.
>>
> 
> Also after some more time the issue didn't occur again. So it seems your 
> analysis
> was right and also the approach to fix it. Thanks!
> Will let you know in case the issue should pop up again under special
> circumstances.
> 
Frederic, so far this fix didn't appear in linux-next, are you going to submit 
it?

> 
>> Heiner
>>
>> --
>>
>> #!/bin/bash
>>
>> do_hotplug()
>> {
>>  for i in $(seq 1 $2)
>>  do
>>  echo $1 > /sys/devices/system/cpu/cpu$i/online
>>  done
>> }
>>
>> LAST_CPU=$(($(nproc)-1))
>>
>> while true
>> do
>>  do_hotplug 0 $LAST_CPU
>>  do_hotplug 1 $LAST_CPU
>> done
>>
> 



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-27 Thread Heiner Kallweit
On 28.12.2018 07:34, Heiner Kallweit wrote:
> On 28.12.2018 02:31, Frederic Weisbecker wrote:
>> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>>
> [...]
>>
>> Interesting, the softirq is raised from hardirq but it's not handled in the 
>> end of
>> the IRQ. Are you running threaded IRQS by any chance? If so I would expect 
>> ksoftirqd
>> to handle the pending work before we go idle. However I can imagine a small 
>> window
>> where such an expectation may not be met: if the softirq is raised after the 
>> ksoftirqd
>> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we 
>> disable the CPU
>> (CPUHP_TEARDOWN_CPU).
>>
> I have a network driver (r8169) using NAPI which runs in softirq context 
> AFAIK.
> For testing purposes I sometimes trigger system suspend via network, so there 
> is
> network adapter activity when system suspends. Apart from that nothing really
> exciting:
> CPU0   CPU1   CPU2   CPU3
>0: 43  0  0  0   IO-APIC2-edge  
> timer
>1:  4  0  0  0   IO-APIC1-edge  
> i8042
>8:  0  1  0  0   IO-APIC8-fasteoi   
> rtc0
>9:  0  0  0  0   IO-APIC9-fasteoi   
> acpi
>   12:  0  0  0  5   IO-APIC   12-edge  
> i8042
>  120:  0  0  0  0   PCI-MSI 311296-edge  
> PCIe PME
>  121:  0  0  0  0   PCI-MSI 315392-edge  
> PCIe PME
>  122:  0  0  0  0   PCI-MSI 327680-edge  
> PCIe PME
>  123:  0  0   3328  0   PCI-MSI 294912-edge  
> ahci[:00:12.0]
>  124:  0133  0  0   PCI-MSI 344064-edge  
> xhci_hcd
>  125:  0  0 32  0   PCI-MSI 245760-edge  
> mei_me
>  127:381  0  0  0   PCI-MSI 1572864-edge  
> enp3s0
>  128:  0  0  0236   PCI-MSI 32768-edge  
> i915
>  129:  0374  0  0   PCI-MSI 229376-edge  
> snd_hda_intel:card0
> 
>> I don't know if we can afford to ignore a softirq even at this late stage. 
>> We should
>> probably avoid leaking any. So here is a possible fix, if you don't mind 
>> trying:
>>
> I tested your patch and at least in the first minutes of testing couldn't 
> reproduce
> the issue any longer. I tested manual system suspend and the following script 
> you
> sent when we started to analyze the issue.
> 

Also after some more time the issue didn't occur again. So it seems your 
analysis
was right and also the approach to fix it. Thanks!
Will let you know in case the issue should pop up again under special
circumstances.


> Heiner
> 
> --
> 
> #!/bin/bash
> 
> do_hotplug()
> {
>   for i in $(seq 1 $2)
>   do
>   echo $1 > /sys/devices/system/cpu/cpu$i/online
>   done
> }
> 
> LAST_CPU=$(($(nproc)-1))
> 
> while true
> do
>   do_hotplug 0 $LAST_CPU
>   do_hotplug 1 $LAST_CPU
> done
> 



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-27 Thread Heiner Kallweit
On 28.12.2018 02:31, Frederic Weisbecker wrote:
> On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
>>
[...]
> 
> Interesting, the softirq is raised from hardirq but it's not handled in the 
> end of
> the IRQ. Are you running threaded IRQS by any chance? If so I would expect 
> ksoftirqd
> to handle the pending work before we go idle. However I can imagine a small 
> window
> where such an expectation may not be met: if the softirq is raised after the 
> ksoftirqd
> thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we disable 
> the CPU
> (CPUHP_TEARDOWN_CPU).
> 
I have a network driver (r8169) using NAPI which runs in softirq context AFAIK.
For testing purposes I sometimes trigger system suspend via network, so there is
network adapter activity when system suspends. Apart from that nothing really
exciting:
CPU0   CPU1   CPU2   CPU3
   0: 43  0  0  0   IO-APIC2-edge  timer
   1:  4  0  0  0   IO-APIC1-edge  i8042
   8:  0  1  0  0   IO-APIC8-fasteoi   rtc0
   9:  0  0  0  0   IO-APIC9-fasteoi   acpi
  12:  0  0  0  5   IO-APIC   12-edge  i8042
 120:  0  0  0  0   PCI-MSI 311296-edge  
PCIe PME
 121:  0  0  0  0   PCI-MSI 315392-edge  
PCIe PME
 122:  0  0  0  0   PCI-MSI 327680-edge  
PCIe PME
 123:  0  0   3328  0   PCI-MSI 294912-edge  
ahci[:00:12.0]
 124:  0133  0  0   PCI-MSI 344064-edge  
xhci_hcd
 125:  0  0 32  0   PCI-MSI 245760-edge  
mei_me
 127:381  0  0  0   PCI-MSI 1572864-edge  
enp3s0
 128:  0  0  0236   PCI-MSI 32768-edge  i915
 129:  0374  0  0   PCI-MSI 229376-edge  
snd_hda_intel:card0

> I don't know if we can afford to ignore a softirq even at this late stage. We 
> should
> probably avoid leaking any. So here is a possible fix, if you don't mind 
> trying:
> 
I tested your patch and at least in the first minutes of testing couldn't 
reproduce
the issue any longer. I tested manual system suspend and the following script 
you
sent when we started to analyze the issue.

Heiner

--

#!/bin/bash

do_hotplug()
{
for i in $(seq 1 $2)
do
echo $1 > /sys/devices/system/cpu/cpu$i/online
done
}

LAST_CPU=$(($(nproc)-1))

while true
do
do_hotplug 0 $LAST_CPU
do_hotplug 1 $LAST_CPU
done


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-27 Thread Frederic Weisbecker
On Fri, Dec 28, 2018 at 12:11:12AM +0100, Heiner Kallweit wrote:
> 
> OK, did as you advised and here comes the trace. That's the related dmesg 
> part:
> 
> [ 1479.025092] x86: Booting SMP configuration:
> [ 1479.025129] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [ 1479.094715] NOHZ: local_softirq_pending 202
> [ 1479.096557] smpboot: CPU 1 is now offline
> 
> Hope it helps.
> Heiner
> 
> 
> # tracer: nop
> #
> #  _-=> irqs-off
> # / _=> need-resched
> #| / _---=> hardirq/softirq
> #|| / _--=> preempt-depth
> #||| / delay
> #   TASK-PID   CPU#  TIMESTAMP  FUNCTION
> #  | |   |      | |
[...]
>   -0 [001] d.h2  1479.111017: softirq_raise: vec=9 
> [action=RCU]
>   -0 [001] d.h2  1479.111026: softirq_raise: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.111035: softirq_entry: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.111040: softirq_exit: vec=1 
> [action=TIMER]
>   -0 [001] ..s2  1479.111040: softirq_entry: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.111052: softirq_exit: vec=7 
> [action=SCHED]
>   -0 [001] ..s2  1479.111052: softirq_entry: vec=9 
> [action=RCU]
>   -0 [001] .Ns2  1479.111079: softirq_exit: vec=9 
> [action=RCU]
>  cpuhp/1-13[001] dNh2  1479.112930: softirq_raise: vec=1 
> [action=TIMER]
>  cpuhp/1-13[001] dNh2  1479.112935: softirq_raise: vec=9 
> [action=RCU]

Interesting, the softirq is raised from hardirq but it's not handled in the end 
of
the IRQ. Are you running threaded IRQS by any chance? If so I would expect 
ksoftirqd
to handle the pending work before we go idle. However I can imagine a small 
window
where such an expectation may not be met: if the softirq is raised after the 
ksoftirqd
thread is parked (CPUHP_AP_SMPBOOT_THREADS), which is right before we disable 
the CPU
(CPUHP_TEARDOWN_CPU).

I don't know if we can afford to ignore a softirq even at this late stage. We 
should
probably avoid leaking any. So here is a possible fix, if you don't mind trying:

diff --git a/kernel/softirq.c b/kernel/softirq.c
index d288133..716096b 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -56,6 +56,7 @@ EXPORT_PER_CPU_SYMBOL(irq_stat);
 static struct softirq_action softirq_vec[NR_SOFTIRQS] 
__cacheline_aligned_in_smp;
 
 DEFINE_PER_CPU(struct task_struct *, ksoftirqd);
+DEFINE_PER_CPU(int, ksoftirqd_parked);
 
 const char * const softirq_to_name[NR_SOFTIRQS] = {
"HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
@@ -363,7 +364,7 @@ static inline void invoke_softirq(void)
if (ksoftirqd_running(local_softirq_pending()))
return;
 
-   if (!force_irqthreads) {
+   if (!force_irqthreads || __this_cpu_read(ksoftirqd_parked)) {
 #ifdef CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK
/*
 * We can safely execute softirq on the current stack if
@@ -659,6 +660,22 @@ static void run_ksoftirqd(unsigned int cpu)
local_irq_enable();
 }
 
+static void ksoftirqd_park(unsigned int cpu)
+{
+   local_irq_disable();
+   __this_cpu_write(ksoftirqd_parked, 1);
+
+   if (local_softirq_pending())
+   run_ksoftirqd(cpu);
+
+   local_irq_enable();
+}
+
+static void ksoftirqd_unpark(unsigned int cpu)
+{
+   __this_cpu_write(ksoftirqd_parked, 0);
+}
+
 #ifdef CONFIG_HOTPLUG_CPU
 /*
  * tasklet_kill_immediate is called to remove a tasklet which can already be
@@ -724,6 +741,8 @@ static int takeover_tasklets(unsigned int cpu)
 static struct smp_hotplug_thread softirq_threads = {
.store  = ,
.thread_should_run  = ksoftirqd_should_run,
+   .park   = ksoftirqd_park,
+   .unpark = ksoftirqd_unpark,
.thread_fn  = run_ksoftirqd,
.thread_comm= "ksoftirqd/%u",
 };


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-27 Thread Heiner Kallweit
On 27.12.2018 07:53, Frederic Weisbecker wrote:
> On Mon, Oct 15, 2018 at 10:58:54PM +0200, Heiner Kallweit wrote:
>> On 28.09.2018 15:18, Frederic Weisbecker wrote:
>>> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
 On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
>> I tested it and Frederic is right, it doesn't help. Can it be somehow 
>> related to
>> the cpu being brought down during suspend? Because I get the warning 
>> only during
>> suspend when the cpu is inactive already (but still online).
>
> It's hard to tell, I haven't been able to reproduce on suspend to 
> disk/mem.
>
> Does this script eventually trigger it after some time?

 Any update to this?
>>>
>>> Heiner? Can you please test the script I sent to you?
>>>
>>> Thanks.
>>>
>> Sorry, took some time .. And yes, running your script triggers the message 
>> too.
>>
[...]
> 
> Sorry, I got sidetracked and almost forgot about it.
> 
> So this is triggered by CPU hotplug. At some point the CPU has an
> opportunity to go idle and for some reason the timer softirq is still
> pending. We need to know which timer this is about and why the timer
> softirq keeps pending.
> 
> I'm going to need your help again. Can you please run the following (possibly
> repeat until it triggers the bug) ?
> 
>echo 1 > /sys/devices/system/cpu/cpu1/online
> 
># Pause and reset tracing
>echo 0 > /sys/kernel/debug/tracing/tracing_on
>echo > /sys/kernel/debug/tracing/trace
> 
># Turn on relevant events
>echo 1 > /sys/kernel/debug/tracing/events/timer/timer_*/enable
>echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_*/enable
>echo 1 > /sys/kernel/debug/tracing/tracing_on
> 
># Trigger
>echo 0 > /sys/devices/system/cpu/cpu1/online
> 
>echo 0 > /sys/kernel/debug/tracing/tracing_on
> 
> And please apply the following patch before. With all that I'll have the
> relevant informations stored in /sys/kernel/debug/tracing/per_cpu/cpu1/trace
> Please send its content to me. Thanks!
> 
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 69e673b..0e57a3b 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -892,6 +892,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
> *ts)
>   (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
>   pr_warn("NOHZ: local_softirq_pending %02x\n",
>   (unsigned int) local_softirq_pending());
> + trace_dump_stack(0);
>   ratelimit++;
>   }
>   return false;
> 
> 

OK, did as you advised and here comes the trace. That's the related dmesg part:

[ 1479.025092] x86: Booting SMP configuration:
[ 1479.025129] smpboot: Booting Node 0 Processor 1 APIC 0x2
[ 1479.094715] NOHZ: local_softirq_pending 202
[ 1479.096557] smpboot: CPU 1 is now offline

Hope it helps.
Heiner


# tracer: nop
#
#  _-=> irqs-off
# / _=> need-resched
#| / _---=> hardirq/softirq
#|| / _--=> preempt-depth
#||| / delay
#   TASK-PID   CPU#  TIMESTAMP  FUNCTION
#  | |   |      | |
  -0 [001] d.h2  1479.099092: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.h2  1479.099098: softirq_raise: vec=9 
[action=RCU]
  -0 [001] d.h2  1479.099106: softirq_raise: vec=7 
[action=SCHED]
  -0 [001] ..s2  1479.099114: softirq_entry: vec=1 
[action=TIMER]
  -0 [001] ..s2  1479.099120: softirq_exit: vec=1 
[action=TIMER]
  -0 [001] ..s2  1479.099121: softirq_entry: vec=7 
[action=SCHED]
  -0 [001] ..s2  1479.099134: softirq_exit: vec=7 
[action=SCHED]
  -0 [001] ..s2  1479.099135: softirq_entry: vec=9 
[action=RCU]
  -0 [001] ..s2  1479.099141: softirq_exit: vec=9 [action=RCU]
  -0 [001] d.h2  1479.100094: softirq_raise: vec=9 
[action=RCU]
  -0 [001] ..s2  1479.100109: softirq_entry: vec=9 
[action=RCU]
  -0 [001] ..s2  1479.100116: softirq_exit: vec=9 [action=RCU]
  -0 [001] d.h2  1479.101091: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] ..s2  1479.101113: softirq_entry: vec=1 
[action=TIMER]
  -0 [001] ..s2  1479.101118: softirq_exit: vec=1 
[action=TIMER]
  -0 [001] d.h2  1479.102094: softirq_raise: vec=9 
[action=RCU]
  -0 [001] ..s2  1479.102114: softirq_entry: vec=9 
[action=RCU]
  -0 [001] ..s2  1479.102121: softirq_exit: vec=9 [action=RCU]
  -0 [001] d.h2  1479.103091: softirq_raise: vec=1 
[action=TIMER]
  -0 [001] d.h2  1479.103097: softirq_raise: vec=9 
[action=RCU]
  -0 

Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-26 Thread Frederic Weisbecker
On Mon, Oct 15, 2018 at 10:58:54PM +0200, Heiner Kallweit wrote:
> On 28.09.2018 15:18, Frederic Weisbecker wrote:
> > On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
> >> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> >>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
>  I tested it and Frederic is right, it doesn't help. Can it be somehow 
>  related to
>  the cpu being brought down during suspend? Because I get the warning 
>  only during
>  suspend when the cpu is inactive already (but still online).
> >>>
> >>> It's hard to tell, I haven't been able to reproduce on suspend to 
> >>> disk/mem.
> >>>
> >>> Does this script eventually trigger it after some time?
> >>
> >> Any update to this?
> > 
> > Heiner? Can you please test the script I sent to you?
> > 
> > Thanks.
> > 
> Sorry, took some time .. And yes, running your script triggers the message 
> too.
> 
> [   25.646015] x86: Booting SMP configuration:
> [   25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.664491] smpboot: CPU 1 is now offline
> [   25.679299] x86: Booting SMP configuration:
> [   25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.698449] smpboot: CPU 1 is now offline
> [   25.711698] x86: Booting SMP configuration:
> [   25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.729185] NOHZ: local_softirq_pending 202
> [   25.729229] NOHZ: local_softirq_pending 202
> [   25.730759] smpboot: CPU 1 is now offline
> [   25.744053] x86: Booting SMP configuration:
> [   25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.762520] smpboot: CPU 1 is now offline
> [   25.776834] x86: Booting SMP configuration:
> [   25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.794189] NOHZ: local_softirq_pending 202
> [   25.796672] smpboot: CPU 1 is now offline
> [   25.805970] x86: Booting SMP configuration:
> [   25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.827360] smpboot: CPU 1 is now offline
> [   25.839043] x86: Booting SMP configuration:
> [   25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.858184] NOHZ: local_softirq_pending 202
> [   25.862182] smpboot: CPU 1 is now offline
> [   25.873759] x86: Booting SMP configuration:
> [   25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.893385] smpboot: CPU 1 is now offline
> 

Sorry, I got sidetracked and almost forgot about it.

So this is triggered by CPU hotplug. At some point the CPU has an
opportunity to go idle and for some reason the timer softirq is still
pending. We need to know which timer this is about and why the timer
softirq keeps pending.

I'm going to need your help again. Can you please run the following (possibly
repeat until it triggers the bug) ?

   echo 1 > /sys/devices/system/cpu/cpu1/online

   # Pause and reset tracing
   echo 0 > /sys/kernel/debug/tracing/tracing_on
   echo > /sys/kernel/debug/tracing/trace

   # Turn on relevant events
   echo 1 > /sys/kernel/debug/tracing/events/timer/timer_*/enable
   echo 1 > /sys/kernel/debug/tracing/events/irq/softirq_*/enable
   echo 1 > /sys/kernel/debug/tracing/tracing_on

   # Trigger
   echo 0 > /sys/devices/system/cpu/cpu1/online

   echo 0 > /sys/kernel/debug/tracing/tracing_on

And please apply the following patch before. With all that I'll have the
relevant informations stored in /sys/kernel/debug/tracing/per_cpu/cpu1/trace
Please send its content to me. Thanks!

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 69e673b..0e57a3b 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -892,6 +892,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
*ts)
(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
pr_warn("NOHZ: local_softirq_pending %02x\n",
(unsigned int) local_softirq_pending());
+   trace_dump_stack(0);
ratelimit++;
}
return false;



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-12-24 Thread Heiner Kallweit
On 15.10.2018 22:58, Heiner Kallweit wrote:
> On 28.09.2018 15:18, Frederic Weisbecker wrote:
>> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
>>> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
 On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> I tested it and Frederic is right, it doesn't help. Can it be somehow 
> related to
> the cpu being brought down during suspend? Because I get the warning only 
> during
> suspend when the cpu is inactive already (but still online).

 It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.

 Does this script eventually trigger it after some time?
>>>
>>> Any update to this?
>>
>> Heiner? Can you please test the script I sent to you?
>>
>> Thanks.
>>
> Sorry, took some time .. And yes, running your script triggers the message 
> too.
> 
> [   25.646015] x86: Booting SMP configuration:
> [   25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.664491] smpboot: CPU 1 is now offline
> [   25.679299] x86: Booting SMP configuration:
> [   25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.698449] smpboot: CPU 1 is now offline
> [   25.711698] x86: Booting SMP configuration:
> [   25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.729185] NOHZ: local_softirq_pending 202
> [   25.729229] NOHZ: local_softirq_pending 202
> [   25.730759] smpboot: CPU 1 is now offline
> [   25.744053] x86: Booting SMP configuration:
> [   25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.762520] smpboot: CPU 1 is now offline
> [   25.776834] x86: Booting SMP configuration:
> [   25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.794189] NOHZ: local_softirq_pending 202
> [   25.796672] smpboot: CPU 1 is now offline
> [   25.805970] x86: Booting SMP configuration:
> [   25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.827360] smpboot: CPU 1 is now offline
> [   25.839043] x86: Booting SMP configuration:
> [   25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.858184] NOHZ: local_softirq_pending 202
> [   25.862182] smpboot: CPU 1 is now offline
> [   25.873759] x86: Booting SMP configuration:
> [   25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
> [   25.893385] smpboot: CPU 1 is now offline
> 
Almost forgot about his topic, but the warning is still there.
Has there been any progress in analysis?


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-10-15 Thread Heiner Kallweit
On 28.09.2018 15:18, Frederic Weisbecker wrote:
> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
>> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
>>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
 I tested it and Frederic is right, it doesn't help. Can it be somehow 
 related to
 the cpu being brought down during suspend? Because I get the warning only 
 during
 suspend when the cpu is inactive already (but still online).
>>>
>>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
>>>
>>> Does this script eventually trigger it after some time?
>>
>> Any update to this?
> 
> Heiner? Can you please test the script I sent to you?
> 
> Thanks.
> 
Sorry, took some time .. And yes, running your script triggers the message too.

[   25.646015] x86: Booting SMP configuration:
[   25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.664491] smpboot: CPU 1 is now offline
[   25.679299] x86: Booting SMP configuration:
[   25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.698449] smpboot: CPU 1 is now offline
[   25.711698] x86: Booting SMP configuration:
[   25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.729185] NOHZ: local_softirq_pending 202
[   25.729229] NOHZ: local_softirq_pending 202
[   25.730759] smpboot: CPU 1 is now offline
[   25.744053] x86: Booting SMP configuration:
[   25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.762520] smpboot: CPU 1 is now offline
[   25.776834] x86: Booting SMP configuration:
[   25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.794189] NOHZ: local_softirq_pending 202
[   25.796672] smpboot: CPU 1 is now offline
[   25.805970] x86: Booting SMP configuration:
[   25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.827360] smpboot: CPU 1 is now offline
[   25.839043] x86: Booting SMP configuration:
[   25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.858184] NOHZ: local_softirq_pending 202
[   25.862182] smpboot: CPU 1 is now offline
[   25.873759] x86: Booting SMP configuration:
[   25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.893385] smpboot: CPU 1 is now offline



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-10-15 Thread Heiner Kallweit
On 28.09.2018 15:18, Frederic Weisbecker wrote:
> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
>> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
>>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
 I tested it and Frederic is right, it doesn't help. Can it be somehow 
 related to
 the cpu being brought down during suspend? Because I get the warning only 
 during
 suspend when the cpu is inactive already (but still online).
>>>
>>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
>>>
>>> Does this script eventually trigger it after some time?
>>
>> Any update to this?
> 
> Heiner? Can you please test the script I sent to you?
> 
> Thanks.
> 
Sorry, took some time .. And yes, running your script triggers the message too.

[   25.646015] x86: Booting SMP configuration:
[   25.646044] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.664491] smpboot: CPU 1 is now offline
[   25.679299] x86: Booting SMP configuration:
[   25.679329] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.698449] smpboot: CPU 1 is now offline
[   25.711698] x86: Booting SMP configuration:
[   25.711727] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.729185] NOHZ: local_softirq_pending 202
[   25.729229] NOHZ: local_softirq_pending 202
[   25.730759] smpboot: CPU 1 is now offline
[   25.744053] x86: Booting SMP configuration:
[   25.744083] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.762520] smpboot: CPU 1 is now offline
[   25.776834] x86: Booting SMP configuration:
[   25.776863] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.794189] NOHZ: local_softirq_pending 202
[   25.796672] smpboot: CPU 1 is now offline
[   25.805970] x86: Booting SMP configuration:
[   25.805999] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.827360] smpboot: CPU 1 is now offline
[   25.839043] x86: Booting SMP configuration:
[   25.839073] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.858184] NOHZ: local_softirq_pending 202
[   25.862182] smpboot: CPU 1 is now offline
[   25.873759] x86: Booting SMP configuration:
[   25.873789] smpboot: Booting Node 0 Processor 1 APIC 0x2
[   25.893385] smpboot: CPU 1 is now offline



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-28 Thread Heiner Kallweit
On 28.09.2018 15:18, Frederic Weisbecker wrote:
> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
>> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
>>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
 I tested it and Frederic is right, it doesn't help. Can it be somehow 
 related to
 the cpu being brought down during suspend? Because I get the warning only 
 during
 suspend when the cpu is inactive already (but still online).
>>>
>>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
>>>
>>> Does this script eventually trigger it after some time?
>>
>> Any update to this?
> 
> Heiner? Can you please test the script I sent to you?
> 
> Thanks.
> 
Sorry, this wasn't on my radar any longer. Next week I'm on
travel, I can do it the week after.

Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-28 Thread Heiner Kallweit
On 28.09.2018 15:18, Frederic Weisbecker wrote:
> On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
>> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
>>> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
 I tested it and Frederic is right, it doesn't help. Can it be somehow 
 related to
 the cpu being brought down during suspend? Because I get the warning only 
 during
 suspend when the cpu is inactive already (but still online).
>>>
>>> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
>>>
>>> Does this script eventually trigger it after some time?
>>
>> Any update to this?
> 
> Heiner? Can you please test the script I sent to you?
> 
> Thanks.
> 
Sorry, this wasn't on my radar any longer. Next week I'm on
travel, I can do it the week after.

Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-28 Thread Frederic Weisbecker
On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> > On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> > > I tested it and Frederic is right, it doesn't help. Can it be somehow 
> > > related to
> > > the cpu being brought down during suspend? Because I get the warning only 
> > > during
> > > suspend when the cpu is inactive already (but still online).
> > 
> > It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> > 
> > Does this script eventually trigger it after some time?
> 
> Any update to this?

Heiner? Can you please test the script I sent to you?

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-28 Thread Frederic Weisbecker
On Thu, Sep 27, 2018 at 06:05:46PM +0200, Thomas Gleixner wrote:
> On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> > On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> > > I tested it and Frederic is right, it doesn't help. Can it be somehow 
> > > related to
> > > the cpu being brought down during suspend? Because I get the warning only 
> > > during
> > > suspend when the cpu is inactive already (but still online).
> > 
> > It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> > 
> > Does this script eventually trigger it after some time?
> 
> Any update to this?

Heiner? Can you please test the script I sent to you?

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-27 Thread Thomas Gleixner
On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> > I tested it and Frederic is right, it doesn't help. Can it be somehow 
> > related to
> > the cpu being brought down during suspend? Because I get the warning only 
> > during
> > suspend when the cpu is inactive already (but still online).
> 
> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> 
> Does this script eventually trigger it after some time?

Any update to this?

Thanks,

tglx


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-09-27 Thread Thomas Gleixner
On Tue, 28 Aug 2018, Frederic Weisbecker wrote:
> On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> > I tested it and Frederic is right, it doesn't help. Can it be somehow 
> > related to
> > the cpu being brought down during suspend? Because I get the warning only 
> > during
> > suspend when the cpu is inactive already (but still online).
> 
> It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.
> 
> Does this script eventually trigger it after some time?

Any update to this?

Thanks,

tglx


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-27 Thread Frederic Weisbecker
On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> On 24.08.2018 16:30, Frederic Weisbecker wrote:
> >> Can you try the one I posted in this thread:
> >>
> >>  
> >> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
> >>
> >> Also below for reference.
> >>
> >> Thanks,
> >>
> >>tglx
> >>
> >> 8<
> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> >> index 5b33e2f5c0ed..6aab9d54a331 100644
> >> --- a/kernel/time/tick-sched.c
> >> +++ b/kernel/time/tick-sched.c
> >> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct 
> >> tick_sched *ts)
> >>if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> >>static int ratelimit;
> >>  
> >> -  if (ratelimit < 10 &&
> >> +  if (ratelimit < 10 && !in_softirq() &&
> >>(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> >>pr_warn("NOHZ: local_softirq_pending %02x\n",
> >>(unsigned int) local_softirq_pending());
> > 
> > I fear it may not work in his case because it happens in -next and we don't 
> > stop
> > the idle tick from IRQ tail anymore. So we shouldn't be interrupting a 
> > softirq
> > in this path. Still it's worth trying, I may well be missing something.
> > 
> > Thanks.
> > 
> I tested it and Frederic is right, it doesn't help. Can it be somehow related 
> to
> the cpu being brought down during suspend? Because I get the warning only 
> during
> suspend when the cpu is inactive already (but still online).

It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.

Does this script eventually trigger it after some time?

#!/bin/bash

do_hotplug()
{
for i in $(seq 1 $2)
do
echo $1 > /sys/devices/system/cpu/cpu$i/online
done
}

LAST_CPU=$(($(nproc)-1))

while true
do
do_hotplug 0 $LAST_CPU
do_hotplug 1 $LAST_CPU
done


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-27 Thread Frederic Weisbecker
On Fri, Aug 24, 2018 at 07:06:32PM +0200, Heiner Kallweit wrote:
> On 24.08.2018 16:30, Frederic Weisbecker wrote:
> >> Can you try the one I posted in this thread:
> >>
> >>  
> >> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
> >>
> >> Also below for reference.
> >>
> >> Thanks,
> >>
> >>tglx
> >>
> >> 8<
> >> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> >> index 5b33e2f5c0ed..6aab9d54a331 100644
> >> --- a/kernel/time/tick-sched.c
> >> +++ b/kernel/time/tick-sched.c
> >> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct 
> >> tick_sched *ts)
> >>if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
> >>static int ratelimit;
> >>  
> >> -  if (ratelimit < 10 &&
> >> +  if (ratelimit < 10 && !in_softirq() &&
> >>(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
> >>pr_warn("NOHZ: local_softirq_pending %02x\n",
> >>(unsigned int) local_softirq_pending());
> > 
> > I fear it may not work in his case because it happens in -next and we don't 
> > stop
> > the idle tick from IRQ tail anymore. So we shouldn't be interrupting a 
> > softirq
> > in this path. Still it's worth trying, I may well be missing something.
> > 
> > Thanks.
> > 
> I tested it and Frederic is right, it doesn't help. Can it be somehow related 
> to
> the cpu being brought down during suspend? Because I get the warning only 
> during
> suspend when the cpu is inactive already (but still online).

It's hard to tell, I haven't been able to reproduce on suspend to disk/mem.

Does this script eventually trigger it after some time?

#!/bin/bash

do_hotplug()
{
for i in $(seq 1 $2)
do
echo $1 > /sys/devices/system/cpu/cpu$i/online
done
}

LAST_CPU=$(($(nproc)-1))

while true
do
do_hotplug 0 $LAST_CPU
do_hotplug 1 $LAST_CPU
done


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Heiner Kallweit
On 24.08.2018 16:30, Frederic Weisbecker wrote:
> On Fri, Aug 24, 2018 at 10:01:35AM +0200, Thomas Gleixner wrote:
>> On Fri, 24 Aug 2018, Heiner Kallweit wrote:
>>> On 24.08.2018 06:12, Frederic Weisbecker wrote:
 On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
>
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.
>
> Rgds, Heiner
>
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> seconds) done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline

 I've tried to reproduce with suspend on disk but got unsuccessful.

 A small question as I see someone is having a similar issue with a stable
 release only. On which kernel did you trigger that: upstream or stable?

 I'll continue investigating.

 Thanks.

>>> Affected is recent linux-next, after the commit mentioned in the subject.
>>> I can work around the warning (not sure whether it's a proper fix),
>>> see here:
>>> https://lkml.org/lkml/2018/8/18/272
>>
>> Can you try the one I posted in this thread:
>>
>>  
>> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
>>
>> Also below for reference.
>>
>> Thanks,
>>
>>  tglx
>>
>> 8<
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 5b33e2f5c0ed..6aab9d54a331 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct 
>> tick_sched *ts)
>>  if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>>  static int ratelimit;
>>  
>> -if (ratelimit < 10 &&
>> +if (ratelimit < 10 && !in_softirq() &&
>>  (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
>>  pr_warn("NOHZ: local_softirq_pending %02x\n",
>>  (unsigned int) local_softirq_pending());
> 
> I fear it may not work in his case because it happens in -next and we don't 
> stop
> the idle tick from IRQ tail anymore. So we shouldn't be interrupting a softirq
> in this path. Still it's worth trying, I may well be missing something.
> 
> Thanks.
> 
I tested it and Frederic is right, it doesn't help. Can it be somehow related to
the cpu being brought down during suspend? Because I get the warning only during
suspend when the cpu is inactive already (but still online).


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Heiner Kallweit
On 24.08.2018 16:30, Frederic Weisbecker wrote:
> On Fri, Aug 24, 2018 at 10:01:35AM +0200, Thomas Gleixner wrote:
>> On Fri, 24 Aug 2018, Heiner Kallweit wrote:
>>> On 24.08.2018 06:12, Frederic Weisbecker wrote:
 On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
>
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.
>
> Rgds, Heiner
>
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> seconds) done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline

 I've tried to reproduce with suspend on disk but got unsuccessful.

 A small question as I see someone is having a similar issue with a stable
 release only. On which kernel did you trigger that: upstream or stable?

 I'll continue investigating.

 Thanks.

>>> Affected is recent linux-next, after the commit mentioned in the subject.
>>> I can work around the warning (not sure whether it's a proper fix),
>>> see here:
>>> https://lkml.org/lkml/2018/8/18/272
>>
>> Can you try the one I posted in this thread:
>>
>>  
>> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
>>
>> Also below for reference.
>>
>> Thanks,
>>
>>  tglx
>>
>> 8<
>> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
>> index 5b33e2f5c0ed..6aab9d54a331 100644
>> --- a/kernel/time/tick-sched.c
>> +++ b/kernel/time/tick-sched.c
>> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct 
>> tick_sched *ts)
>>  if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>>  static int ratelimit;
>>  
>> -if (ratelimit < 10 &&
>> +if (ratelimit < 10 && !in_softirq() &&
>>  (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
>>  pr_warn("NOHZ: local_softirq_pending %02x\n",
>>  (unsigned int) local_softirq_pending());
> 
> I fear it may not work in his case because it happens in -next and we don't 
> stop
> the idle tick from IRQ tail anymore. So we shouldn't be interrupting a softirq
> in this path. Still it's worth trying, I may well be missing something.
> 
> Thanks.
> 
I tested it and Frederic is right, it doesn't help. Can it be somehow related to
the cpu being brought down during suspend? Because I get the warning only during
suspend when the cpu is inactive already (but still online).


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Frederic Weisbecker
On Fri, Aug 24, 2018 at 10:01:35AM +0200, Thomas Gleixner wrote:
> On Fri, 24 Aug 2018, Heiner Kallweit wrote:
> > On 24.08.2018 06:12, Frederic Weisbecker wrote:
> > > On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> > >> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> > >> I think it's related to mentioned commit (didn't bisect it yet).
> > >> See log from suspending.
> > >>
> > >> I have no reason to think the fix is wrong, it may just have revealed
> > >> another issue which existed before and was hidden by the bug.
> > >>
> > >> Rgds, Heiner
> > >>
> > >> [   75.073353] random: crng init done
> > >> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> > >> [   78.619564] PM: suspend entry (deep)
> > >> [   78.619675] PM: Syncing filesystems ... done.
> > >> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> > >> done.
> > >> [   78.656094] OOM killer disabled.
> > >> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> > >> seconds) done.
> > >> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> > >> [   78.663066] nuvoton-cir 00:07: disabled
> > >> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > >> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> > >> [   78.786651] ACPI: Preparing to enter system sleep state S3
> > >> [   78.789613] PM: Saving platform NVS memory
> > >> [   78.789759] Disabling non-boot CPUs ...
> > >> [   78.805154] NOHZ: local_softirq_pending 202
> > >> [   78.805182] NOHZ: local_softirq_pending 202
> > >> [   78.807102] smpboot: CPU 1 is now offline
> > > 
> > > I've tried to reproduce with suspend on disk but got unsuccessful.
> > > 
> > > A small question as I see someone is having a similar issue with a stable
> > > release only. On which kernel did you trigger that: upstream or stable?
> > > 
> > > I'll continue investigating.
> > > 
> > > Thanks.
> > > 
> > Affected is recent linux-next, after the commit mentioned in the subject.
> > I can work around the warning (not sure whether it's a proper fix),
> > see here:
> > https://lkml.org/lkml/2018/8/18/272
> 
> Can you try the one I posted in this thread:
> 
>  
> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
> 
> Also below for reference.
> 
> Thanks,
> 
>   tglx
> 
> 8<
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 5b33e2f5c0ed..6aab9d54a331 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
> *ts)
>   if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>   static int ratelimit;
>  
> - if (ratelimit < 10 &&
> + if (ratelimit < 10 && !in_softirq() &&
>   (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
>   pr_warn("NOHZ: local_softirq_pending %02x\n",
>   (unsigned int) local_softirq_pending());

I fear it may not work in his case because it happens in -next and we don't stop
the idle tick from IRQ tail anymore. So we shouldn't be interrupting a softirq
in this path. Still it's worth trying, I may well be missing something.

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Frederic Weisbecker
On Fri, Aug 24, 2018 at 10:01:35AM +0200, Thomas Gleixner wrote:
> On Fri, 24 Aug 2018, Heiner Kallweit wrote:
> > On 24.08.2018 06:12, Frederic Weisbecker wrote:
> > > On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> > >> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> > >> I think it's related to mentioned commit (didn't bisect it yet).
> > >> See log from suspending.
> > >>
> > >> I have no reason to think the fix is wrong, it may just have revealed
> > >> another issue which existed before and was hidden by the bug.
> > >>
> > >> Rgds, Heiner
> > >>
> > >> [   75.073353] random: crng init done
> > >> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> > >> [   78.619564] PM: suspend entry (deep)
> > >> [   78.619675] PM: Syncing filesystems ... done.
> > >> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> > >> done.
> > >> [   78.656094] OOM killer disabled.
> > >> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> > >> seconds) done.
> > >> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> > >> [   78.663066] nuvoton-cir 00:07: disabled
> > >> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> > >> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> > >> [   78.786651] ACPI: Preparing to enter system sleep state S3
> > >> [   78.789613] PM: Saving platform NVS memory
> > >> [   78.789759] Disabling non-boot CPUs ...
> > >> [   78.805154] NOHZ: local_softirq_pending 202
> > >> [   78.805182] NOHZ: local_softirq_pending 202
> > >> [   78.807102] smpboot: CPU 1 is now offline
> > > 
> > > I've tried to reproduce with suspend on disk but got unsuccessful.
> > > 
> > > A small question as I see someone is having a similar issue with a stable
> > > release only. On which kernel did you trigger that: upstream or stable?
> > > 
> > > I'll continue investigating.
> > > 
> > > Thanks.
> > > 
> > Affected is recent linux-next, after the commit mentioned in the subject.
> > I can work around the warning (not sure whether it's a proper fix),
> > see here:
> > https://lkml.org/lkml/2018/8/18/272
> 
> Can you try the one I posted in this thread:
> 
>  
> https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de
> 
> Also below for reference.
> 
> Thanks,
> 
>   tglx
> 
> 8<
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 5b33e2f5c0ed..6aab9d54a331 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
> *ts)
>   if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
>   static int ratelimit;
>  
> - if (ratelimit < 10 &&
> + if (ratelimit < 10 && !in_softirq() &&
>   (local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
>   pr_warn("NOHZ: local_softirq_pending %02x\n",
>   (unsigned int) local_softirq_pending());

I fear it may not work in his case because it happens in -next and we don't stop
the idle tick from IRQ tail anymore. So we shouldn't be interrupting a softirq
in this path. Still it's worth trying, I may well be missing something.

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Thomas Gleixner
On Fri, 24 Aug 2018, Heiner Kallweit wrote:
> On 24.08.2018 06:12, Frederic Weisbecker wrote:
> > On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> >> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> >> I think it's related to mentioned commit (didn't bisect it yet).
> >> See log from suspending.
> >>
> >> I have no reason to think the fix is wrong, it may just have revealed
> >> another issue which existed before and was hidden by the bug.
> >>
> >> Rgds, Heiner
> >>
> >> [   75.073353] random: crng init done
> >> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> >> [   78.619564] PM: suspend entry (deep)
> >> [   78.619675] PM: Syncing filesystems ... done.
> >> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> >> done.
> >> [   78.656094] OOM killer disabled.
> >> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> >> seconds) done.
> >> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> >> [   78.663066] nuvoton-cir 00:07: disabled
> >> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> >> [   78.786651] ACPI: Preparing to enter system sleep state S3
> >> [   78.789613] PM: Saving platform NVS memory
> >> [   78.789759] Disabling non-boot CPUs ...
> >> [   78.805154] NOHZ: local_softirq_pending 202
> >> [   78.805182] NOHZ: local_softirq_pending 202
> >> [   78.807102] smpboot: CPU 1 is now offline
> > 
> > I've tried to reproduce with suspend on disk but got unsuccessful.
> > 
> > A small question as I see someone is having a similar issue with a stable
> > release only. On which kernel did you trigger that: upstream or stable?
> > 
> > I'll continue investigating.
> > 
> > Thanks.
> > 
> Affected is recent linux-next, after the commit mentioned in the subject.
> I can work around the warning (not sure whether it's a proper fix),
> see here:
> https://lkml.org/lkml/2018/8/18/272

Can you try the one I posted in this thread:

 
https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de

Also below for reference.

Thanks,

tglx

8<
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5b33e2f5c0ed..6aab9d54a331 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
*ts)
if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
static int ratelimit;
 
-   if (ratelimit < 10 &&
+   if (ratelimit < 10 && !in_softirq() &&
(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
pr_warn("NOHZ: local_softirq_pending %02x\n",
(unsigned int) local_softirq_pending());


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Thomas Gleixner
On Fri, 24 Aug 2018, Heiner Kallweit wrote:
> On 24.08.2018 06:12, Frederic Weisbecker wrote:
> > On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> >> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> >> I think it's related to mentioned commit (didn't bisect it yet).
> >> See log from suspending.
> >>
> >> I have no reason to think the fix is wrong, it may just have revealed
> >> another issue which existed before and was hidden by the bug.
> >>
> >> Rgds, Heiner
> >>
> >> [   75.073353] random: crng init done
> >> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> >> [   78.619564] PM: suspend entry (deep)
> >> [   78.619675] PM: Syncing filesystems ... done.
> >> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
> >> done.
> >> [   78.656094] OOM killer disabled.
> >> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
> >> seconds) done.
> >> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> >> [   78.663066] nuvoton-cir 00:07: disabled
> >> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> >> [   78.786651] ACPI: Preparing to enter system sleep state S3
> >> [   78.789613] PM: Saving platform NVS memory
> >> [   78.789759] Disabling non-boot CPUs ...
> >> [   78.805154] NOHZ: local_softirq_pending 202
> >> [   78.805182] NOHZ: local_softirq_pending 202
> >> [   78.807102] smpboot: CPU 1 is now offline
> > 
> > I've tried to reproduce with suspend on disk but got unsuccessful.
> > 
> > A small question as I see someone is having a similar issue with a stable
> > release only. On which kernel did you trigger that: upstream or stable?
> > 
> > I'll continue investigating.
> > 
> > Thanks.
> > 
> Affected is recent linux-next, after the commit mentioned in the subject.
> I can work around the warning (not sure whether it's a proper fix),
> see here:
> https://lkml.org/lkml/2018/8/18/272

Can you try the one I posted in this thread:

 
https://lkml.kernel.org/r/alpine.deb.2.21.1808240851420.1...@nanos.tec.linutronix.de

Also below for reference.

Thanks,

tglx

8<
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5b33e2f5c0ed..6aab9d54a331 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -888,7 +888,7 @@ static bool can_stop_idle_tick(int cpu, struct tick_sched 
*ts)
if (unlikely(local_softirq_pending() && cpu_online(cpu))) {
static int ratelimit;
 
-   if (ratelimit < 10 &&
+   if (ratelimit < 10 && !in_softirq() &&
(local_softirq_pending() & SOFTIRQ_STOP_IDLE_MASK)) {
pr_warn("NOHZ: local_softirq_pending %02x\n",
(unsigned int) local_softirq_pending());


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Heiner Kallweit
On 24.08.2018 06:12, Frederic Weisbecker wrote:
> On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
>> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
>> I think it's related to mentioned commit (didn't bisect it yet).
>> See log from suspending.
>>
>> I have no reason to think the fix is wrong, it may just have revealed
>> another issue which existed before and was hidden by the bug.
>>
>> Rgds, Heiner
>>
>> [   75.073353] random: crng init done
>> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
>> [   78.619564] PM: suspend entry (deep)
>> [   78.619675] PM: Syncing filesystems ... done.
>> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
>> done.
>> [   78.656094] OOM killer disabled.
>> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
>> seconds) done.
>> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
>> [   78.663066] nuvoton-cir 00:07: disabled
>> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
>> [   78.786651] ACPI: Preparing to enter system sleep state S3
>> [   78.789613] PM: Saving platform NVS memory
>> [   78.789759] Disabling non-boot CPUs ...
>> [   78.805154] NOHZ: local_softirq_pending 202
>> [   78.805182] NOHZ: local_softirq_pending 202
>> [   78.807102] smpboot: CPU 1 is now offline
> 
> I've tried to reproduce with suspend on disk but got unsuccessful.
> 
> A small question as I see someone is having a similar issue with a stable
> release only. On which kernel did you trigger that: upstream or stable?
> 
> I'll continue investigating.
> 
> Thanks.
> 
Affected is recent linux-next, after the commit mentioned in the subject.
I can work around the warning (not sure whether it's a proper fix),
see here:
https://lkml.org/lkml/2018/8/18/272

Rgds, Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-24 Thread Heiner Kallweit
On 24.08.2018 06:12, Frederic Weisbecker wrote:
> On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
>> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
>> I think it's related to mentioned commit (didn't bisect it yet).
>> See log from suspending.
>>
>> I have no reason to think the fix is wrong, it may just have revealed
>> another issue which existed before and was hidden by the bug.
>>
>> Rgds, Heiner
>>
>> [   75.073353] random: crng init done
>> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
>> [   78.619564] PM: suspend entry (deep)
>> [   78.619675] PM: Syncing filesystems ... done.
>> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
>> done.
>> [   78.656094] OOM killer disabled.
>> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
>> seconds) done.
>> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
>> [   78.663066] nuvoton-cir 00:07: disabled
>> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
>> [   78.786651] ACPI: Preparing to enter system sleep state S3
>> [   78.789613] PM: Saving platform NVS memory
>> [   78.789759] Disabling non-boot CPUs ...
>> [   78.805154] NOHZ: local_softirq_pending 202
>> [   78.805182] NOHZ: local_softirq_pending 202
>> [   78.807102] smpboot: CPU 1 is now offline
> 
> I've tried to reproduce with suspend on disk but got unsuccessful.
> 
> A small question as I see someone is having a similar issue with a stable
> release only. On which kernel did you trigger that: upstream or stable?
> 
> I'll continue investigating.
> 
> Thanks.
> 
Affected is recent linux-next, after the commit mentioned in the subject.
I can work around the warning (not sure whether it's a proper fix),
see here:
https://lkml.org/lkml/2018/8/18/272

Rgds, Heiner


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-23 Thread Frederic Weisbecker
On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
> 
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.
> 
> Rgds, Heiner
> 
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
> done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline

I've tried to reproduce with suspend on disk but got unsuccessful.

A small question as I see someone is having a similar issue with a stable
release only. On which kernel did you trigger that: upstream or stable?

I'll continue investigating.

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-23 Thread Frederic Weisbecker
On Thu, Aug 16, 2018 at 08:13:03AM +0200, Heiner Kallweit wrote:
> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
> 
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.
> 
> Rgds, Heiner
> 
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
> done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline

I've tried to reproduce with suspend on disk but got unsuccessful.

A small question as I see someone is having a similar issue with a stable
release only. On which kernel did you trigger that: upstream or stable?

I'll continue investigating.

Thanks.


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-18 Thread Heiner Kallweit
On 18.08.2018 13:26, Thomas Gleixner wrote:
> On Thu, 16 Aug 2018, Heiner Kallweit wrote:
> 
>> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
>> I think it's related to mentioned commit (didn't bisect it yet).
>> See log from suspending.
>>
>> I have no reason to think the fix is wrong, it may just have revealed
>> another issue which existed before and was hidden by the bug.
> 
> Looks so. That seems to be related to CPU offlining. No idea yet...
> 
I checked a little further and at the time the warning is printed the
cpu is still online but not active any longer.
I can avoid the warning with the following change, but as a
disclaimer: I have no clue of this subsystem and don't know what
I'm doing ..

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5b33e2f5c..19a030e40 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -862,13 +862,13 @@ static void tick_nohz_full_update_tick(struct tick_sched 
*ts)
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
/*
-* If this CPU is offline and it is the one which updates
+* If this CPU is inactive and it is the one which updates
 * jiffies, then give up the assignment and let it be taken by
 * the CPU which runs the tick timer next. If we don't drop
 * this here the jiffies might be stale and do_timer() never
 * invoked.
 */
-   if (unlikely(!cpu_online(cpu))) {
+   if (unlikely(!cpu_active(cpu))) {
if (cpu == tick_do_timer_cpu)
tick_do_timer_cpu = TICK_DO_TIMER_NONE;
/*
-- 



>> Rgds, Heiner
>>
>> [   75.073353] random: crng init done
>> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
>> [   78.619564] PM: suspend entry (deep)
>> [   78.619675] PM: Syncing filesystems ... done.
>> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
>> done.
>> [   78.656094] OOM killer disabled.
>> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
>> seconds) done.
>> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
>> [   78.663066] nuvoton-cir 00:07: disabled
>> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
>> [   78.786651] ACPI: Preparing to enter system sleep state S3
>> [   78.789613] PM: Saving platform NVS memory
>> [   78.789759] Disabling non-boot CPUs ...
>> [   78.805154] NOHZ: local_softirq_pending 202
>> [   78.805182] NOHZ: local_softirq_pending 202
>> [   78.807102] smpboot: CPU 1 is now offline
>>
> 



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-18 Thread Heiner Kallweit
On 18.08.2018 13:26, Thomas Gleixner wrote:
> On Thu, 16 Aug 2018, Heiner Kallweit wrote:
> 
>> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
>> I think it's related to mentioned commit (didn't bisect it yet).
>> See log from suspending.
>>
>> I have no reason to think the fix is wrong, it may just have revealed
>> another issue which existed before and was hidden by the bug.
> 
> Looks so. That seems to be related to CPU offlining. No idea yet...
> 
I checked a little further and at the time the warning is printed the
cpu is still online but not active any longer.
I can avoid the warning with the following change, but as a
disclaimer: I have no clue of this subsystem and don't know what
I'm doing ..

diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5b33e2f5c..19a030e40 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -862,13 +862,13 @@ static void tick_nohz_full_update_tick(struct tick_sched 
*ts)
 static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
 {
/*
-* If this CPU is offline and it is the one which updates
+* If this CPU is inactive and it is the one which updates
 * jiffies, then give up the assignment and let it be taken by
 * the CPU which runs the tick timer next. If we don't drop
 * this here the jiffies might be stale and do_timer() never
 * invoked.
 */
-   if (unlikely(!cpu_online(cpu))) {
+   if (unlikely(!cpu_active(cpu))) {
if (cpu == tick_do_timer_cpu)
tick_do_timer_cpu = TICK_DO_TIMER_NONE;
/*
-- 



>> Rgds, Heiner
>>
>> [   75.073353] random: crng init done
>> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
>> [   78.619564] PM: suspend entry (deep)
>> [   78.619675] PM: Syncing filesystems ... done.
>> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) 
>> done.
>> [   78.656094] OOM killer disabled.
>> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 
>> seconds) done.
>> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
>> [   78.663066] nuvoton-cir 00:07: disabled
>> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
>> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
>> [   78.786651] ACPI: Preparing to enter system sleep state S3
>> [   78.789613] PM: Saving platform NVS memory
>> [   78.789759] Disabling non-boot CPUs ...
>> [   78.805154] NOHZ: local_softirq_pending 202
>> [   78.805182] NOHZ: local_softirq_pending 202
>> [   78.807102] smpboot: CPU 1 is now offline
>>
> 



Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-18 Thread Thomas Gleixner
On Thu, 16 Aug 2018, Heiner Kallweit wrote:

> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
> 
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.

Looks so. That seems to be related to CPU offlining. No idea yet...

> Rgds, Heiner
> 
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
> done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline
> 


Re: Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-18 Thread Thomas Gleixner
On Thu, 16 Aug 2018, Heiner Kallweit wrote:

> Recently I started to get warning "NOHZ: local_softirq_pending 202" and
> I think it's related to mentioned commit (didn't bisect it yet).
> See log from suspending.
> 
> I have no reason to think the fix is wrong, it may just have revealed
> another issue which existed before and was hidden by the bug.

Looks so. That seems to be related to CPU offlining. No idea yet...

> Rgds, Heiner
> 
> [   75.073353] random: crng init done
> [   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
> [   78.619564] PM: suspend entry (deep)
> [   78.619675] PM: Syncing filesystems ... done.
> [   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
> [   78.656094] OOM killer disabled.
> [   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
> done.
> [   78.658177] Suspending console(s) (use no_console_suspend to debug)
> [   78.663066] nuvoton-cir 00:07: disabled
> [   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
> [   78.672210] sd 0:0:0:0: [sda] Stopping disk
> [   78.786651] ACPI: Preparing to enter system sleep state S3
> [   78.789613] PM: Saving platform NVS memory
> [   78.789759] Disabling non-boot CPUs ...
> [   78.805154] NOHZ: local_softirq_pending 202
> [   78.805182] NOHZ: local_softirq_pending 202
> [   78.807102] smpboot: CPU 1 is now offline
> 


Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-16 Thread Heiner Kallweit
Recently I started to get warning "NOHZ: local_softirq_pending 202" and
I think it's related to mentioned commit (didn't bisect it yet).
See log from suspending.

I have no reason to think the fix is wrong, it may just have revealed
another issue which existed before and was hidden by the bug.

Rgds, Heiner

[   75.073353] random: crng init done
[   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
[   78.619564] PM: suspend entry (deep)
[   78.619675] PM: Syncing filesystems ... done.
[   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   78.656094] OOM killer disabled.
[   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
done.
[   78.658177] Suspending console(s) (use no_console_suspend to debug)
[   78.663066] nuvoton-cir 00:07: disabled
[   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   78.672210] sd 0:0:0:0: [sda] Stopping disk
[   78.786651] ACPI: Preparing to enter system sleep state S3
[   78.789613] PM: Saving platform NVS memory
[   78.789759] Disabling non-boot CPUs ...
[   78.805154] NOHZ: local_softirq_pending 202
[   78.805182] NOHZ: local_softirq_pending 202
[   78.807102] smpboot: CPU 1 is now offline


Fix 80d20d35af1e ("nohz: Fix local_timer_softirq_pending()") may have revealed another problem

2018-08-16 Thread Heiner Kallweit
Recently I started to get warning "NOHZ: local_softirq_pending 202" and
I think it's related to mentioned commit (didn't bisect it yet).
See log from suspending.

I have no reason to think the fix is wrong, it may just have revealed
another issue which existed before and was hidden by the bug.

Rgds, Heiner

[   75.073353] random: crng init done
[   75.073402] random: 7 urandom warning(s) missed due to ratelimiting
[   78.619564] PM: suspend entry (deep)
[   78.619675] PM: Syncing filesystems ... done.
[   78.653684] Freezing user space processes ... (elapsed 0.002 seconds) done.
[   78.656094] OOM killer disabled.
[   78.656113] Freezing remaining freezable tasks ... (elapsed 0.001 seconds) 
done.
[   78.658177] Suspending console(s) (use no_console_suspend to debug)
[   78.663066] nuvoton-cir 00:07: disabled
[   78.671817] sd 0:0:0:0: [sda] Synchronizing SCSI cache
[   78.672210] sd 0:0:0:0: [sda] Stopping disk
[   78.786651] ACPI: Preparing to enter system sleep state S3
[   78.789613] PM: Saving platform NVS memory
[   78.789759] Disabling non-boot CPUs ...
[   78.805154] NOHZ: local_softirq_pending 202
[   78.805182] NOHZ: local_softirq_pending 202
[   78.807102] smpboot: CPU 1 is now offline