Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-10-01 Thread Thorsten Leemhuis
On 01.10.2017 15:06, Yanko Kaneti wrote:
> On Sun, 2017-10-01 at 14:46 +0200, Thorsten Leemhuis wrote:
>> Hi, the regression tracker here. What's the status of this issue? Was
>> the problem fixed? It seems nothing happened for more than 10 days -- or
>> did the discussion move somewhere else? Ciao, Thorsten
> The commit was reverted last week before rc2
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0551968add53777fddd18f4ffb4e3bbc1f646d79

I could have sworn I checked that :-/ Thx for the hint and sorry for the
noise! Ciao, Thorsten

>> On 20.09.2017 02:30, Chuck Ebbert wrote:
>>> On Tue, 19 Sep 2017 16:51:06 +0100
>>> Marc Zyngier  wrote:
>>>
>>>> On 19/09/17 16:40, Yanko Kaneti wrote:
>>>>> On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
>>>>>> On 19/09/17 16:12, Yanko Kaneti wrote:
>>>>>>> Hello,
>>>>>>>
>>>>>>> Fedora rawhide config here.
>>>>>>> AMD FX-8370E
>>>>>>>
>>>>>>> Bisected a problem to:
>>>>>>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts
>>>>>>> actually using it)
>>>>>>>
>>>>>>> It seems to be causing stalls, short lived or long lived lockups
>>>>>>> very shortly after boot. Everything becomes jerky.
>>>>>>>
>>>>>>> The only visible in the log indication is something like :
>>>>>>>
>>>>>>> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking
>>>>>>> clocksource 'tsc' as unstable because the skew is too large:
>>>>>>> [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa
>>>>>>> wd_last: 329956f8 mask:
>>>>>>> [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f
>>>>>>> cs_last: 41dfc91650 mask:
>>>>>>> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
>>>>>>> [   59.802158] TSC found unstable after boot, most likely due to broken
>>>>>>> BIOS. Use 'tsc=unstable'.
>>>>>>> [   59.802161] sched_clock: Marking unstable (59802142067,
>>>>>>> 15510)<-(59920871789, -118714277)
>>>>>>> [   60.015604] clocksource: Switched to clocksource hpet
>>>>>>> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to
>>>>>>> run: 209.660 msecs
>>>>>>> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering
>>>>>>> kernel.perf_event_max_sample_rate to 1000
>>>>>>>
>>>>>>> Just reverting that commit on top of linus mainline cures all the
>>>>>>> symptoms
>>>>>>
>>>>>> Interesting. Do you still get HPET interrupts?
>>>>>
>>>>> Sorry, I might need some basic help here (i.e where do I count
>>>>> them...)
>>>>
>>>> /proc/interrupts should display them.
>>>>
>>>>> After the watchdog switches the clocksource to hpet the system is
>>>>> still somewhat alive, so I'll guess some clock is still
>>>>> ticking
>>>>
>>>> Probably, but I suspect they're not hitting the right CPU, hence the
>>>> lockups.
>>>>
>>>> Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
>>>> the net for a few days.
>>>>
>>>> Thomas, any insight?
>>>
>>> Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
>>> can be correct:
>>>
>>>     struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
>>>     unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
>>>
>>>     if (!cpu_mask)
>>>         return -EINVAL;
>>>     *apicid = (unsigned int)cpu_mask;
>>>     cpumask_bits(effmsk)[0] = cpu_mask;
>>>
>>> Before that patch, this function wrote to the effective mask
>>> unconditionally. After, it only writes to effective_mask if it is
>>> already non-zero.
>>>
>>>
>>> http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
>>>  
>>> http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
>>>
> 


Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-10-01 Thread Yanko Kaneti
On Sun, 2017-10-01 at 14:46 +0200, Thorsten Leemhuis wrote:
> Hi, the regression tracker here. What's the status of this issue? Was
> the problem fixed? It seems nothing happened for more than 10 days -- or
> did the discussion move somewhere else? Ciao, Thorsten

The commit was reverted last week before rc2

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0551968add53777fddd18f4ffb4e3bbc1f646d79

Thanks for tracking it
-Yanko

> 
> On 20.09.2017 02:30, Chuck Ebbert wrote:
> > On Tue, 19 Sep 2017 16:51:06 +0100
> > Marc Zyngier  wrote:
> > 
> > > On 19/09/17 16:40, Yanko Kaneti wrote:
> > > > On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:  
> > > > > On 19/09/17 16:12, Yanko Kaneti wrote:  
> > > > > > Hello, 
> > > > > > 
> > > > > > Fedora rawhide config here. 
> > > > > > AMD FX-8370E
> > > > > > 
> > > > > > Bisected a problem to:
> > > > > > 74def747bcd0 (genirq: Restrict effective affinity to interrupts
> > > > > > actually using it) 
> > > > > > 
> > > > > > It seems to be causing stalls, short lived or long lived lockups
> > > > > > very shortly after boot. Everything becomes jerky.
> > > > > > 
> > > > > > The only visible in the log indication is something like :
> > > > > > 
> > > > > > [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking
> > > > > > clocksource 'tsc' as unstable because the skew is too large:
> > > > > > [   59.802134] clocksource:   'hpet' wd_now:
> > > > > > 3326e7aa wd_last: 329956f8 mask:  [   59.802137]
> > > > > > clocksource:   'tsc' cs_now: 423662bc6f
> > > > > > cs_last: 41dfc91650 mask:  [   59.802140] tsc:
> > > > > > Marking TSC unstable due to clocksource watchdog [   59.802158]
> > > > > > TSC found unstable after boot, most likely due to broken BIOS.
> > > > > > Use 'tsc=unstable'. [   59.802161] sched_clock: Marking unstable
> > > > > > (59802142067, 15510)<-(59920871789, -118714277) [   60.015604]
> > > > > > clocksource: Switched to clocksource hpet [   89.015994] INFO:
> > > > > > NMI handler (perf_event_nmi_handler) took too long to run:
> > > > > > 209.660 msecs [   89.016003] perf: interrupt took too long
> > > > > > (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to
> > > > > > 1000 
> > > > > > 
> > > > > > Just reverting that commit on top of linus mainline cures all the
> > > > > > symptoms  
> > > > > 
> > > > > Interesting. Do you still get HPET interrupts?  
> > > > 
> > > > Sorry, I might need some basic help here (i.e where do I count
> > > > them...)  
> > > 
> > > /proc/interrupts should display them.
> > > 
> > > > After the watchdog switches the clocksource to hpet the system is
> > > > still somewhat alive, so I'll guess some clock is still
> > > > ticking  
> > > 
> > > Probably, but I suspect they're not hitting the right CPU, hence the
> > > lockups.
> > > 
> > > Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
> > > the net for a few days.
> > > 
> > > Thomas, any insight?
> > 
> > Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
> > can be correct:
> > 
> >     struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
> >     unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
> >
> >     if (!cpu_mask)
> >         return -EINVAL;
> >     *apicid = (unsigned int)cpu_mask;
> >     cpumask_bits(effmsk)[0] = cpu_mask;
> > 
> > Before that patch, this function wrote to the effective mask
> > unconditionally. After, it only writes to effective_mask if it is
> > already non-zero.
> > 
> > 
> > http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
> >  
> > http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
> > 


Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-10-01 Thread Thorsten Leemhuis
Hi, the regression tracker here. What's the status of this issue? Was
the problem fixed? It seems nothing happened for more than 10 days -- or
did the discussion move somewhere else? Ciao, Thorsten

On 20.09.2017 02:30, Chuck Ebbert wrote:
> On Tue, 19 Sep 2017 16:51:06 +0100
> Marc Zyngier  wrote:
> 
>> On 19/09/17 16:40, Yanko Kaneti wrote:
>>> On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:  
>>>> On 19/09/17 16:12, Yanko Kaneti wrote:
>>>>> Hello,
>>>>>
>>>>> Fedora rawhide config here.
>>>>> AMD FX-8370E
>>>>>
>>>>> Bisected a problem to:
>>>>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts
>>>>> actually using it)
>>>>>
>>>>> It seems to be causing stalls, short lived or long lived lockups
>>>>> very shortly after boot. Everything becomes jerky.
>>>>>
>>>>> The only visible in the log indication is something like :
>>>>>
>>>>> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking
>>>>> clocksource 'tsc' as unstable because the skew is too large:
>>>>> [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa
>>>>> wd_last: 329956f8 mask:
>>>>> [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f
>>>>> cs_last: 41dfc91650 mask:
>>>>> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
>>>>> [   59.802158] TSC found unstable after boot, most likely due to broken
>>>>> BIOS. Use 'tsc=unstable'.
>>>>> [   59.802161] sched_clock: Marking unstable (59802142067,
>>>>> 15510)<-(59920871789, -118714277)
>>>>> [   60.015604] clocksource: Switched to clocksource hpet
>>>>> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to
>>>>> run: 209.660 msecs
>>>>> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering
>>>>> kernel.perf_event_max_sample_rate to 1000
>>>>>
>>>>> Just reverting that commit on top of linus mainline cures all the
>>>>> symptoms
>>>>
>>>> Interesting. Do you still get HPET interrupts?
>>>
>>> Sorry, I might need some basic help here (i.e where do I count
>>> them...)  
>>
>> /proc/interrupts should display them.
>>
>>> After the watchdog switches the clocksource to hpet the system is
>>> still somewhat alive, so I'll guess some clock is still
>>> ticking  
>> Probably, but I suspect they're not hitting the right CPU, hence the
>> lockups.
>>
>> Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
>> the net for a few days.
>>
>> Thomas, any insight?
> 
> Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
> can be correct:
> 
>     struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
>     unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
>
>     if (!cpu_mask)
>         return -EINVAL;
>     *apicid = (unsigned int)cpu_mask;
>     cpumask_bits(effmsk)[0] = cpu_mask;
> 
> Before that patch, this function wrote to the effective mask
> unconditionally. After, it only writes to effective_mask if it is
> already non-zero.
> 
> 
> http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
>  
> http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
> 


Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-09-19 Thread Chuck Ebbert
On Tue, 19 Sep 2017 16:51:06 +0100
Marc Zyngier  wrote:

> On 19/09/17 16:40, Yanko Kaneti wrote:
> > On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:  
> >> On 19/09/17 16:12, Yanko Kaneti wrote:  
> >>> Hello, 
> >>>
> >>> Fedora rawhide config here. 
> >>> AMD FX-8370E
> >>>
> >>> Bisected a problem to:
> >>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts
> >>> actually using it) 
> >>>
> >>> It seems to be causing stalls, short lived or long lived lockups
> >>> very shortly after boot. Everything becomes jerky.
> >>>
> >>> The only visible in the log indication is something like :
> >>> 
> >>> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking
> >>> clocksource 'tsc' as unstable because the skew is too large:
> >>> [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa
> >>> wd_last: 329956f8 mask:
> >>> [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f
> >>> cs_last: 41dfc91650 mask:
> >>> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
> >>> [   59.802158] TSC found unstable after boot, most likely due to broken
> >>> BIOS. Use 'tsc=unstable'.
> >>> [   59.802161] sched_clock: Marking unstable (59802142067,
> >>> 15510)<-(59920871789, -118714277)
> >>> [   60.015604] clocksource: Switched to clocksource hpet
> >>> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to
> >>> run: 209.660 msecs
> >>> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering
> >>> kernel.perf_event_max_sample_rate to 1000
> >>>
> >>> Just reverting that commit on top of linus mainline cures all the
> >>> symptoms  
> >>
> >> Interesting. Do you still get HPET interrupts?  
> > 
> > Sorry, I might need some basic help here (i.e where do I count
> > them...)  
> 
> /proc/interrupts should display them.
> 
> > After the watchdog switches the clocksource to hpet the system is
> > still somewhat alive, so I'll guess some clock is still
> > ticking  
> Probably, but I suspect they're not hitting the right CPU, hence the
> lockups.
> 
> Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
> the net for a few days.
> 
> Thomas, any insight?

Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
can be correct:

    struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
    unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;

    if (!cpu_mask)
        return -EINVAL;
    *apicid = (unsigned int)cpu_mask;
    cpumask_bits(effmsk)[0] = cpu_mask;

Before that patch, this function wrote to the effective mask
unconditionally. After it, the function only writes to the effective
mask if that mask is already non-zero.
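
A userspace sketch of that difference, for illustration only. It assumes
that the patched getter falls back to the plain affinity mask when the
effective mask is empty; that fallback is an assumption and is not stated
in this thread. All names below (irq_common_model, effmask_before,
effmask_after, flat_to_apicid_model, MODEL_APIC_ALL_CPUS) are hypothetical
stand-ins, and a single unsigned long stands in for struct cpumask:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for APIC_ALL_CPUS (assumed to cover 8 CPUs). */
#define MODEL_APIC_ALL_CPUS 0xFFUL

/* One unsigned long models a cpumask; this is not the kernel structure. */
struct irq_common_model {
    unsigned long affinity;           /* requested affinity */
    unsigned long effective_affinity; /* where the IRQ is actually routed */
};

/* Getter as it behaved before the patch: always the effective mask. */
static unsigned long *effmask_before(struct irq_common_model *c)
{
    return &c->effective_affinity;
}

/*
 * Getter after the patch. ASSUMPTION: an empty effective mask makes the
 * getter hand back the plain affinity mask instead.
 */
static unsigned long *effmask_after(struct irq_common_model *c)
{
    if (c->effective_affinity != 0)
        return &c->effective_affinity;
    return &c->affinity;
}

/* Models the quoted body of flat_cpu_mask_to_apicid(); returns 0 or -1. */
static int flat_to_apicid_model(unsigned long *effmsk, unsigned long mask,
                                unsigned int *apicid)
{
    unsigned long cpu_mask = mask & MODEL_APIC_ALL_CPUS;

    assert(effmsk != NULL && apicid != NULL);
    if (!cpu_mask)
        return -1;
    *apicid = (unsigned int)cpu_mask;
    *effmsk = cpu_mask; /* lands wherever the getter pointed */
    return 0;
}
```

Starting from an empty effective mask, passing effmask_after() to
flat_to_apicid_model() leaves effective_affinity at zero and overwrites
affinity instead, while effmask_before() updates effective_affinity.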



Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-09-19 Thread Marc Zyngier
On 19/09/17 16:40, Yanko Kaneti wrote:
> On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
>> On 19/09/17 16:12, Yanko Kaneti wrote:
>>> Hello, 
>>>
>>> Fedora rawhide config here. 
>>> AMD FX-8370E
>>>
>>> Bisected a problem to:
>>> 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually 
>>> using it) 
>>>
>>> It seems to be causing stalls, short lived or long lived lockups very 
>>> shortly after boot. 
>>> Everything becomes jerky.
>>>
>>> The only visible in the log indication is something like :
>>> 
>>> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking 
>>> clocksource 'tsc' as unstable because the skew is too large:
>>> [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa 
>>> wd_last: 329956f8 mask: 
>>> [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f 
>>> cs_last: 41dfc91650 mask: 
>>> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
>>> [   59.802158] TSC found unstable after boot, most likely due to broken 
>>> BIOS. Use 'tsc=unstable'.
>>> [   59.802161] sched_clock: Marking unstable (59802142067, 
>>> 15510)<-(59920871789, -118714277)
>>> [   60.015604] clocksource: Switched to clocksource hpet
>>> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to 
>>> run: 209.660 msecs
>>> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering 
>>> kernel.perf_event_max_sample_rate to 1000
>>> 
>>>
>>> Just reverting that commit on top of linus mainline cures all the symptoms
>>
>> Interesting. Do you still get HPET interrupts?
> 
> Sorry, I might need some basic help here (i.e where do I count them...)

/proc/interrupts should display them.
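
A minimal sketch of one way to count them, assuming the relevant lines of
/proc/interrupts contain the substring "hpet" (the label varies by kernel
and platform, so treat the needle as a guess). The helper names
sum_irq_line and count_matching are made up for illustration:

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Sum the per-CPU counters on one /proc/interrupts line.
 * Returns -1 if the line has no "label:" prefix (e.g. the CPU header).
 */
static long long sum_irq_line(const char *line)
{
    const char *p;
    long long total = 0;

    assert(line != NULL);
    p = strchr(line, ':');
    if (!p)
        return -1;
    p++;
    for (;;) {
        char *end;
        unsigned long long n;

        while (*p == ' ' || *p == '\t')
            p++;
        if (!isdigit((unsigned char)*p))
            break; /* reached the chip name or the end of the line */
        n = strtoull(p, &end, 10);
        /* A counter is followed by whitespace; "2-edge" is not one. */
        if (*end != ' ' && *end != '\t' && *end != '\n' && *end != '\0')
            break;
        total += (long long)n;
        p = end;
    }
    return total;
}

/* Add up the counters of every line that mentions `needle`. */
static long long count_matching(FILE *f, const char *needle)
{
    char line[4096]; /* assumed wide enough for a small CPU count */
    long long total = 0;

    while (fgets(line, sizeof(line), f)) {
        long long n;

        if (!strstr(line, needle))
            continue;
        n = sum_irq_line(line);
        if (n > 0)
            total += n;
    }
    return total;
}
```

Calling count_matching() on fopen("/proc/interrupts", "r") twice, a few
seconds apart, and comparing the two totals shows whether the counters
still advance.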

> After the watchdog switches the clocksource to hpet the system is still
> somewhat alive, so I'll guess some clock is still ticking

Probably, but I suspect they're not hitting the right CPU, hence the
lockups.

Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
the net for a few days.

Thomas, any insight?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-09-19 Thread Yanko Kaneti
On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
> On 19/09/17 16:12, Yanko Kaneti wrote:
> > Hello, 
> > 
> > Fedora rawhide config here. 
> > AMD FX-8370E
> > 
> > Bisected a problem to:
> > 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually 
> > using it) 
> > 
> > It seems to be causing stalls, short lived or long lived lockups very 
> > shortly after boot. 
> > Everything becomes jerky.
> > 
> > The only visible in the log indication is something like :
> > 
> > [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking 
> > clocksource 'tsc' as unstable because the skew is too large:
> > [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa 
> > wd_last: 329956f8 mask: 
> > [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f 
> > cs_last: 41dfc91650 mask: 
> > [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
> > [   59.802158] TSC found unstable after boot, most likely due to broken 
> > BIOS. Use 'tsc=unstable'.
> > [   59.802161] sched_clock: Marking unstable (59802142067, 
> > 15510)<-(59920871789, -118714277)
> > [   60.015604] clocksource: Switched to clocksource hpet
> > [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to 
> > run: 209.660 msecs
> > [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering 
> > kernel.perf_event_max_sample_rate to 1000
> > 
> > 
> > Just reverting that commit on top of linus mainline cures all the symptoms
> 
> Interesting. Do you still get HPET interrupts?

Sorry, I might need some basic help here (i.e. where do I count them...)

After the watchdog switches the clocksource to hpet the system is still
somewhat alive, so I'd guess some clock is still ticking.

-Yanko


Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-09-19 Thread Marc Zyngier
On 19/09/17 16:12, Yanko Kaneti wrote:
> Hello, 
> 
> Fedora rawhide config here. 
> AMD FX-8370E
> 
> Bisected a problem to:
> 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually 
> using it) 
> 
> It seems to be causing stalls, short lived or long lived lockups very shortly 
> after boot. 
> Everything becomes jerky.
> 
> The only visible in the log indication is something like :
> 
> [   59.802129] clocksource: timekeeping watchdog on CPU3: Marking clocksource 
> 'tsc' as unstable because the skew is too large:
> [   59.802134] clocksource:   'hpet' wd_now: 3326e7aa 
> wd_last: 329956f8 mask: 
> [   59.802137] clocksource:   'tsc' cs_now: 423662bc6f 
> cs_last: 41dfc91650 mask: 
> [   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
> [   59.802158] TSC found unstable after boot, most likely due to broken BIOS. 
> Use 'tsc=unstable'.
> [   59.802161] sched_clock: Marking unstable (59802142067, 
> 15510)<-(59920871789, -118714277)
> [   60.015604] clocksource: Switched to clocksource hpet
> [   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to 
> run: 209.660 msecs
> [   89.016003] perf: interrupt took too long (1638003 > 2500), lowering 
> kernel.perf_event_max_sample_rate to 1000
> 
> 
> Just reverting that commit on top of linus mainline cures all the symptoms
Interesting. Do you still get HPET interrupts?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...


[regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)

2017-09-19 Thread Yanko Kaneti
Hello, 

Fedora rawhide config here. 
AMD FX-8370E

Bisected a problem to:
74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using 
it) 

It seems to be causing stalls and short-lived or long-lived lockups very
shortly after boot. Everything becomes jerky.

The only visible indication in the log is something like:

[   59.802129] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
[   59.802134] clocksource:   'hpet' wd_now: 3326e7aa wd_last: 329956f8 mask:
[   59.802137] clocksource:   'tsc' cs_now: 423662bc6f cs_last: 41dfc91650 mask:
[   59.802140] tsc: Marking TSC unstable due to clocksource watchdog
[   59.802158] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[   59.802161] sched_clock: Marking unstable (59802142067, 15510)<-(59920871789, -118714277)
[   60.015604] clocksource: Switched to clocksource hpet
[   89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 209.660 msecs
[   89.016003] perf: interrupt took too long (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to 1000


Just reverting that commit on top of Linus' mainline cures all the symptoms.


Regards
- Yanko
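
As a rough aid for reading the watchdog lines in the report, the sketch below subtracts the logged counter values (wd_now/wd_last for 'hpet', cs_now/cs_last for 'tsc') and converts each delta to seconds, which is the kind of comparison the clocksource watchdog makes between the two clocks. Several things here are assumptions and not taken from the thread: the mask fields are empty in the log as posted, so a 32-bit HPET counter and a 64-bit TSC are assumed; the HPET frequency of 14.31818 MHz and the TSC frequency of 3.3 GHz are placeholder values that would need to be replaced with the ones the kernel reports at boot on the affected machine.

```python
def counter_delta(now_hex, last_hex, mask_bits):
    """Wrap-safe difference between two counter readings given as hex strings."""
    mask = (1 << mask_bits) - 1
    return (int(now_hex, 16) - int(last_hex, 16)) & mask


# Assumed frequencies (hypothetical, check the boot log of the real machine).
HPET_HZ = 14_318_180
TSC_HZ = 3_300_000_000

# Counter values copied from the report; the widths are assumptions.
hpet_delta = counter_delta("3326e7aa", "329956f8", 32)
tsc_delta = counter_delta("423662bc6f", "41dfc91650", 64)

hpet_seconds = hpet_delta / HPET_HZ
tsc_seconds = tsc_delta / TSC_HZ

if __name__ == "__main__":
    print(f"hpet: {hpet_delta} cycles, {hpet_seconds:.3f} s (assumed frequency)")
    print(f"tsc:  {tsc_delta} cycles, {tsc_seconds:.3f} s (assumed frequency)")
    print(f"skew: {abs(hpet_seconds - tsc_seconds):.3f} s")
```

A large gap between the two intervals points at the watchdog check itself running late, which would fit delayed timer interrupts rather than a genuinely drifting TSC, though that reading depends entirely on the assumed frequencies.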

