Re: Regression in 4.8 - CPU speed set very low

2016-09-30 Thread Larry Finger

On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:

On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:

[...]


My laptop was inadvertently put to sleep while I was gone. I forgot
to leave a
note for my wife and she quieted the noisy cpu fan. :)

It looks like in 4.8-rc we made a change that caused the "high" trip
point to
be acted on.

This high trip point we don't expose in thermal subsystem (the thermal
 zone dump didn't show this anywhere as a trip). This is exposed by
core-dts driver only. This is the point BIOS is supposed to act, I
guess that's why you are seeing 50% clock modulation.


Are you running thermald?

What is the output of:
# ps -e | grep thermald




Srinivas, Rui, do you recall what that can be?

One more question (I think I asked it previously): In the failing
case
(4.8-rc1 and later), when the frequency drops down to the 400 MHz,
does it
ever go back higher or is it stuck at that level forever?

In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number.  We'll need to
collect some tracepoint data to debug this and some place to put them
into for easy reference.

Yes, this is a good idea.


To complete the record in this thread: the problem also happened with kernel
4.7, so it is not a regression in 4.8-rcX. The full discussion is at
https://bugzilla.kernel.org/show_bug.cgi?id=173361.


Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Rafael J. Wysocki
On Thursday, September 29, 2016 08:56:16 AM Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
> 
> [...]
> 
> > > My laptop was inadvertently put to sleep while I was gone. I forgot
> > > to leave a 
> > > note for my wife and she quieted the noisy cpu fan. :)
> > It looks like in 4.8-rc we made a change that caused the "high" trip
> > point to
> > be acted on.
> This high trip point we don't expose in thermal subsystem (the thermal
>  zone dump didn't show this anywhere as a trip). This is exposed by
> core-dts driver only. This is the point BIOS is supposed to act, I
> guess that's why you are seeing 50% clock modulation. 

Right.  That's SMM kicking in.

The real problem is that we get stuck at 400 MHz.

Thanks,
Rafael



Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Larry Finger

On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:

On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:

[...]


My laptop was inadvertently put to sleep while I was gone. I forgot
to leave a
note for my wife and she quieted the noisy cpu fan. :)

It looks like in 4.8-rc we made a change that caused the "high" trip
point to
be acted on.

This high trip point we don't expose in thermal subsystem (the thermal
 zone dump didn't show this anywhere as a trip). This is exposed by
core-dts driver only. This is the point BIOS is supposed to act, I
guess that's why you are seeing 50% clock modulation.


Are you running thermald?

What is the output of:
# ps -e | grep thermald


The output is blank. I am not running thermald.
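
A side note for anyone repeating this check: `ps -e | grep thermald` also matches any command whose name merely contains that string. Below is a minimal sketch of an exact-name test. The helper name `has_process` is hypothetical (it is not part of any tool mentioned in this thread), and it assumes the command name is the last column of `ps -e` style output.

```shell
# has_process NAME: succeed if a line of `ps -e` style input (read from
# stdin) has NAME as its last column. Hypothetical helper for illustration.
has_process() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Typical use:
#   ps -e | has_process thermald && echo "thermald is running"
```

An empty result from the original command and a non-zero exit status from this helper mean the same thing: no thermald process was listed.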

Larry



Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Srinivas Pandruvada
On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:

[...]

> > My laptop was inadvertently put to sleep while I was gone. I forgot
> > to leave a 
> > note for my wife and she quieted the noisy cpu fan. :)
> It looks like in 4.8-rc we made a change that caused the "high" trip
> point to
> be acted on.
This high trip point we don't expose in thermal subsystem (the thermal
 zone dump didn't show this anywhere as a trip). This is exposed by
core-dts driver only. This is the point BIOS is supposed to act, I
guess that's why you are seeing 50% clock modulation. 


Are you running thermald?

What is the output of:
# ps -e | grep thermald


> 
> Srinivas, Rui, do you recall what that can be?
> 
> One more question (I think I asked it previously): In the failing
> case
> (4.8-rc1 and later), when the frequency drops down to the 400 MHz,
> does it
> ever go back higher or is it stuck at that level forever?
> 
> In any case, it may help to file a bug at bugzilla.kernel.org against
> CPU/thermal or similar and let me know the bug number.  We'll need to
> collect some tracepoint data to debug this and some place to put them
> into for easy reference.
Yes, this is a good idea.

Thanks,
Srinivas



Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Larry Finger

On 09/29/2016 07:19 AM, Rafael J. Wysocki wrote:

On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:

On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:

On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:

On 09/26/2016 10:12 PM, Doug Smythies wrote:


On 2016.09.26 18:31 Srinivas Pandruvada wrote:


On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:


On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:


On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
But for both we need a reproducer anyway.


I do not have a reliable reproducer. The condition has always
happened when
running a high-compute job such as a 'make -j8' on the kernel, or
building the
RPM for openSUSE's implementation of VirtualBox. The latter is what
I'm using
for most of my testing.



Run some CPU stressor and get all your CPU's going at 100% load.
And watch your core temperatures while you do so.



for i in 1 2 3 4; do while : ; do : ; done & done

triggered the fault in a few minutes.






It also would be good to rule out the thermal throttling (as per
the Srinivas' comments).



It is almost certainly thermal throttling, or something similar causing
clock modulation of, it seems, 50%.



While the infinite loops were running, the temps were:

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +74.0°C  (high = +84.0°C, crit = +100.0°C)


It looks like the trip point (high) temperature was exceeded causing
thermal throttling to kick in.


After the fault occurs, I get

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +43.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +41.0°C  (high = +84.0°C, crit = +100.0°C)


So after that it stays at 400 MHz forever, right?



For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq


80


Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.

What distro are you using?



And what make and model of LapTop?



Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
2.90GHz. That is a dual-core unit with hyperthreading.

@Rafael: As I write this, the system has been running the infinite loop test
for almost 5 hours with kernel 4.7. I will leave that running while I'm
gone, but I am certain that it is OK.


OK, and what temperatures do you see while doing this?


finger@linux-1t8h:~/linux-2.6> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +78.0°C  (high = +84.0°C, crit = +100.0°C)

Once again, the CPU temp is greater than the "high" value; however, the clock
rate continues to hold near 3600 MHz.

My laptop was inadvertently put to sleep while I was gone. I forgot to leave a
note for my wife and she quieted the noisy cpu fan. :)


It looks like in 4.8-rc we made a change that caused the "high" trip point to
be acted on.

Srinivas, Rui, do you recall what that can be?

One more question (I think I asked it previously): In the failing case
(4.8-rc1 and later), when the frequency drops down to the 400 MHz, does it
ever go back higher or is it stuck at that level forever?

In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number.  We'll need to
collect some tracepoint data to debug this and some place to put them
into for easy reference.


Sorry if I missed that earlier question. The CPU is stuck at that lower 
frequency until I reboot.


Bug report at https://bugzilla.kernel.org/show_bug.cgi?id=173361. I tried to 
cover the main points of the discussion. Please add the ones that I missed.


Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Lennart Sorensen
On Wed, Sep 28, 2016 at 09:26:42PM -0500, Larry Finger wrote:
> By the time it gets slow, the CPU's cool, and one cannot see the temp just
> before that event happened.

Hmm, I would not expect the CPU to drop from 80 to 40 degrees in a few
seconds if the fan is not spinning.  I wouldn't even expect it if the
fan was spinning.  I would think at least 30 to 60 seconds if not more.

The only way I would think the temperature could change quickly would
be if the heatsink isn't even touching the CPU anymore so there is very
little material to hold the heat in the CPU.

> The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7.
> Of course, it could be something subtle that slightly changes the heat load,
> which causes the CPU temp to be a little higher so that the effect is
> triggered.
> 
> I am reasonably confident that it is not a hardware problem, but we may have
> to wait until 4.8 is released and gets wider usage. If no one else reports a
> problem, then I am certainly wrong.

Well, hard-to-reproduce bugs are always really annoying.

This old bug sounds a lot like what you are seeing:
https://bugzilla.redhat.com/show_bug.cgi?id=924570
and it links to this:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.6-Thermal-Updates

Apparently turning off turbo boost seems to stop the problem for a lot
of people in that case.  Doesn't explain why it started happening
recently.  And of course that may have been a different problem in
the past.

-- 
Len Sorensen


Re: Regression in 4.8 - CPU speed set very low

2016-09-29 Thread Rafael J. Wysocki
On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> > On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
> >> On 09/26/2016 10:12 PM, Doug Smythies wrote:
> >>>
> >>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> >>>>
> >>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> >>>>>
> >>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> >>>>>>
> >>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> >>>>>> But for both we need a reproducer anyway.
> >>>>>
> >>>>> I do not have a reliable reproducer. The condition has always
> >>>>> happened when
> >>>>> running a high-compute job such as a 'make -j8' on the kernel, or
> >>>>> building the
> >>>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
> >>>>> I'm using
> >>>>> for most of my testing.
> >>>
> >>>
> >>> Run some CPU stressor and get all your CPU's going at 100% load.
> >>> And watch your core temperatures while you do so.
> >>
> >>
> >> for i in 1 2 3 4; do while : ; do : ; done & done
> >>
> >> triggered the fault in a few minutes.
> >>>
> >>>
> 
> >>>>>> It also would be good to rule out the thermal throttling (as per
> >>>>>> the Srinivas' comments).
> >>>
> >>>
> >>> It is almost certainly thermal throttling, or something similar causing
> >>> clock modulation of, it seems, 50%.
> >>
> >>
> >> While the infinite loops were running, the temps were:
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-
> >> Adapter: ISA adapter
> >> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +83.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +74.0°C  (high = +84.0°C, crit = +100.0°C)
> >
> > It looks like the trip point (high) temperature was exceeded causing
> > thermal throttling to kick in.
> >
> >> After the fault occurs, I get
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-
> >> Adapter: ISA adapter
> >> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +43.0°C  (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +41.0°C  (high = +84.0°C, crit = +100.0°C)
> >
> > So after that it stays at 400 MHz forever, right?
> >
> >>
> >>>>>> For now, please tell me what's in
> >>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> >>>>>
> >>>>> 80
> >>>>
> >>>> Your effective freq is lower than 800MHz. One possible reason is
> >>>> thermal throttling.
> >>>>
> >>>> What distro are you using?
> >>>
> >>>
> >>> And what make and model of LapTop?
> >>
> >>
> >> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU 
> >> @
> >> 2.90GHz. That is a dual-core unit with hyperthreading.
> >>
> >> @Rafael: As I write this, the system has been running the infinite loop 
> >> test
> >> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> >> gone, but I am certain that it is OK.
> >
> > OK, and what temperatures do you see while doing this?
> 
> finger@linux-1t8h:~/linux-2.6> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0: +90.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1: +78.0°C  (high = +84.0°C, crit = +100.0°C)
> 
> Once again, the CPU temp is greater than the "high" value; however, the clock 
> rate continues to hold near 3600 MHz.
> 
> My laptop was inadvertently put to sleep while I was gone. I forgot to leave 
> a 
> note for my wife and she quieted the noisy cpu fan. :)

It looks like in 4.8-rc we made a change that caused the "high" trip point to
be acted on.

Srinivas, Rui, do you recall what that can be?

One more question (I think I asked it previously): In the failing case
(4.8-rc1 and later), when the frequency drops down to the 400 MHz, does it
ever go back higher or is it stuck at that level forever?

In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number.  We'll need to
collect some tracepoint data to debug this and some place to put them
into for easy reference.

Thanks,
Rafael



Re: Regression in 4.8 - CPU speed set very low

2016-09-28 Thread Larry Finger

On 09/27/2016 09:51 AM, Lennart Sorensen wrote:

On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:

Mostly I use a KDE applet named "System load" and look at the "average
clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
When the bug triggers, the system gets very slow, and the cpu fan stops even
though the cpu is still busy.

Commit f7816ad, which had run for 7 days without showing the bug, failed
after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
well, that's the way it goes!


Is it possible there is no bug and instead you have a hardware problem?

What I am thinking:

CPU fan stops, then CPU gets busy, CPU overheats, thermal throttling
kicks in to protect CPU and it gets VERY slow.

So maybe you have a bad CPU fan that is getting stuck.  Perhaps you have
a motherboard that varies the CPU fan speed depending on need, and the
fan doesn't like the lowest speed and sometimes gets stuck when asked
to go slow.

Of course if the CPU fan is the problem that could explain why it takes
varying amounts of time to see the problem.

I suggest checking what the cpu temperature sensors are showing next
time it gets slow.


By the time it gets slow, the CPU's cool, and one cannot see the temp just 
before that event happened.


The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7. Of 
course, it could be something subtle that slightly changes the heat load, which 
causes the CPU temp to be a little higher so that the effect is triggered.


I am reasonably confident that it is not a hardware problem, but we may have to 
wait until 4.8 is released and gets wider usage. If no one else reports a 
problem, then I am certainly wrong.


Larry






Re: Regression in 4.8 - CPU speed set very low

2016-09-28 Thread Larry Finger

On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:

On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:

On 09/26/2016 10:12 PM, Doug Smythies wrote:


On 2016.09.26 18:31 Srinivas Pandruvada wrote:


On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:


On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:


On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
But for both we need a reproducer anyway.


I do not have a reliable reproducer. The condition has always
happened when
running a high-compute job such as a 'make -j8' on the kernel, or
building the
RPM for openSUSE's implementation of VirtualBox. The latter is what
I'm using
for most of my testing.



Run some CPU stressor and get all your CPU's going at 100% load.
And watch your core temperatures while you do so.



for i in 1 2 3 4; do while : ; do : ; done & done

triggered the fault in a few minutes.






It also would be good to rule out the thermal throttling (as per
the Srinivas' comments).



It is almost certainly thermal throttling, or something similar causing
clock modulation of, it seems, 50%.



While the infinite loops were running, the temps were:

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +74.0°C  (high = +84.0°C, crit = +100.0°C)


It looks like the trip point (high) temperature was exceeded causing
thermal throttling to kick in.


After the fault occurs, I get

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +43.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +41.0°C  (high = +84.0°C, crit = +100.0°C)


So after that it stays at 400 MHz forever, right?



For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq


80


Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.

What distro are you using?



And what make and model of LapTop?



Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
2.90GHz. That is a dual-core unit with hyperthreading.

@Rafael: As I write this, the system has been running the infinite loop test
for almost 5 hours with kernel 4.7. I will leave that running while I'm
gone, but I am certain that it is OK.


OK, and what temperatures do you see while doing this?


finger@linux-1t8h:~/linux-2.6> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +90.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +78.0°C  (high = +84.0°C, crit = +100.0°C)

Once again, the CPU temp is greater than the "high" value; however, the clock 
rate continues to hold near 3600 MHz.
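
For anyone repeating this comparison, the check against the "high" value can be scripted. The following is only a sketch: `over_high` is a hypothetical helper name, and it assumes that `sensors` prints its readings in the same layout as the coretemp output above.

```shell
# over_high: read `sensors`-style text on stdin and print the label of every
# reading that exceeds its own "high" limit. Hypothetical helper; it assumes
# lines shaped like "Core 0: +90.0°C  (high = +84.0°C, crit = +100.0°C)".
over_high() {
    awk '/high = / {
        label = $0; sub(/:.*/, "", label)               # text before the first colon
        temp = $0;  sub(/^[^:]*:[ \t]*[+]?/, "", temp)  # strip label and sign
        high = $0;  sub(/.*high = [+]?/, "", high)      # strip up to the limit
        # awk uses the leading numeric part of each string in the comparison
        if (temp + 0 > high + 0) print label
    }'
}

# Typical use:
#   sensors | over_high
```

Fed the readings above, it would list "Physical id 0" and "Core 0" and leave out "Core 1".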


My laptop was inadvertently put to sleep while I was gone. I forgot to leave a 
note for my wife and she quieted the noisy cpu fan. :)


Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-27 Thread Lennart Sorensen
On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.
> 
> Commit f7816ad, which had run for 7 days without showing the bug, failed
> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
> well, that's the way it goes!

Is it possible there is no bug and instead you have a hardware problem?

What I am thinking:

CPU fan stops, then CPU gets busy, CPU overheats, thermal throttling
kicks in to protect CPU and it gets VERY slow.

So maybe you have a bad CPU fan that is getting stuck.  Perhaps you have
a motherboard that varies the CPU fan speed depending on need, and the
fan doesn't like the lowest speed and sometimes gets stuck when asked
to go slow.

Of course if the CPU fan is the problem that could explain why it takes
varying amounts of time to see the problem.

I suggest checking what the cpu temperature sensors are showing next
time it gets slow.

-- 
Len Sorensen


Re: Regression in 4.8 - CPU speed set very low

2016-09-27 Thread Rafael J. Wysocki
On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>
>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>
>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:

>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>
>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>> But for both we need a reproducer anyway.
>>>>
>>>> I do not have a reliable reproducer. The condition has always
>>>> happened when
>>>> running a high-compute job such as a 'make -j8' on the kernel, or
>>>> building the
>>>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>>>> I'm using
>>>> for most of my testing.
>>
>>
>> Run some CPU stressor and get all your CPU's going at 100% load.
>> And watch your core temperatures while you do so.
>
>
> for i in 1 2 3 4; do while : ; do : ; done & done
>
> triggered the fault in a few minutes.
>>
>>
>>>
>>>>> It also would be good to rule out the thermal throttling (as per
>>>>> the Srinivas' comments).
>>
>>
>> It is almost certainly thermal throttling, or something similar causing
>> clock modulation of, it seems, 50%.
>
>
> While the infinite loops were running, the temps were:
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0: +83.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1: +74.0°C  (high = +84.0°C, crit = +100.0°C)

It looks like the trip point (high) temperature was exceeded causing
thermal throttling to kick in.

> After the fault occurs, I get
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 0: +43.0°C  (high = +84.0°C, crit = +100.0°C)
> Core 1: +41.0°C  (high = +84.0°C, crit = +100.0°C)

So after that it stays at 400 MHz forever, right?

>>>>>
>>>>> For now, please tell me what's in
>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>
>>>> 80
>>>
>>> Your effective freq is lower than 800MHz. One possible reason is
>>> thermal throttling.
>>>
>>> What distro are you using?
>>
>>
>> And what make and model of LapTop?
>
>
> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
> 2.90GHz. That is a dual-core unit with hyperthreading.
>
> @Rafael: As I write this, the system has been running the infinite loop test
> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> gone, but I am certain that it is OK.

OK, and what temperatures do you see while doing this?

Thanks,
Rafael


Re: Regression in 4.8 - CPU speed set very low

2016-09-27 Thread Larry Finger

On 09/26/2016 10:12 PM, Doug Smythies wrote:

On 2016.09.26 18:31 Srinivas Pandruvada wrote:

On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:

On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:

On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
But for both we need a reproducer anyway.

I do not have a reliable reproducer. The condition has always
happened when
running a high-compute job such as a 'make -j8' on the kernel, or
building the
RPM for openSUSE's implementation of VirtualBox. The latter is what
I'm using
for most of my testing.


Run some CPU stressor and get all your CPU's going at 100% load.
And watch your core temperatures while you do so.


for i in 1 2 3 4; do while : ; do : ; done & done

triggered the fault in a few minutes.
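
To catch the moment the clock drops while busy loops like the one above are running, the "cpu MHz" values from /proc/cpuinfo can be logged at intervals. This is a minimal sketch, not something that was run in this thread; `avg_mhz` and `log_clock` are hypothetical helper names, and the optional file argument exists only so the parsing can be tried on a saved copy.

```shell
# avg_mhz [FILE]: print the average of all "cpu MHz" fields in a
# cpuinfo-style file (default: /proc/cpuinfo), rounded to whole MHz.
avg_mhz() {
    awk -F: '/^cpu MHz/ { sum += $2; n++ }
             END { if (n) printf "%.0f\n", sum / n; else print "n/a" }' \
        "${1:-/proc/cpuinfo}"
}

# log_clock COUNT INTERVAL [FILE]: print COUNT timestamped samples,
# sleeping INTERVAL seconds between them.
log_clock() {
    i=0
    while [ "$i" -lt "$1" ]; do
        printf '%s %s MHz\n' "$(date +%T)" "$(avg_mhz "$3")"
        i=$((i + 1))
        if [ "$i" -lt "$1" ]; then sleep "$2"; fi
    done
}

# Typical use while the stress loops run in the background:
#   log_clock 600 1 >> clock.log
```

Keeping the samples in a file means the last readings before the slowdown are still available afterwards.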





It also would be good to rule out the thermal throttling (as per
the Srinivas' comments).


It is almost certainly thermal throttling, or something similar causing
clock modulation of, it seems, 50%.


While the infinite loops were running, the temps were:

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +83.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +74.0°C  (high = +84.0°C, crit = +100.0°C)

After the fault occurs, I get

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0:  +44.0°C  (high = +84.0°C, crit = +100.0°C)
Core 0: +43.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1: +41.0°C  (high = +84.0°C, crit = +100.0°C)





For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

80

Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.

What distro are you using?


And what make and model of LapTop?


Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
2.90GHz". That is a dual-core unit with hyperthreading.


@Rafael: As I write this, the system has been running the infinite loop test for 
almost 5 hours with kernel 4.7. I will leave that running while I'm gone, but I 
am certain that it is OK.


Larry





RE: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Doug Smythies
On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote: 
>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>> But for both we need a reproducer anyway.
>> I do not have a reliable reproducer. The condition has always
>> happened when 
>> running a high-compute job such as a 'make -j8' on the kernel, or
>> building the 
>> RPM for openSUSE's implementation of VirtualBox. The latter is what
>> I'm using 
>> for most of my testing.

Run some CPU stressor and get all your CPU's going at 100% load.
And watch your core temperatures while you do so.

> 
>>> It also would be good to rule out the thermal throttling (as per
>>> the Srinivas' comments).

It is almost certainly thermal throttling, or something similar causing
clock modulation of, it seems, 50%.

>>> 
>>> For now, please tell me what's in
>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>> 80
> Your effective freq is lower than 800MHz. One possible reason is
> thermal throttling.
>
> What distro are you using?

And what make and model of LapTop?





Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 08:30 PM, Srinivas Pandruvada wrote:

On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:

On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:


On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger  wrote:


On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:



On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:




Maybe it's better to try diagnose the problem instead of
spending more
time on bisection.


In my original post, I asked for such help, but nothing until
today. I had
no idea what to check, but now I have a better idea.



I'd like to know whether or not 4.7 was definitely good,
though.


I never saw this problem with 4.7, but given the difficulty in
triggering
the problem, my tests may not have been definitive.






If it is one of them, it may be a while before I dare call
this one
"good".
In one respect, that is good as I will be traveling tomorrow
and
Wednesday.


What does "cat
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?


intel_pstate

You probably don't need to worry about all of the cpufreq changes
in
4.8-rc, then.  Only a few of them affect intel_pstate and I don't
see
how any of them may lead to the observed symptoms.

First off, if you have a reproducer, please run it on 4.7 and see
if
you can trigger the issue in there.

I'm running 4.8-rc7 at the moment hoping to trigger the problem and
get the data
requested by Srinivas. Once I get that, I will try 4.7 again.



Second, it would be good to have a look at the output from the
cpu_frequency and pstate_sample tracepoints around when the issue
triggers.  The pstate_sample one would be more interesting.

But for both we need a reproducer anyway.

I do not have a reliable reproducer. The condition has always
happened when
running a high-compute job such as a 'make -j8' on the kernel, or
building the
RPM for openSUSE's implementation of VirtualBox. The latter is what
I'm using
for most of my testing.



It also would be good to rule out the thermal throttling (as per
the
Srinivas' comments).

For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

80

Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.

What distro are you using?


openSUSE Leap 42.1.

Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Srinivas Pandruvada
On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> > 
> > On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> > > 
> > > On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
> > > > 
> > > > 
> > > > On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
> > > > 
> > > > 
> > > > 
> > > > Maybe it's better to try diagnose the problem instead of
> > > > spending more
> > > > time on bisection.
> > > 
> > > In my original post, I asked for such help, but nothing until
> > > today. I had
> > > no idea what to check, but now I have a better idea.
> > > 
> > > > 
> > > > I'd like to know whether or not 4.7 was definitely good,
> > > > though.
> > > 
> > > I never saw this problem with 4.7, but given the difficulty in
> > > triggering
> > > the problem, my tests may not have been definitive.
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > If it is one of them, it may be a while before I dare call
> > > > > this one
> > > > > "good".
> > > > > In one respect, that is good as I will be traveling tomorrow
> > > > > and
> > > > > Wednesday.
> > > > 
> > > > What does "cat
> > > > /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
> > > 
> > > intel_pstate
> > You probably don't need to worry about all of the cpufreq changes
> > in
> > 4.8-rc, then.  Only a few of them affect intel_pstate and I don't
> > see
> > how any of them may lead to the observed symptoms.
> > 
> > First off, if you have a reproducer, please run it on 4.7 and see
> > if
> > you can trigger the issue in there.
> I'm running 4.8-rc7 at the moment hoping to trigger the problem and
> get the data 
> requested by Srinivas. Once I get that, I will try 4.7 again.
> > 
> > 
> > Second, it would be good to have a look at the output from the
> > cpu_frequency and pstate_sample tracepoints around when the issue
> > triggers.  The pstate_sample one would be more interesting.
> > 
> > But for both we need a reproducer anyway.
> I do not have a reliable reproducer. The condition has always
> happened when 
> running a high-compute job such as a 'make -j8' on the kernel, or
> building the 
> RPM for openSUSE's implementation of VirtualBox. The latter is what
> I'm using 
> for most of my testing.
> 
> > 
> > It also would be good to rule out the thermal throttling (as per
> > the
> > Srinivas' comments).
> > 
> > For now, please tell me what's in
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> 80
Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.
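
The comparison behind that statement can be sketched as follows. cpufreq sysfs files such as cpuinfo_min_freq report kHz, while /proc/cpuinfo reports MHz, so one side has to be converted. `below_min` is a hypothetical helper name, and the figures in the usage comment are illustrative assumptions based on the 400 MHz and 800 MHz mentioned in this thread.

```shell
# below_min CUR_MHZ MIN_KHZ: succeed if the observed clock (in MHz) is lower
# than the advertised minimum (in kHz, the unit used by cpufreq sysfs files).
below_min() {
    awk -v cur="$1" -v min="$2" 'BEGIN { exit !(cur * 1000 < min) }'
}

# Typical use (values here are assumptions for illustration):
#   below_min 400 800000 && echo "running below the advertised minimum"
# On a live system the second argument could come from:
#   cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
```

A clock below the P-state minimum cannot come from normal frequency scaling, which is why throttling or clock modulation is suspected.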

What distro are you using?


Thanks,
Srinivas



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:

On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger  wrote:

On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:


On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:



Maybe it's better to try diagnose the problem instead of spending more
time on bisection.



In my original post, I asked for such help, but nothing until today. I had
no idea what to check, but now I have a better idea.


I'd like to know whether or not 4.7 was definitely good, though.



I never saw this problem with 4.7, but given the difficulty in triggering
the problem, my tests may not have been definitive.




If it is one of them, it may be a while before I dare call this one
"good".
In one respect, that is good as I will be traveling tomorrow and
Wednesday.



What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?



intel_pstate


You probably don't need to worry about all of the cpufreq changes in
4.8-rc, then.  Only a few of them affect intel_pstate and I don't see
how any of them may lead to the observed symptoms.

First off, if you have a reproducer, please run it on 4.7 and see if
you can trigger the issue in there.


I'm running 4.8-rc7 at the moment hoping to trigger the problem and get the data 
requested by Srinivas. Once I get that, I will try 4.7 again.


Second, it would be good to have a look at the output from the
cpu_frequency and pstate_sample tracepoints around when the issue
triggers.  The pstate_sample one would be more interesting.

But for both we need a reproducer anyway.


I do not have a reliable reproducer. The condition has always happened when 
running a high-compute job such as a 'make -j8' on the kernel, or building the 
RPM for openSUSE's implementation of VirtualBox. The latter is what I'm using 
for most of my testing.



It also would be good to rule out the thermal throttling (as per the
Srinivas' comments).

For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq


800000

Larry



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger  wrote:
> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>
>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
>
>  wrote:
>>
>>
>> Maybe it's better to try to diagnose the problem instead of spending more
>> time on bisection.
>
>
> In my original post, I asked for such help, but nothing until today. I had
> no idea what to check, but now I have a better idea.
>
>> I'd like to know whether or not 4.7 was definitely good, though.
>
>
> I never saw this problem with 4.7, but given the difficulty in triggering
> the problem, my tests may not have been definitive.
>>
>>
>>> If it is one of them, it may be a while before I dare call this one
>>> "good".
>>> In one respect, that is good as I will be traveling tomorrow and
>>> Wednesday.
>>
>>
>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>
>
> intel_pstate

You probably don't need to worry about all of the cpufreq changes in
4.8-rc, then.  Only a few of them affect intel_pstate and I don't see
how any of them may lead to the observed symptoms.

First off, if you have a reproducer, please run it on 4.7 and see if
you can trigger the issue in there.

Second, it would be good to have a look at the output from the
cpu_frequency and pstate_sample tracepoints around when the issue
triggers.  The pstate_sample one would be more interesting.

But for both we need a reproducer anyway.

It also would be good to rule out the thermal throttling (as per the
Srinivas' comments).

For now, please tell me what's in
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

Thanks,
Rafael
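
[Archive note] A minimal sketch of how the two tracepoints requested above
could be switched on, assuming tracefs is reachable at
/sys/kernel/debug/tracing and the script runs as root. Both events belong
to the "power" event group; the helper names are hypothetical and were not
part of the thread.

```python
from pathlib import Path

TRACING = Path("/sys/kernel/debug/tracing")  # assumed mount point, needs root
EVENTS = ("cpu_frequency", "pstate_sample")  # tracepoints named in the thread


def enable_files(tracing_root=TRACING, events=EVENTS):
    """Return the per-event 'enable' control files under events/power/."""
    return [Path(tracing_root) / "events" / "power" / name / "enable"
            for name in events]


def set_events(enabled, tracing_root=TRACING, events=EVENTS):
    """Write 1 (on) or 0 (off) to each event's enable file."""
    for path in enable_files(tracing_root, events):
        path.write_text("1" if enabled else "0")
```

After set_events(True), the samples can be read from the "trace" file in the
same directory once the slowdown has been observed.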


Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:

On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger

 wrote:


Maybe it's better to try to diagnose the problem instead of spending more
time on bisection.


In my original post, I asked for such help, but nothing until today. I had no 
idea what to check, but now I have a better idea.



I'd like to know whether or not 4.7 was definitely good, though.


I never saw this problem with 4.7, but given the difficulty in triggering the 
problem, my tests may not have been definitive.



If it is one of them, it may be a while before I dare call this one "good".
In one respect, that is good as I will be traveling tomorrow and Wednesday.


What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?


intel_pstate

Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger
 wrote:
> On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:
>>
>> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
>>  wrote:
>>>
>>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:


>>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[cut]

>>>
>>> Mostly I use a KDE applet named "System load" and look at the "average
>>> clock", but the same info is also available in /proc/cpuinfo as "cpu
>>> MHz".
>>> When the bug triggers, the system gets very slow, and the cpu fan stops
>>> even
>>> though the cpu is still busy.
>>
>>
>> That sounds like thermal throttling kicking in.
>
>
> I think it is because the cpu is idling. If thermal throttling is
> responsible, why would it not fail for 168 hours, and then fail in 2?
>
>> What's there under /sys/class/thermal/ on your system?
>
>
> It contains the following directories:
>
> cooling_device0  cooling_device1  cooling_device2  cooling_device3
> cooling_device4  thermal_zone0  thermal_zone1
>>
>>
>>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>>> after about 2 hours today. All my testing since Sept. 9 has been wasted.
>>> Oh
>>> well, that's the way it goes!
>>
>>
>> Are you confident that the issue was not reproducible before 4.8-rc2?
>> In particular, what about 4.8-rc1?
>
>
> 4.8-rc1 is definitely bad. I am now testing commit 5539204. In the bisect
> visualization, there are a number of cpufreq commits before the test case.

Maybe it's better to try to diagnose the problem instead of spending more
time on bisection.

I'd like to know whether or not 4.7 was definitely good, though.

> If it is one of them, it may be a while before I dare call this one "good".
> In one respect, that is good as I will be traveling tomorrow and Wednesday.

What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?

Thanks,
Rafael
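
[Archive note] The default governor in .config does not say which driver is
active at run time, which is why the scaling_driver question matters. The
sketch below reads the driver and governor for every policy directory; the
function name and the root parameter are hypothetical, added only so the
logic can be tried against a copy of the sysfs tree.

```python
from pathlib import Path


def cpufreq_summary(root="/sys/devices/system/cpu/cpufreq"):
    """Map each policy directory to its scaling_driver and scaling_governor.

    A missing attribute is reported as None instead of raising.
    """
    summary = {}
    for policy in sorted(Path(root).glob("policy*")):
        attrs = {}
        for name in ("scaling_driver", "scaling_governor"):
            attr_file = policy / name
            attrs[name] = attr_file.read_text().strip() if attr_file.is_file() else None
        summary[policy.name] = attrs
    return summary
```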


Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 04:46 PM, Srinivas Pandruvada wrote:

On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote:

On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
 wrote:


On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:



On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:



On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:



On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:



On 09/18/2016 09:54 PM, Larry Finger wrote:



On 09/14/2016 11:00 AM, Larry Finger wrote:



On 09/09/2016 12:39 PM, Larry Finger wrote:



I have found a regression in kernel 4.8-rc2 that
causes the speed of
my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to
suddenly have a
maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know
how to trigger
this problem,
thus a bisection is not possible. It usually happens
under heavy
load, such as a
kernel build or the RPM build of VirtualBox, but it
does not always
fail with
these loads. In my most recent failure, 'hwinfo --cpu'
reports cpu
MHz of
396.130 for #3. The bogomips value is 5787.73, and
the cpu clock
before the
fault is 3437 MHz. Nothing is logged when this
happens.

If I were to get a patch that would show a backtrace
when the
maximum CPU
frequency is changed, perhaps it would be possible to
track this
bug.



I have not yet found the bad commit, but I have reduced
the range of
commits a
bit. This bug has been difficult to trigger. So far, it
has not taken
over 1/2
day to appear in bad kernels, thus I am allowing three
days before
deciding that
a given trial is good. I never saw the problem with 4.7
kernels, but
I did in
4.8-rc1. I also know that it appeared before commit
581e0cd. Commit
1b05cf6 did
not show the bug.

Testing continues.



And still does. My bisection seemed to be trending toward
an
improbable set of
commits, and I needed to do some other work with the
machine, thus I
started
running 4.8-rc6. It failed nearly 48 hours after the
reboot, which
indicated
that using 3 days to indicate a "good" trial was likely
too short. I
am
currently testing the first of the trial and will run it
for at least
a week. It
is unlikely that these tests will be complete before 4.8
is released,
even if
-rc8 is needed. I will keep attempting to find the faulty
commit.



My debugging continues. After 7 days of beating on commit
f7816ad, I
have
concluded that it is likely good. Thus I think the bug lies
between
commit
581e0cd (bad) and f7816ad (good). I will need to do a long
test on
commit
1b05cf6, which did not fail with a shorter run.



581e0cd is not a valid mainline commit hash AFAICS.



That was a typo. The correct value is 581e0c7.




What cpufreq driver do you use?



My "Default CPUFreq governor" is on demand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config'
results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y

Commit 1b05cf6 did fail on longer testing, thus my bisection
had ended up
going
wrong. Further tests have shown that commit 351a4ded is bad.
Once again,
my
bisection seems to be converging to a set of commits that seem
unlikely
to cause
this problem. Perhaps commit f7816ad is not really good even
though it
survived
7 days of heavy CPU usage.

I have been reluctant to post my entire .config on the list. It
is
available at
http://pastebin.com/aMZaAKwL.



If the governor is ondemand, the driver is acpi-cpufreq, most
likely.

How do you measure the frequency?



Mostly I use a KDE applet named "System load" and look at the
"average
clock", but the same info is also available in /proc/cpuinfo as
"cpu MHz".
When the bug triggers, the system gets very slow, and the cpu fan
stops even
though the cpu is still busy.


That sounds like thermal throttling kicking in.


This will help us find out whether the OS is applying a thermal throttle.
# cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
# grep -r . /sys/class/thermal/thermal_zone*


With the system OK, I get

finger@linux-1t8h:~/wireless-drivers-next> cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq

3600000
3600000
3600000
3600000

finger@linux-1t8h:~/wireless-drivers-next> grep -r . /sys/class/thermal/thermal_zone*

grep: /sys/class/thermal/thermal_zone0/k_d: Input/output error
grep: /sys/class/thermal/thermal_zone0/k_i: Input/output error
grep: /sys/class/thermal/thermal_zone0/k_po: Inp

Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:

On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
 wrote:

On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:


On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:


On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:


On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:


On 09/18/2016 09:54 PM, Larry Finger wrote:


On 09/14/2016 11:00 AM, Larry Finger wrote:


On 09/09/2016 12:39 PM, Larry Finger wrote:


I have found a regression in kernel 4.8-rc2 that causes the speed of
my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger
this problem,
thus a bisection is not possible. It usually happens under heavy
load, such as a
kernel build or the RPM build of VirtualBox, but it does not always
fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu
MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock
before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the
maximum CPU
frequency is changed, perhaps it would be possible to track this
bug.



I have not yet found the bad commit, but I have reduced the range of
commits a
bit. This bug has been difficult to trigger. So far, it has not taken
over 1/2
day to appear in bad kernels, thus I am allowing three days before
deciding that
a given trial is good. I never saw the problem with 4.7 kernels, but
I did in
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
1b05cf6 did
not show the bug.

Testing continues.



And still does. My bisection seemed to be trending toward an
improbable set of
commits, and I needed to do some other work with the machine, thus I
started
running 4.8-rc6. It failed nearly 48 hours after the reboot, which
indicated
that using 3 days to indicate a "good" trial was likely too short. I
am
currently testing the first of the trial and will run it for at least
a week. It
is unlikely that these tests will be complete before 4.8 is released,
even if
-rc8 is needed. I will keep attempting to find the faulty commit.



My debugging continues. After 7 days of beating on commit f7816ad, I
have
concluded that it is likely good. Thus I think the bug lies between
commit
581e0cd (bad) and f7816ad (good). I will need to do a long test on
commit
1b05cf6, which did not fail with a shorter run.



581e0cd is not a valid mainline commit hash AFAICS.



That was a typo. The correct value is 581e0c7.



What cpufreq driver do you use?



My "Default CPUFreq governor" is on demand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y

Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up
going
wrong. Further tests have shown that commit 351a4ded is bad. Once again,
my
bisection seems to be converging to a set of commits that seem unlikely
to cause
this problem. Perhaps commit f7816ad is not really good even though it
survived
7 days of heavy CPU usage.

I have been reluctant to post my entire .config on the list. It is
available at
http://pastebin.com/aMZaAKwL.



If the governor is ondemand, the driver is acpi-cpufreq, most likely.

How do you measure the frequency?



Mostly I use a KDE applet named "System load" and look at the "average
clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
When the bug triggers, the system gets very slow, and the cpu fan stops even
though the cpu is still busy.


That sounds like thermal throttling kicking in.


I think it is because the cpu is idling. If thermal throttling is responsible,
why would it not fail for 168 hours, and then fail in 2?



What's there under /sys/class/thermal/ on your system?


It contains the following directories:

cooling_device0  cooling_device1  cooling_device2  cooling_device3 
cooling_device4  thermal_zone0  thermal_zone1



Commit f7816ad, which had run for 7 days without showing the bug, failed
after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
well, that's the way it goes!


Are you confident that the issue was not reproducible before 4.8-rc2?
In particular, what about 4.8-rc1?


4.8-rc1 is definitely bad. I 

Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Mon, Sep 26, 2016 at 11:41 PM, Srinivas Pandruvada
 wrote:
> On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
>> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
>>  wrote:
>> >
>> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> > >
>> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>> >
>> > [...]
>> >
>> > >
>> > > >
>> > > > I have been reluctant to post my entire .config on the list. It
>> > > > is
>> > > > available at
>> > > > http://pastebin.com/aMZaAKwL.
>> > >
>> > > If the governor is ondemand, the driver is acpi-cpufreq, most
>> > > likely.
>> > >
>> > > How do you measure the frequency?
>> > >
>> > Also
>> > When you get into this situation, please dump:
>> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
>> > # cat /sys/devices/system/cpu/intel_pstate/*
>>
>> The driver is not intel_pstate.
> I guessed from
> CONFIG_X86_INTEL_PSTATE=y
> and
> Frequency is not 400 but something like 396.130

Ah.  Good catch!


Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Srinivas Pandruvada
On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
>  wrote:
> > 
> > On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
> > > 
> > > 
> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> > > > 
> > > > 
> > > > On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> > > > > 
> > > > > 
> > > > > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> > > > > > 
> > > > > > 
> > > > > > On 09/18/2016 09:54 PM, Larry Finger wrote:
> > > > > > > 
> > > > > > > 
> > > > > > > On 09/14/2016 11:00 AM, Larry Finger wrote:
> > > > > > > > 
> > > > > > > > 
> > > > > > > > On 09/09/2016 12:39 PM, Larry Finger wrote:
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I have found a regression in kernel 4.8-rc2 that
> > > > > > > > > causes the speed of
> > > > > > > > > my laptop
> > > > > > > > > with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to
> > > > > > > > > suddenly have a
> > > > > > > > > maximum cpu
> > > > > > > > > frequency of ~400 MHz. Unfortunately, I do not know
> > > > > > > > > how to trigger
> > > > > > > > > this problem,
> > > > > > > > > thus a bisection is not possible. It usually happens
> > > > > > > > > under heavy
> > > > > > > > > load, such as a
> > > > > > > > > kernel build or the RPM build of VirtualBox, but it
> > > > > > > > > does not always
> > > > > > > > > fail with
> > > > > > > > > > these loads. In my most recent failure, 'hwinfo --cpu'
> > > > > > > > > > reports cpu
> > > > > > > > > MHz of
> > > > > > > > > 396.130 for #3. The bogomips value is 5787.73, and
> > > > > > > > > the cpu clock
> > > > > > > > > before the
> > > > > > > > > fault is 3437 MHz. Nothing is logged when this
> > > > > > > > > happens.
> > > > > > > > > 
> > > > > > > > > If I were to get a patch that would show a backtrace
> > > > > > > > > when the
> > > > > > > > > maximum CPU
> > > > > > > > > frequency is changed, perhaps it would be possible to
> > > > > > > > > track this
> > > > > > > > > bug.
> > > > > > > > 
> > > > > > > > 
> > > > > > > > I have not yet found the bad commit, but I have reduced
> > > > > > > > the range of
> > > > > > > > commits a
> > > > > > > > bit. This bug has been difficult to trigger. So far, it
> > > > > > > > has not taken
> > > > > > > > over 1/2
> > > > > > > > day to appear in bad kernels, thus I am allowing three
> > > > > > > > days before
> > > > > > > > deciding that
> > > > > > > > a given trial is good. I never saw the problem with 4.7
> > > > > > > > kernels, but
> > > > > > > > I did in
> > > > > > > > 4.8-rc1. I also know that it appeared before commit
> > > > > > > > 581e0cd. Commit
> > > > > > > > 1b05cf6 did
> > > > > > > > not show the bug.
> > > > > > > > 
> > > > > > > > Testing continues.
> > > > > > > 
> > > > > > > 
> > > > > > > And still does. My bisection seemed to be trending toward
> > > > > > > an
> > > > > > > improbable set of
> > > > > > > commits, and I needed to do some other work with the
> > > > > > > machine, thus I
> > > > > > > started
> > > > > > > running 4.8-rc6. It failed nearly 48 hours after the
> > > > > > > reboot, which
> > > > > > > indicated
> > > > > > > that using 3 days to indicate a "good" trial was likely
> > > > > > > too short. I
> > > > > > > am
> > > > > > > currently testing the first of the trial and will run it
> > > > > > > for at least
> > > > > > > a week. It
> > > > > > > > is unlikely that these tests will be complete before 4.8
> > > > > > > is released,
> > > > > > > even if
> > > > > > > -rc8 is needed. I will keep attempting to find the faulty
> > > > > > > commit.
> > > > > > 
> > > > > > 
> > > > > > My debugging continues. After 7 days of beating on commit
> > > > > > f7816ad, I
> > > > > > have
> > > > > > concluded that it is likely good. Thus I think the bug lies
> > > > > > between
> > > > > > commit
> > > > > > 581e0cd (bad) and f7816ad (good). I will need to do a long
> > > > > > test on
> > > > > > commit
> > > > > > 1b05cf6, which did not fail with a shorter run.
> > > > > 
> > > > > 
> > > > > 581e0cd is not a valid mainline commit hash AFAICS.
> > > > 
> > > > 
> > > > That was a typo. The correct value is 581e0c7.
> > > > > 
> > > > > 
> > > > > 
> > > > > What cpufreq driver do you use?
> > > > 
> > > > 
> > > > My "Default CPUFreq governor" is on demand.
> > > > 
> > > > Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config'
> > > > results in
> > > > 
> > > > CONFIG_ACPI_CPU_FREQ_PSS=y
> > > > CONFIG_CPU_FREQ=y
> > > > CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> > > > CONFIG_CPU_FREQ_GOV_COMMON=y
> > > > # CONFIG_CPU_FREQ_STAT is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
> > > > CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> > > > CONFIG_CPU_FREQ_GOV_PERFO

Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Srinivas Pandruvada
On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
>  wrote:
> > 
> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> > > 
> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> > 
> > [...]
> > 
> > > 
> > > > 
> > > > I have been reluctant to post my entire .config on the list. It
> > > > is
> > > > available at
> > > > http://pastebin.com/aMZaAKwL.
> > > 
> > > If the governor is ondemand, the driver is acpi-cpufreq, most
> > > likely.
> > > 
> > > How do you measure the frequency?
> > > 
> > Also
> > When you get into this situation, please dump:
> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> > # cat /sys/devices/system/cpu/intel_pstate/*
> 
> The driver is not intel_pstate.
I guessed from
CONFIG_X86_INTEL_PSTATE=y
and
Frequency is not 400 but something like 396.130


Thanks,
Srinivas



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger
 wrote:
> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>
>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>>>
>>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:

>>>> On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
>>>>>
>>>>> On 09/18/2016 09:54 PM, Larry Finger wrote:
>>>>>>
>>>>>> On 09/14/2016 11:00 AM, Larry Finger wrote:
>>>>>>>
>>>>>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
>>>>>>>>
>>>>>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of
>>>>>>>> my laptop
>>>>>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
>>>>>>>> maximum cpu
>>>>>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger
>>>>>>>> this problem,
>>>>>>>> thus a bisection is not possible. It usually happens under heavy
>>>>>>>> load, such as a
>>>>>>>> kernel build or the RPM build of VirtualBox, but it does not always
>>>>>>>> fail with
>>>>>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu
>>>>>>>> MHz of
>>>>>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock
>>>>>>>> before the
>>>>>>>> fault is 3437 MHz. Nothing is logged when this happens.
>>>>>>>>
>>>>>>>> If I were to get a patch that would show a backtrace when the
>>>>>>>> maximum CPU
>>>>>>>> frequency is changed, perhaps it would be possible to track this
>>>>>>>> bug.
>>>>>>>
>>>>>>>
>>>>>>> I have not yet found the bad commit, but I have reduced the range of
>>>>>>> commits a
>>>>>>> bit. This bug has been difficult to trigger. So far, it has not taken
>>>>>>> over 1/2
>>>>>>> day to appear in bad kernels, thus I am allowing three days before
>>>>>>> deciding that
>>>>>>> a given trial is good. I never saw the problem with 4.7 kernels, but
>>>>>>> I did in
>>>>>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
>>>>>>> 1b05cf6 did
>>>>>>> not show the bug.
>>>>>>>
>>>>>>> Testing continues.
>>>>>>
>>>>>>
>>>>>> And still does. My bisection seemed to be trending toward an
>>>>>> improbable set of
>>>>>> commits, and I needed to do some other work with the machine, thus I
>>>>>> started
>>>>>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which
>>>>>> indicated
>>>>>> that using 3 days to indicate a "good" trial was likely too short. I
>>>>>> am
>>>>>> currently testing the first of the trial and will run it for at least
>>>>>> a week. It
>>>>>> is unlikely that these tests will be complete before 4.8 is released,
>>>>>> even if
>>>>>> -rc8 is needed. I will keep attempting to find the faulty commit.
>>>>>
>>>>>
>>>>> My debugging continues. After 7 days of beating on commit f7816ad, I
>>>>> have
>>>>> concluded that it is likely good. Thus I think the bug lies between
>>>>> commit
>>>>> 581e0cd (bad) and f7816ad (good). I will need to do a long test on
>>>>> commit
>>>>> 1b05cf6, which did not fail with a shorter run.
>>>>
>>>>
>>>> 581e0cd is not a valid mainline commit hash AFAICS.
>>>
>>>
>>> That was a typo. The correct value is 581e0c7.
>>>>
>>>>
>>>> What cpufreq driver do you use?
>>>
>>>
>>> My "Default CPUFreq governor" is on demand.
>>>
>>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>>>
>>> CONFIG_ACPI_CPU_FREQ_PSS=y
>>> CONFIG_CPU_FREQ=y
>>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>>> CONFIG_CPU_FREQ_GOV_COMMON=y
>>> # CONFIG_CPU_FREQ_STAT is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
>>> CONFIG_CPU_FREQ_GOV_USERSPACE=m
>>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
>>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>> CONFIG_X86_PCC_CPUFREQ=m
>>> CONFIG_X86_ACPI_CPUFREQ=m
>>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>>>
>>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up
>>> going
>>> wrong. Further tests have shown that commit 351a4ded is bad. Once again,
>>> my
>>> bisection seems to be converging to a set of commits that seem unlikely
>>> to cause
>>> this problem. Perhaps commit f7816ad is not really good even though it
>>> survived
>>> 7 days of heavy CPU usage.
>>>
>>> I have been reluctant to post my entire .config on the list. It is
>>> available at
>>> http://pastebin.com/aMZaAKwL.
>>
>>
>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>
>> How do you measure the frequency?
>
>
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.

That sounds like thermal throttling kicking in.

What's the

Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada
 wrote:
> On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>
> [...]
>
>> > I have been reluctant to post my entire .config on the list. It is
>> > available at
>> > http://pastebin.com/aMZaAKwL.
>>
>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>
>> How do you measure the frequency?
>>
> Also
> When you get into this situation, please dump:
> # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> # cat /sys/devices/system/cpu/intel_pstate/*

The driver is not intel_pstate.

Thanks,
Rafael


Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:

On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:

On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:

On 09/18/2016 09:54 PM, Larry Finger wrote:

On 09/14/2016 11:00 AM, Larry Finger wrote:

On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
thus a bisection is not possible. It usually happens under heavy load, such as a
kernel build or the RPM build of VirtualBox, but it does not always fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU
frequency is changed, perhaps it would be possible to track this bug.


I have not yet found the bad commit, but I have reduced the range of commits a
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
day to appear in bad kernels, thus I am allowing three days before deciding that
a given trial is good. I never saw the problem with 4.7 kernels, but I did in
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
not show the bug.

Testing continues.


And still does. My bisection seemed to be trending toward an improbable set of
commits, and I needed to do some other work with the machine, thus I started
running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
that using 3 days to indicate a "good" trial was likely too short. I am
currently testing the first of the trial and will run it for at least a week. It
is unlikely that these tests will be complete before 4.8 is released, even if
-rc8 is needed. I will keep attempting to find the faulty commit.


My debugging continues. After 7 days of beating on commit f7816ad, I have
concluded that it is likely good. Thus I think the bug lies between commit
581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
1b05cf6, which did not fail with a shorter run.


581e0cd is not a valid mainline commit hash AFAICS.


That was a typo. The correct value is 581e0c7.


What cpufreq driver do you use?


My "Default CPUFreq governor" is on demand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y

Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going
wrong. Further tests have shown that commit 351a4ded is bad. Once again, my
bisection seems to be converging to a set of commits that seem unlikely to cause
this problem. Perhaps commit f7816ad is not really good even though it survived
7 days of heavy CPU usage.

I have been reluctant to post my entire .config on the list. It is available at
http://pastebin.com/aMZaAKwL.


If the governor is ondemand, the driver is acpi-cpufreq, most likely.

How do you measure the frequency?


Mostly I use a KDE applet named "System load" and look at the "average clock", 
but the same info is also available in /proc/cpuinfo as "cpu MHz". When the bug 
triggers, the system gets very slow, and the cpu fan stops even though the cpu 
is still busy.
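
Since nothing is logged when the clock drops, the /proc/cpuinfo reading mentioned above could be sampled from a script. The following is only a rough sketch: the function names max_cpu_mhz and watch_cpu_mhz, the 500 MHz threshold, and the 60 second interval are all assumptions made for illustration, not anything taken from the kernel or from this thread.

```shell
#!/bin/sh
# Print the highest "cpu MHz" value in a cpuinfo-style file, rounded to an integer.
# The file defaults to /proc/cpuinfo; the optional argument is there for testing.
max_cpu_mhz() {
    awk -F': *' '/^cpu MHz/ { if ($2 + 0 > max) max = $2 + 0 }
                 END { printf "%.0f\n", max }' "${1:-/proc/cpuinfo}"
}

# Hypothetical watcher: print a timestamped line whenever even the fastest
# core reports less than the threshold. Runs until interrupted.
watch_cpu_mhz() {
    threshold=${1:-500}   # MHz; an assumed cut-off
    interval=${2:-60}     # seconds between samples; also assumed
    while :; do
        mhz=$(max_cpu_mhz)
        if [ "$mhz" -lt "$threshold" ]; then
            echo "$(date) max cpu MHz = $mhz"
        fi
        sleep "$interval"
    done
}
```

Running `watch_cpu_mhz 500 60 >> cpu-mhz.log &` before starting a heavy build would at least record roughly when the drop happened, which could then be compared against dmesg.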


Commit f7816ad, which had run for 7 days without showing the bug, failed after 
about 2 hours today. All my testing since Sept. 9 has been wasted. Oh well, 
that's the way it goes!


Thanks,

Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Srinivas Pandruvada
On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[...]

> > I have been reluctant to post my entire .config on the list. It is
> > available at 
> > http://pastebin.com/aMZaAKwL.
> 
> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
> 
> How do you measure the frequency?
> 
Also, when you get into this situation, please dump:
# cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
# cat /sys/devices/system/cpu/intel_pstate/*
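
A plain cat of several files prints the values without saying which file each came from. A small wrapper along these lines may make the dump easier to read; the helper name dump_files is made up for this sketch, and the intel_pstate directory will most likely not exist if the driver turns out to be acpi-cpufreq, in which case that glob is simply skipped.

```shell
#!/bin/sh
# Print "path: contents" for every readable file given; skip anything else,
# including glob patterns that matched nothing.
dump_files() {
    for f in "$@"; do
        [ -r "$f" ] || continue
        printf '%s: %s\n' "$f" "$(cat "$f")"
    done
}

# The two dumps requested above.
dump_files /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq \
           /sys/devices/system/cpu/intel_pstate/*
```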


Thanks,
Srinivas



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> >> On 09/18/2016 09:54 PM, Larry Finger wrote:
> >>> On 09/14/2016 11:00 AM, Larry Finger wrote:
> >>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my
> >>>>> laptop
> >>>>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a
> >>>>> maximum cpu
> >>>>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this
> >>>>> problem,
> >>>>> thus a bisection is not possible. It usually happens under heavy load,
> >>>>> such as a
> >>>>> kernel build or the RPM build of VirtualBox, but it does not always
> >>>>> fail with
> >>>>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz
> >>>>> of
> >>>>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before
> >>>>> the
> >>>>> fault is 3437 MHz. Nothing is logged when this happens.
> >>>>>
> >>>>> If I were to get a patch that would show a backtrace when the maximum
> >>>>> CPU
> >>>>> frequency is changed, perhaps it would be possible to track this bug.
> >>>>
> >>>> I have not yet found the bad commit, but I have reduced the range of
> >>>> commits a
> >>>> bit. This bug has been difficult to trigger. So far, it has not taken
> >>>> over 1/2
> >>>> day to appear in bad kernels, thus I am allowing three days before
> >>>> deciding that
> >>>> a given trial is good. I never saw the problem with 4.7 kernels, but I
> >>>> did in
> >>>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit
> >>>> 1b05cf6 did
> >>>> not show the bug.
> >>>>
> >>>> Testing continues.
> >>>
> >>> And still does. My bisection seemed to be trending toward an improbable 
> >>> set of
> >>> commits, and I needed to do some other work with the machine, thus I 
> >>> started
> >>> running 4.8-rc6. It failed nearly 48 hours after the reboot, which 
> >>> indicated
> >>> that using 3 days to indicate a "good" trial was likely too short. I am
> >>> currently testing the first of the trial and will run it for at least a 
> >>> week. It
> >>> is unlikely that these tests will be complete before 4.8 is released,
> >>> even if
> >>> -rc8 is needed. I will keep attempting to find the faulty commit.
> >>
> >> My debugging continues. After 7 days of beating on commit f7816ad, I have
> >> concluded that it is likely good. Thus I think the bug lies between commit
> >> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
> >> 1b05cf6, which did not fail with a shorter run.
> >
> > 581e0cd is not a valid mainline commit hash AFAICS.
> 
> That was a typo. The correct value is 581e0c7.
> >
> > What cpufreq driver do you use?
> 
> My "Default CPUFreq governor" is ondemand.
> 
> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
> 
> CONFIG_ACPI_CPU_FREQ_PSS=y
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> CONFIG_CPU_FREQ_GOV_COMMON=y
> # CONFIG_CPU_FREQ_STAT is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
> CONFIG_CPU_FREQ_GOV_USERSPACE=m
> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> CONFIG_X86_PCC_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ_CPB=y
> 
> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up 
> going 
> wrong. Further tests have shown that commit 351a4ded is bad. Once again, my
> bisection seems to be converging to a set of commits that seem unlikely to 
> cause 
> this problem. Perhaps commit f7816ad is not really good even though it 
> survived 
> 7 days of heavy CPU usage.
> 
> I have been reluctant to post my entire .config on the list. It is available 
> at 
> http://pastebin.com/aMZaAKwL.

If the governor is ondemand, the driver is acpi-cpufreq, most likely.
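
The driver and governor actually in use can be read back from sysfs rather than inferred from the .config. A minimal sketch, assuming the standard cpufreq sysfs attributes scaling_driver and scaling_governor under cpu0; the helper name show_cpufreq_policy and its optional directory argument are made up for illustration:

```shell
#!/bin/sh
# Print the cpufreq driver and governor for one CPU's policy directory.
# Defaults to cpu0; attributes that cannot be read are silently skipped.
show_cpufreq_policy() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    for attr in scaling_driver scaling_governor; do
        if [ -r "$dir/$attr" ]; then
            printf '%s=%s\n' "$attr" "$(cat "$dir/$attr")"
        fi
    done
}

show_cpufreq_policy
```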

How do you measure the frequency?

Thanks,
Rafael



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Larry Finger

On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:

On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:

On 09/18/2016 09:54 PM, Larry Finger wrote:

On 09/14/2016 11:00 AM, Larry Finger wrote:

On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
thus a bisection is not possible. It usually happens under heavy load, such as a
kernel build or the RPM build of VirtualBox, but it does not always fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU
frequency is changed, perhaps it would be possible to track this bug.


I have not yet found the bad commit, but I have reduced the range of commits a
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
day to appear in bad kernels, thus I am allowing three days before deciding that
a given trial is good. I never saw the problem with 4.7 kernels, but I did in
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
not show the bug.

Testing continues.


And still does. My bisection seemed to be trending toward an improbable set of
commits, and I needed to do some other work with the machine, thus I started
running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
that using 3 days to indicate a "good" trial was likely too short. I am
currently testing the first of the trial and will run it for at least a week. It
is unlikely that these tests will be complete before 4.8 is released, even if
-rc8 is needed. I will keep attempting to find the faulty commit.


My debugging continues. After 7 days of beating on commit f7816ad, I have
concluded that it is likely good. Thus I think the bug lies between commit
581e0cd (bad) and f7816ad (good). I will need to do a long test on commit
1b05cf6, which did not fail with a shorter run.


581e0cd is not a valid mainline commit hash AFAICS.


That was a typo. The correct value is 581e0c7.


What cpufreq driver do you use?


My "Default CPUFreq governor" is ondemand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y

Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going 
wrong. Further tests have shown that commit 351a4ded is bad. Once again, my
bisection seems to be converging to a set of commits that seem unlikely to cause 
this problem. Perhaps commit f7816ad is not really good even though it survived 
7 days of heavy CPU usage.


I have been reluctant to post my entire .config on the list. It is available at 
http://pastebin.com/aMZaAKwL.


Larry



Re: Regression in 4.8 - CPU speed set very low

2016-09-26 Thread Rafael J. Wysocki
On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> On 09/18/2016 09:54 PM, Larry Finger wrote:
> > On 09/14/2016 11:00 AM, Larry Finger wrote:
> >> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>> I have found a regression in kernel 4.8-rc2 that causes the speed of my 
> >>> laptop
> >>> with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a 
> >>> maximum cpu
> >>> frequency of ~400 MHz. Unfortunately, I do not know how to trigger this 
> >>> problem,
> >>> thus a bisection is not possible. It usually happens under heavy load, 
> >>> such as a
> >>> kernel build or the RPM build of VirtualBox, but it does not always fail 
> >>> with
> >>> these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
> >>> 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before 
> >>> the
> >>> fault is 3437 MHz. Nothing is logged when this happens.
> >>>
> >>> If I were to get a patch that would show a backtrace when the maximum CPU
> >>> frequency is changed, perhaps it would be possible to track this bug.
> >>
> >> I have not yet found the bad commit, but I have reduced the range of 
> >> commits a
> >> bit. This bug has been difficult to trigger. So far, it has not taken over 
> >> 1/2
> >> day to appear in bad kernels, thus I am allowing three days before 
> >> deciding that
> >> a given trial is good. I never saw the problem with 4.7 kernels, but I did 
> >> in
> >> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 
> >> 1b05cf6 did
> >> not show the bug.
> >>
> >> Testing continues.
> >
> > And still does. My bisection seemed to be trending toward an improbable set 
> > of
> > commits, and I needed to do some other work with the machine, thus I started
> > running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
> > that using 3 days to indicate a "good" trial was likely too short. I am
> > currently testing the first of the trial and will run it for at least a 
> > week. It
> > is unlikely that these tests will be complete before 4.8 is released, even
> > if
> > -rc8 is needed. I will keep attempting to find the faulty commit.
> 
> My debugging continues. After 7 days of beating on commit f7816ad, I have 
> concluded that it is likely good. Thus I think the bug lies between commit 
> 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 
> 1b05cf6, which did not fail with a shorter run.

581e0cd is not a valid mainline commit hash AFAICS.

What cpufreq driver do you use?

Thanks,
Rafael



Re: Regression in 4.8 - CPU speed set very low

2016-09-23 Thread Larry Finger

On 09/18/2016 09:54 PM, Larry Finger wrote:

On 09/14/2016 11:00 AM, Larry Finger wrote:

On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
thus a bisection is not possible. It usually happens under heavy load, such as a
kernel build or the RPM build of VirtualBox, but it does not always fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU
frequency is changed, perhaps it would be possible to track this bug.


I have not yet found the bad commit, but I have reduced the range of commits a
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
day to appear in bad kernels, thus I am allowing three days before deciding that
a given trial is good. I never saw the problem with 4.7 kernels, but I did in
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
not show the bug.

Testing continues.


And still does. My bisection seemed to be trending toward an improbable set of
commits, and I needed to do some other work with the machine, thus I started
running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated
that using 3 days to indicate a "good" trial was likely too short. I am
currently testing the first of the trial and will run it for at least a week. It
is unlikely that these tests will be complete before 4.8 is released, even if
-rc8 is needed. I will keep attempting to find the faulty commit.


My debugging continues. After 7 days of beating on commit f7816ad, I have 
concluded that it is likely good. Thus I think the bug lies between commit 
581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 
1b05cf6, which did not fail with a shorter run.


Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-18 Thread Larry Finger

On 09/14/2016 11:00 AM, Larry Finger wrote:

On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
thus a bisection is not possible. It usually happens under heavy load, such as a
kernel build or the RPM build of VirtualBox, but it does not always fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU
frequency is changed, perhaps it would be possible to track this bug.


I have not yet found the bad commit, but I have reduced the range of commits a
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2
day to appear in bad kernels, thus I am allowing three days before deciding that
a given trial is good. I never saw the problem with 4.7 kernels, but I did in
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did
not show the bug.

Testing continues.


And still does. My bisection seemed to be trending toward an improbable set of 
commits, and I needed to do some other work with the machine, thus I started 
running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated 
that using 3 days to indicate a "good" trial was likely too short. I am 
currently testing the first of the trial and will run it for at least a week. It 
is unlikely that these tests will be complete before 4.8 is released, even if
-rc8 is needed. I will keep attempting to find the faulty commit.


Larry




Re: Regression in 4.8 - CPU speed set very low

2016-09-14 Thread Larry Finger

On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop
with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu
frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem,
thus a bisection is not possible. It usually happens under heavy load, such as a
kernel build or the RPM build of VirtualBox, but it does not always fail with
these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of
396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the
fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU
frequency is changed, perhaps it would be possible to track this bug.


I have not yet found the bad commit, but I have reduced the range of commits a 
bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 
day to appear in bad kernels, thus I am allowing three days before deciding that 
a given trial is good. I never saw the problem with 4.7 kernels, but I did in 
4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did 
not show the bug.


Testing continues.

Larry