Re: Regression in 4.8 - CPU speed set very low
On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
> [...]
>>> My laptop was inadvertently put to sleep while I was gone. I forgot
>>> to leave a note for my wife and she quieted the noisy cpu fan. :)
>>
>> It looks like in 4.8-rc we made a change that caused the "high" trip
>> point to be acted on.
>
> This high trip point we don't expose in thermal subsystem (the thermal
> zone dump didn't show this anywhere as a trip). This is exposed by
> core-dts driver only. This is the point BIOS is supposed to act, I
> guess that's why you are seeing 50% clock modulation.
>
> Are you running thermald
> What is?
> # ps -e | grep thermald
>
>> Srinivas, Rui, do you recall what that can be?
>>
>> One more question (I think I asked it previously): In the failing case
>> (4.8-rc1 and later), when the frequency drops down to the 400 MHz, does
>> it ever go back higher or is it stuck at that level forever?
>>
>> In any case, it may help to file a bug at bugzilla.kernel.org against
>> CPU/thermal or similar and let me know the bug number. We'll need to
>> collect some tracepoint data to debug this and some place to put them
>> into for easy reference.
>
> Yes, this is good idea.

To complete the record in this thread, the problem also happened with
kernel 4.7, thus it is not a regression in 4.8-rcX. The full discussion is at
https://bugzilla.kernel.org/show_bug.cgi?id=173361.

Larry
Re: Regression in 4.8 - CPU speed set very low
On Thursday, September 29, 2016 08:56:16 AM Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
> > [...]
> > > My laptop was inadvertently put to sleep while I was gone. I forgot
> > > to leave a note for my wife and she quieted the noisy cpu fan. :)
> >
> > It looks like in 4.8-rc we made a change that caused the "high" trip
> > point to be acted on.
>
> This high trip point we don't expose in thermal subsystem (the thermal
> zone dump didn't show this anywhere as a trip). This is exposed by
> core-dts driver only. This is the point BIOS is supposed to act, I
> guess that's why you are seeing 50% clock modulation.

Right. That's SMM kicking in.

The real problem is that we get stuck at 400 MHz.

Thanks,
Rafael
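The "50% clock modulation" mentioned above can be related to a register value with a little arithmetic. The sketch below decodes a raw IA32_CLOCK_MODULATION value (MSR 0x19A): bit 4 is the on-demand enable flag and bits 3:1 select the duty cycle in 12.5% steps. It is only an illustration; the function name decode_clock_mod is hypothetical, and nothing in the thread confirms that the firmware throttles through this particular register on this laptop.

```shell
# Decode a raw IA32_CLOCK_MODULATION (MSR 0x19A) value.
# Hypothetical helper: takes the value as decimal or 0x-prefixed hex.
decode_clock_mod() {
    val=$(( $1 ))
    enabled=$(( (val >> 4) & 1 ))   # bit 4: on-demand modulation enable
    field=$(( (val >> 1) & 7 ))     # bits 3:1: duty cycle in 1/8 steps
    if [ "$enabled" -eq 1 ]; then
        tenths=$(( field * 125 ))   # 12.5% per step, kept in tenths of a percent
        printf 'on-demand modulation enabled, duty cycle %d.%d%%\n' \
            $(( tenths / 10 )) $(( tenths % 10 ))
    else
        echo 'on-demand modulation disabled'
    fi
}

# Possible use as root with msr-tools installed (rdmsr prints hex without 0x):
#   decode_clock_mod "0x$(rdmsr -p 0 0x19a)"
```

A value of 0x18 (enable bit set, duty field 100b) corresponds to the 50% case discussed in the thread.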
Re: Regression in 4.8 - CPU speed set very low
On 09/29/2016 10:56 AM, Srinivas Pandruvada wrote:
> On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
> [...]
>>> My laptop was inadvertently put to sleep while I was gone. I forgot
>>> to leave a note for my wife and she quieted the noisy cpu fan. :)
>>
>> It looks like in 4.8-rc we made a change that caused the "high" trip
>> point to be acted on.
>
> This high trip point we don't expose in thermal subsystem (the thermal
> zone dump didn't show this anywhere as a trip). This is exposed by
> core-dts driver only. This is the point BIOS is supposed to act, I
> guess that's why you are seeing 50% clock modulation.
>
> Are you running thermald
> What is?
> # ps -e | grep thermald

The output is blank. I am not running thermald.

Larry
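The thermald check above can also be written so that it returns an exit status instead of relying on reading blank output. This is a minimal sketch; the helper name is_running is hypothetical, and it matches on the command name reported by ps, which may be truncated for long names.

```shell
# Return success if a process with exactly this command name exists.
# Hypothetical helper; "ps -e -o comm=" prints one command name per line.
is_running() {
    ps -e -o comm= | grep -qx "$1"
}

if is_running thermald; then
    echo "thermald is running"
else
    echo "thermald is not running"
fi
```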
Re: Regression in 4.8 - CPU speed set very low
On Thu, 2016-09-29 at 14:19 +0200, Rafael J. Wysocki wrote:
[...]
> > My laptop was inadvertently put to sleep while I was gone. I forgot
> > to leave a note for my wife and she quieted the noisy cpu fan. :)
>
> It looks like in 4.8-rc we made a change that caused the "high" trip
> point to be acted on.

This high trip point we don't expose in thermal subsystem (the thermal
zone dump didn't show this anywhere as a trip). This is exposed by
core-dts driver only. This is the point BIOS is supposed to act, I
guess that's why you are seeing 50% clock modulation.

Are you running thermald? What is the output of:
# ps -e | grep thermald

> Srinivas, Rui, do you recall what that can be?
>
> One more question (I think I asked it previously): In the failing case
> (4.8-rc1 and later), when the frequency drops down to the 400 MHz, does
> it ever go back higher or is it stuck at that level forever?
>
> In any case, it may help to file a bug at bugzilla.kernel.org against
> CPU/thermal or similar and let me know the bug number. We'll need to
> collect some tracepoint data to debug this and some place to put them
> into for easy reference.

Yes, this is a good idea.

Thanks,
Srinivas
Re: Regression in 4.8 - CPU speed set very low
On 09/29/2016 07:19 AM, Rafael J. Wysocki wrote:
> On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
>> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
>>> On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
>>>> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>>>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>>>>> But for both we need a reproducer anyway.
>>>>>>>
>>>>>>> I do not have a reliable reproducer. The condition has always
>>>>>>> happened when running a high-compute job such as a 'make -j8' on
>>>>>>> the kernel, or building the RPM for openSUSE's implementation of
>>>>>>> VirtualBox. The latter is what I'm using for most of my testing.
>>>>>
>>>>> Run some CPU stressor and get all your CPU's going at 100% load.
>>>>> And watch your core temperatures while you do so.
>>>>
>>>> for i in 1 2 3 4; do while : ; do : ; done & done
>>>>
>>>> triggered the fault in a few minutes.
>>>>
>>>>>>>> It also would be good to rule out the thermal throttling (as per
>>>>>>>> the Srinivas' comments).
>>>>>
>>>>> It is almost certainly thermal throttling, or similar causing
>>>>> Clock modulation, of it seems 50%.
>>>>
>>>> While the infinite loops were running, the temps were:
>>>>
>>>> finger@linux-1t8h:~/rtlwifi_new> sensors
>>>> coretemp-isa-
>>>> Adapter: ISA adapter
>>>> Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
>>>> Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
>>>> Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)
>>>
>>> It looks like the trip point (high) temperature was exceeded causing
>>> thermal throttling to kick in.
>>>
>>>> After the fault occurs, I get
>>>>
>>>> finger@linux-1t8h:~/rtlwifi_new> sensors
>>>> coretemp-isa-
>>>> Adapter: ISA adapter
>>>> Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
>>>> Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
>>>> Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)
>>>
>>> So after that it stays at 400 MHz forever, right?
>>>
>>>>>>>> For now, please tell me what's in
>>>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>>>>
>>>>>>> 80
>>>>>>
>>>>>> Your effective freq is lower than 800MHz. One of the possible
>>>>>> reason is thermal throttling.
>>>>>>
>>>>>> What distro you are using?
>>>>>
>>>>> And what make and model of LapTop?
>>>>
>>>> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M
>>>> CPU @ 2.90GHz. That is a dual-core unit with hyperthreading.
>>>>
>>>> @Rafael: As I write this, the system has been running the infinite
>>>> loop test for almost 5 hours with kernel 4.7. I will leave that
>>>> running while I'm gone, but I am certain that it is OK.
>>>
>>> OK, and what temperatures do you see while doing this?
>>
>> finger@linux-1t8h:~/linux-2.6> sensors
>> coretemp-isa-
>> Adapter: ISA adapter
>> Physical id 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 1: +78.0°C (high = +84.0°C, crit = +100.0°C)
>>
>> Once again, the CPU temp is greater than the "high" value; however, the
>> clock rate continues to hold near 3600 MHz.
>>
>> My laptop was inadvertently put to sleep while I was gone. I forgot to
>> leave a note for my wife and she quieted the noisy cpu fan. :)
>
> It looks like in 4.8-rc we made a change that caused the "high" trip
> point to be acted on.
>
> Srinivas, Rui, do you recall what that can be?
>
> One more question (I think I asked it previously): In the failing case
> (4.8-rc1 and later), when the frequency drops down to the 400 MHz, does
> it ever go back higher or is it stuck at that level forever?
>
> In any case, it may help to file a bug at bugzilla.kernel.org against
> CPU/thermal or similar and let me know the bug number. We'll need to
> collect some tracepoint data to debug this and some place to put them
> into for easy reference.

Sorry if I missed that earlier question. The CPU is stuck at that lower
frequency until I reboot.

Bug report at https://bugzilla.kernel.org/show_bug.cgi?id=173361. I tried to
cover the main points of the discussion. Please add the ones that I missed.

Larry
Re: Regression in 4.8 - CPU speed set very low
On Wed, Sep 28, 2016 at 09:26:42PM -0500, Larry Finger wrote:
> By the time it gets slow, the CPU's cool, and one cannot see the temp just
> before that event happened.

Hmm, I would not expect the CPU to drop from 80 to 40 degrees in a few
seconds if the fan is not spinning. I wouldn't even expect it if the fan
was spinning. I would think at least 30 to 60 seconds if not more.

The only way I would think the temperature could change quickly would be
if the heatsink isn't even touching the CPU anymore so there is very little
material to hold the heat in the CPU.

> The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7.
> Of course, it could be something subtle that slightly changes the heat load,
> which causes the CPU temp to be a little higher so that the effect is
> triggered.
>
> I am reasonably confident that it is not a hardware problem, but we may have
> to wait until 4.8 is released and gets wider usage. If no one else reports a
> problem, then I am certainly wrong.

Well hard to reproduce bugs are always really annoying.

This old bug sounds a lot like what you are seeing:
https://bugzilla.redhat.com/show_bug.cgi?id=924570
and it links to this:
https://www.phoronix.com/scan.php?page=news_item&px=Linux-4.6-Thermal-Updates

Apparently turning off turbo boost seems to stop the problem for a lot of
people in that case. Doesn't explain why it started happening recently.
And of course that may have been a different problem in the past.

--
Len Sorensen
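Since the system in this thread uses the intel_pstate driver, turning off turbo boost as suggested above could be tried through the driver's no_turbo sysfs attribute. This is a minimal sketch; the function name turbo and the NO_TURBO_FILE override are hypothetical, and writing the attribute needs root.

```shell
# Toggle or query turbo boost through intel_pstate's no_turbo attribute.
# NO_TURBO_FILE is a hypothetical override for trying this on a scratch file.
turbo() {
    f="${NO_TURBO_FILE:-/sys/devices/system/cpu/intel_pstate/no_turbo}"
    case "$1" in
        off)    echo 1 > "$f" ;;   # 1 = turbo disabled
        on)     echo 0 > "$f" ;;   # 0 = turbo allowed
        status) cat "$f" ;;
        *)      echo "usage: turbo on|off|status" >&2; return 2 ;;
    esac
}

# Possible use as root:
#   turbo off && turbo status
```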
Re: Regression in 4.8 - CPU speed set very low
On Wednesday, September 28, 2016 09:22:59 PM Larry Finger wrote:
> On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> > On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
> >> On 09/26/2016 10:12 PM, Doug Smythies wrote:
> >>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> >>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> >>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> >>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> >>>>>> But for both we need a reproducer anyway.
> >>>>>
> >>>>> I do not have a reliable reproducer. The condition has always
> >>>>> happened when running a high-compute job such as a 'make -j8' on
> >>>>> the kernel, or building the RPM for openSUSE's implementation of
> >>>>> VirtualBox. The latter is what I'm using for most of my testing.
> >>>
> >>> Run some CPU stressor and get all your CPU's going at 100% load.
> >>> And watch your core temperatures while you do so.
> >>
> >> for i in 1 2 3 4; do while : ; do : ; done & done
> >>
> >> triggered the fault in a few minutes.
> >>
> >>>>>> It also would be good to rule out the thermal throttling (as per
> >>>>>> the Srinivas' comments).
> >>>
> >>> It is almost certainly thermal throttling, or similar causing
> >>> Clock modulation, of it seems 50%.
> >>
> >> While the infinite loops were running, the temps were:
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-
> >> Adapter: ISA adapter
> >> Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)
> >
> > It looks like the trip point (high) temperature was exceeded causing
> > thermal throttling to kick in.
> >
> >> After the fault occurs, I get
> >>
> >> finger@linux-1t8h:~/rtlwifi_new> sensors
> >> coretemp-isa-
> >> Adapter: ISA adapter
> >> Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
> >> Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)
> >
> > So after that it stays at 400 MHz forever, right?
> >
> >>>>>> For now, please tell me what's in
> >>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
> >>>>>
> >>>>> 80
> >>>>
> >>>> Your effective freq is lower than 800MHz. One of the possible
> >>>> reason is thermal throttling.
> >>>>
> >>>> What distro you are using?
> >>>
> >>> And what make and model of LapTop?
> >>
> >> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M
> >> CPU @ 2.90GHz. That is a dual-core unit with hyperthreading.
> >>
> >> @Rafael: As I write this, the system has been running the infinite
> >> loop test for almost 5 hours with kernel 4.7. I will leave that
> >> running while I'm gone, but I am certain that it is OK.
> >
> > OK, and what temperatures do you see while doing this?
>
> finger@linux-1t8h:~/linux-2.6> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
> Core 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
> Core 1: +78.0°C (high = +84.0°C, crit = +100.0°C)
>
> Once again, the CPU temp is greater than the "high" value; however, the
> clock rate continues to hold near 3600 MHz.
>
> My laptop was inadvertently put to sleep while I was gone. I forgot to
> leave a note for my wife and she quieted the noisy cpu fan. :)

It looks like in 4.8-rc we made a change that caused the "high" trip point to
be acted on.

Srinivas, Rui, do you recall what that can be?

One more question (I think I asked it previously): In the failing case
(4.8-rc1 and later), when the frequency drops down to the 400 MHz, does it
ever go back higher or is it stuck at that level forever?

In any case, it may help to file a bug at bugzilla.kernel.org against
CPU/thermal or similar and let me know the bug number. We'll need to
collect some tracepoint data to debug this and some place to put them
into for easy reference.

Thanks,
Rafael
Re: Regression in 4.8 - CPU speed set very low
On 09/27/2016 09:51 AM, Lennart Sorensen wrote:
> On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
>> Mostly I use a KDE applet named "System load" and look at the "average
>> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
>> When the bug triggers, the system gets very slow, and the cpu fan stops
>> even though the cpu is still busy.
>>
>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>> after about 2 hours today. All my testing since Sept. 9 has been wasted.
>> Oh well, that's the way it goes!
>
> Is it possible there is no bug and instead you have a hardware problem?
>
> What I am thinking: CPU fan stops, then CPU gets busy, CPU overheats,
> thermal throttling kicks in to protect CPU and it gets VERY slow.
>
> So maybe you have a bad CPU fan that is getting stuck. Perhaps even if
> you have a motherboard that varies the CPU fan depending on need and the
> fan doesn't like the lowest speed and sometimes gets stuck when asked to
> go slow.
>
> Of course if the CPU fan is the problem that could explain why it takes
> varying amounts of time to see the problem.
>
> I suggest checking what the cpu temperature sensors are showing next
> time it gets slow.

By the time it gets slow, the CPU's cool, and one cannot see the temp just
before that event happened.

The reason I suspect a bug is that it fails with 4.8-rcX, but not with 4.7.
Of course, it could be something subtle that slightly changes the heat load,
which causes the CPU temp to be a little higher so that the effect is
triggered.

I am reasonably confident that it is not a hardware problem, but we may have
to wait until 4.8 is released and gets wider usage. If no one else reports a
problem, then I am certainly wrong.

Larry
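One way around not being able to see the temperature just before the slowdown would be to log temperature and frequency continuously and read the log afterwards. The sketch below prints one sample per call; the function name sample_once and the HWMON_ROOT and CPUFREQ_ROOT overrides are hypothetical, and it simply takes the hottest temp*_input value under /sys/class/hwmon, which may include sensors other than coretemp.

```shell
# Print one sample: epoch seconds, hottest hwmon reading (millidegrees C)
# and the current cpu0 frequency (kHz). Root paths can be overridden.
sample_once() {
    hwmon="${HWMON_ROOT:-/sys/class/hwmon}"
    cpufreq="${CPUFREQ_ROOT:-/sys/devices/system/cpu}/cpu0/cpufreq"
    max=0
    for f in "$hwmon"/hwmon*/temp*_input; do
        [ -r "$f" ] || continue
        t=$(cat "$f")
        if [ "$t" -gt "$max" ]; then
            max=$t
        fi
    done
    khz=$(cat "$cpufreq/scaling_cur_freq" 2>/dev/null || echo unknown)
    printf '%s temp_mC=%s cpu0_khz=%s\n' "$(date +%s)" "$max" "$khz"
}

# Possible use, one sample per second until interrupted:
#   while sleep 1; do sample_once; done >> /tmp/thermal.log
```

With such a log, the last samples before the frequency drops would show whether the "high" value was actually crossed.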
Re: Regression in 4.8 - CPU speed set very low
On 09/27/2016 06:46 AM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
>> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>>> But for both we need a reproducer anyway.
>>>>>
>>>>> I do not have a reliable reproducer. The condition has always
>>>>> happened when running a high-compute job such as a 'make -j8' on
>>>>> the kernel, or building the RPM for openSUSE's implementation of
>>>>> VirtualBox. The latter is what I'm using for most of my testing.
>>>
>>> Run some CPU stressor and get all your CPU's going at 100% load.
>>> And watch your core temperatures while you do so.
>>
>> for i in 1 2 3 4; do while : ; do : ; done & done
>>
>> triggered the fault in a few minutes.
>>
>>>>>> It also would be good to rule out the thermal throttling (as per
>>>>>> the Srinivas' comments).
>>>
>>> It is almost certainly thermal throttling, or similar causing
>>> Clock modulation, of it seems 50%.
>>
>> While the infinite loops were running, the temps were:
>>
>> finger@linux-1t8h:~/rtlwifi_new> sensors
>> coretemp-isa-
>> Adapter: ISA adapter
>> Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)
>
> It looks like the trip point (high) temperature was exceeded causing
> thermal throttling to kick in.
>
>> After the fault occurs, I get
>>
>> finger@linux-1t8h:~/rtlwifi_new> sensors
>> coretemp-isa-
>> Adapter: ISA adapter
>> Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
>> Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)
>
> So after that it stays at 400 MHz forever, right?
>
>>>>>> For now, please tell me what's in
>>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>>
>>>>> 80
>>>>
>>>> Your effective freq is lower than 800MHz. One of the possible
>>>> reason is thermal throttling.
>>>>
>>>> What distro you are using?
>>>
>>> And what make and model of LapTop?
>>
>> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M
>> CPU @ 2.90GHz. That is a dual-core unit with hyperthreading.
>>
>> @Rafael: As I write this, the system has been running the infinite
>> loop test for almost 5 hours with kernel 4.7. I will leave that
>> running while I'm gone, but I am certain that it is OK.
>
> OK, and what temperatures do you see while doing this?

finger@linux-1t8h:~/linux-2.6> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
Core 0: +90.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +78.0°C (high = +84.0°C, crit = +100.0°C)

Once again, the CPU temp is greater than the "high" value; however, the clock
rate continues to hold near 3600 MHz.

My laptop was inadvertently put to sleep while I was gone. I forgot to leave a
note for my wife and she quieted the noisy cpu fan. :)

Larry
Re: Regression in 4.8 - CPU speed set very low
On Mon, Sep 26, 2016 at 04:28:29PM -0500, Larry Finger wrote:
> Mostly I use a KDE applet named "System load" and look at the "average
> clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
> When the bug triggers, the system gets very slow, and the cpu fan stops even
> though the cpu is still busy.
>
> Commit f7816ad, which had run for 7 days without showing the bug, failed
> after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
> well, that's the way it goes!

Is it possible there is no bug and instead you have a hardware problem?

What I am thinking: CPU fan stops, then CPU gets busy, CPU overheats,
thermal throttling kicks in to protect CPU and it gets VERY slow.

So maybe you have a bad CPU fan that is getting stuck. Perhaps even if
you have a motherboard that varies the CPU fan depending on need and the
fan doesn't like the lowest speed and sometimes gets stuck when asked to
go slow.

Of course if the CPU fan is the problem that could explain why it takes
varying amounts of time to see the problem.

I suggest checking what the cpu temperature sensors are showing next
time it gets slow.

--
Len Sorensen
Re: Regression in 4.8 - CPU speed set very low
On Tue, Sep 27, 2016 at 10:48 AM, Larry Finger wrote:
> On 09/26/2016 10:12 PM, Doug Smythies wrote:
>> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>>> But for both we need a reproducer anyway.
>>>>
>>>> I do not have a reliable reproducer. The condition has always
>>>> happened when running a high-compute job such as a 'make -j8' on
>>>> the kernel, or building the RPM for openSUSE's implementation of
>>>> VirtualBox. The latter is what I'm using for most of my testing.
>>
>> Run some CPU stressor and get all your CPU's going at 100% load.
>> And watch your core temperatures while you do so.
>
> for i in 1 2 3 4; do while : ; do : ; done & done
>
> triggered the fault in a few minutes.
>
>>>>> It also would be good to rule out the thermal throttling (as per
>>>>> the Srinivas' comments).
>>
>> It is almost certainly thermal throttling, or similar causing
>> Clock modulation, of it seems 50%.
>
> While the infinite loops were running, the temps were:
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
> Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)

It looks like the trip point (high) temperature was exceeded causing
thermal throttling to kick in.

> After the fault occurs, I get
>
> finger@linux-1t8h:~/rtlwifi_new> sensors
> coretemp-isa-
> Adapter: ISA adapter
> Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
> Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
> Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)

So after that it stays at 400 MHz forever, right?

>>>>> For now, please tell me what's in
>>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>>
>>>> 80
>>>
>>> Your effective freq is lower than 800MHz. One of the possible reason is
>>> thermal throttling.
>>>
>>> What distro you are using?
>>
>> And what make and model of LapTop?
>
> Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
> 2.90GHz. That is a dual-core unit with hyperthreading.
>
> @Rafael: As I write this, the system has been running the infinite loop test
> for almost 5 hours with kernel 4.7. I will leave that running while I'm
> gone, but I am certain that it is OK.

OK, and what temperatures do you see while doing this?

Thanks,
Rafael
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 10:12 PM, Doug Smythies wrote:
> On 2016.09.26 18:31 Srinivas Pandruvada wrote:
>> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>> But for both we need a reproducer anyway.
>>>
>>> I do not have a reliable reproducer. The condition has always
>>> happened when running a high-compute job such as a 'make -j8' on
>>> the kernel, or building the RPM for openSUSE's implementation of
>>> VirtualBox. The latter is what I'm using for most of my testing.
>
> Run some CPU stressor and get all your CPU's going at 100% load.
> And watch your core temperatures while you do so.

for i in 1 2 3 4; do while : ; do : ; done & done

triggered the fault in a few minutes.

>>>> It also would be good to rule out the thermal throttling (as per
>>>> the Srinivas' comments).
>
> It is almost certainly thermal throttling, or similar causing
> Clock modulation, of it seems 50%.

While the infinite loops were running, the temps were:

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
Core 0: +83.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +74.0°C (high = +84.0°C, crit = +100.0°C)

After the fault occurs, I get

finger@linux-1t8h:~/rtlwifi_new> sensors
coretemp-isa-
Adapter: ISA adapter
Physical id 0: +44.0°C (high = +84.0°C, crit = +100.0°C)
Core 0: +43.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +41.0°C (high = +84.0°C, crit = +100.0°C)

>>>> For now, please tell me what's in
>>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>>
>>> 80
>>
>> Your effective freq is lower than 800MHz. One of the possible reason is
>> thermal throttling.
>>
>> What distro you are using?
>
> And what make and model of LapTop?

Toshiba Tecra A50-A with CPU Model: 6.60.3 "Intel(R) Core(TM) i7-4600M CPU @
2.90GHz". That is a dual-core unit with hyperthreading.

@Rafael: As I write this, the system has been running the infinite loop test
for almost 5 hours with kernel 4.7. I will leave that running while I'm gone,
but I am certain that it is OK.

Larry
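The one-line busy loop above leaves its background jobs running until they are killed by hand. A slightly more controllable variant, sketched below, records the PIDs so the load can be stopped again; the function names start_burn and stop_burn and the BURN_PIDS variable are hypothetical.

```shell
# Start N background busy loops and remember their PIDs.
BURN_PIDS=""

start_burn() {
    n="${1:-4}"          # default of 4 matches the four loops used above
    i=0
    while [ "$i" -lt "$n" ]; do
        ( while : ; do : ; done ) &
        BURN_PIDS="$BURN_PIDS $!"
        i=$(( i + 1 ))
    done
}

# Stop every loop started by start_burn.
stop_burn() {
    for pid in $BURN_PIDS; do
        kill "$pid" 2>/dev/null
        wait "$pid" 2>/dev/null
    done
    BURN_PIDS=""
}

# Possible use:
#   start_burn 4; sleep 600; stop_burn
```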
RE: Regression in 4.8 - CPU speed set very low
On 2016.09.26 18:31 Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>> But for both we need a reproducer anyway.
>>
>> I do not have a reliable reproducer. The condition has always
>> happened when running a high-compute job such as a 'make -j8' on
>> the kernel, or building the RPM for openSUSE's implementation of
>> VirtualBox. The latter is what I'm using for most of my testing.

Run some CPU stressor and get all your CPUs going at 100% load.
And watch your core temperatures while you do so.

>>> It also would be good to rule out the thermal throttling (as per
>>> the Srinivas' comments).

It is almost certainly thermal throttling, or something similar causing
clock modulation of, it seems, 50%.

>>> For now, please tell me what's in
>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>
>> 80
>
> Your effective freq is lower than 800MHz. One of the possible reason is
> thermal throttling.
>
> What distro you are using?

And what make and model of laptop?
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 08:30 PM, Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
>> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
>>> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>>>> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>>>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
>>>>>
>>>>> Maybe it's better to try diagnose the problem instead of spending
>>>>> more time on bisection.
>>>>
>>>> In my original post, I asked for such help, but nothing until today.
>>>> I had no idea what to check, but now I have a better idea.
>>>>
>>>>> I'd like to know whether or not 4.7 was definitely good, though.
>>>>
>>>> I never saw this problem with 4.7, but given the difficulty in
>>>> triggering the problem, my tests may not have been definitive.
>>>>
>>>>>> If it is one of them, it may be a while before I dare call this
>>>>>> one "good".
>>>>>> In one respect, that is good as I will be traveling tomorrow and
>>>>>> Wednesday.
>>>>>
>>>>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver"
>>>>> say?
>>>>
>>>> intel_pstate
>>>
>>> You probably don't need to worry about all of the cpufreq changes in
>>> 4.8-rc, then. Only a few of them affect intel_pstate and I don't see
>>> how any of them may lead to the observed symptoms.
>>>
>>> First off, if you have a reproducer, please run it on 4.7 and see if
>>> you can trigger the issue in there.
>>
>> I'm running 4.8-rc7 at the moment hoping to trigger the problem and get
>> the data requested by Srinivas. Once I get that, I will try 4.7 again.
>>
>>> Second, it would be good to have a look at the output from the
>>> cpu_frequency and pstate_sample tracepoints around when the issue
>>> triggers. The pstate_sample one would be more interesting.
>>>
>>> But for both we need a reproducer anyway.
>>
>> I do not have a reliable reproducer. The condition has always happened
>> when running a high-compute job such as a 'make -j8' on the kernel, or
>> building the RPM for openSUSE's implementation of VirtualBox. The
>> latter is what I'm using for most of my testing.
>>
>>> It also would be good to rule out the thermal throttling (as per
>>> the Srinivas' comments).
>>>
>>> For now, please tell me what's in
>>> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>>
>> 80
>
> Your effective freq is lower than 800MHz. One of the possible reason is
> thermal throttling.
>
> What distro you are using?

openSUSE Leap 42.1.

Larry
Re: Regression in 4.8 - CPU speed set very low
On Mon, 2016-09-26 at 19:48 -0500, Larry Finger wrote:
> On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> > On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> > > On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
> > > > On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
> > > >
> > > > Maybe it's better to try diagnose the problem instead of
> > > > spending more time on bisection.
> > >
> > > In my original post, I asked for such help, but nothing until
> > > today. I had no idea what to check, but now I have a better idea.
> > >
> > > > I'd like to know whether or not 4.7 was definitely good,
> > > > though.
> > >
> > > I never saw this problem with 4.7, but given the difficulty in
> > > triggering the problem, my tests may not have been definitive.
> > >
> > > > > If it is one of them, it may be a while before I dare call
> > > > > this one "good".
> > > > > In one respect, that is good as I will be traveling tomorrow
> > > > > and Wednesday.
> > > >
> > > > What does "cat
> > > > /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
> > >
> > > intel_pstate
> >
> > You probably don't need to worry about all of the cpufreq changes
> > in 4.8-rc, then. Only a few of them affect intel_pstate and I don't
> > see how any of them may lead to the observed symptoms.
> >
> > First off, if you have a reproducer, please run it on 4.7 and see
> > if you can trigger the issue in there.
>
> I'm running 4.8-rc7 at the moment hoping to trigger the problem and
> get the data requested by Srinivas. Once I get that, I will try 4.7
> again.
>
> > Second, it would be good to have a look at the output from the
> > cpu_frequency and pstate_sample tracepoints around when the issue
> > triggers. The pstate_sample one would be more interesting.
> >
> > But for both we need a reproducer anyway.
>
> I do not have a reliable reproducer. The condition has always
> happened when running a high-compute job such as a 'make -j8' on the
> kernel, or building the RPM for openSUSE's implementation of
> VirtualBox. The latter is what I'm using for most of my testing.
>
> > It also would be good to rule out the thermal throttling (as per
> > the Srinivas' comments).
> >
> > For now, please tell me what's in
> > /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq
>
> 80

Your effective freq is lower than 800MHz. One possible reason is
thermal throttling.

What distro are you using?

Thanks,
Srinivas
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 07:21 PM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
>> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
>>>
>>> Maybe it's better to try diagnose the problem instead of spending more
>>> time on bisection.
>>
>> In my original post, I asked for such help, but nothing until today. I
>> had no idea what to check, but now I have a better idea.
>>
>>> I'd like to know whether or not 4.7 was definitely good, though.
>>
>> I never saw this problem with 4.7, but given the difficulty in
>> triggering the problem, my tests may not have been definitive.
>>
>>>> If it is one of them, it may be a while before I dare call this one
>>>> "good".
>>>> In one respect, that is good as I will be traveling tomorrow and
>>>> Wednesday.
>>>
>>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>>
>> intel_pstate
>
> You probably don't need to worry about all of the cpufreq changes in
> 4.8-rc, then. Only a few of them affect intel_pstate and I don't see
> how any of them may lead to the observed symptoms.
>
> First off, if you have a reproducer, please run it on 4.7 and see if
> you can trigger the issue in there.

I'm running 4.8-rc7 at the moment hoping to trigger the problem and get the
data requested by Srinivas. Once I get that, I will try 4.7 again.

> Second, it would be good to have a look at the output from the
> cpu_frequency and pstate_sample tracepoints around when the issue
> triggers. The pstate_sample one would be more interesting.
>
> But for both we need a reproducer anyway.

I do not have a reliable reproducer. The condition has always happened when
running a high-compute job such as a 'make -j8' on the kernel, or building the
RPM for openSUSE's implementation of VirtualBox. The latter is what I'm using
for most of my testing.

> It also would be good to rule out the thermal throttling (as per the
> Srinivas' comments).
>
> For now, please tell me what's in
> /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

80

Larry
Re: Regression in 4.8 - CPU speed set very low
On Tue, Sep 27, 2016 at 1:53 AM, Larry Finger wrote:
> On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
>>
>> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
>>
>> Maybe it's better to try to diagnose the problem instead of spending
>> more time on bisection.
>
> In my original post, I asked for such help, but nothing until today. I had
> no idea what to check, but now I have a better idea.
>
>> I'd like to know whether or not 4.7 was definitely good, though.
>
> I never saw this problem with 4.7, but given the difficulty in triggering
> the problem, my tests may not have been definitive.
>>
>>> If it is one of them, it may be a while before I dare call this one
>>> "good".
>>> In one respect, that is good as I will be traveling tomorrow and
>>> Wednesday.
>>
>> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?
>
> intel_pstate

You probably don't need to worry about all of the cpufreq changes in
4.8-rc, then. Only a few of them affect intel_pstate and I don't see how
any of them may lead to the observed symptoms.

First off, if you have a reproducer, please run it on 4.7 and see if you
can trigger the issue in there.

Second, it would be good to have a look at the output from the
cpu_frequency and pstate_sample tracepoints around when the issue triggers.
The pstate_sample one would be more interesting. But for both we need a
reproducer anyway.

It also would be good to rule out thermal throttling (as per Srinivas'
comments).

For now, please tell me what's in

/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq

Thanks,
Rafael
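For readers following along, the two tracepoints named in the message above can be switched on through tracefs. What follows is only a minimal sketch: it assumes tracefs is reachable at /sys/kernel/debug/tracing (the mount point differs between systems), that the shell runs as root, and the function name enable_power_events is made up for illustration.

```shell
#!/bin/sh
# Sketch only: enable the cpu_frequency and pstate_sample tracepoints.
# Assumption: the argument is the tracefs root, commonly
# /sys/kernel/debug/tracing. Writing to it requires root.

enable_power_events() {
    # $1 = tracefs root directory
    for ev in cpu_frequency pstate_sample; do
        # Both events belong to the "power" trace event group.
        echo 1 > "$1/events/power/$ev/enable" || return 1
    done
}
```

Calling `enable_power_events /sys/kernel/debug/tracing` and then reading `trace_pipe` in the same directory should stream events while the build load runs. pstate_sample is expected to produce output only when intel_pstate is the active driver.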
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 05:16 PM, Rafael J. Wysocki wrote:
> On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
>
> Maybe it's better to try to diagnose the problem instead of spending more
> time on bisection.

In my original post, I asked for such help, but nothing until today. I had
no idea what to check, but now I have a better idea.

> I'd like to know whether or not 4.7 was definitely good, though.

I never saw this problem with 4.7, but given the difficulty in triggering
the problem, my tests may not have been definitive.

>> If it is one of them, it may be a while before I dare call this one
>> "good". In one respect, that is good as I will be traveling tomorrow and
>> Wednesday.
>
> What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?

intel_pstate

Larry
Re: Regression in 4.8 - CPU speed set very low
On Tue, Sep 27, 2016 at 12:09 AM, Larry Finger wrote:
> On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote:
>>
>> On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger wrote:
>>>
>>> On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote:
>>>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[cut]

>>> Mostly I use a KDE applet named "System load" and look at the "average
>>> clock", but the same info is also available in /proc/cpuinfo as "cpu
>>> MHz".
>>> When the bug triggers, the system gets very slow, and the cpu fan stops
>>> even though the cpu is still busy.
>>
>> That sounds like thermal throttling kicking in.
>
> I think it is because the cpu is idling. If thermal throttling is
> responsible, why would it not fail for 168 hours, and then fail in 2?
>
>> What's there under /sys/class/thermal/ on your system?
>
> It contains the following directories:
>
> cooling_device0 cooling_device1 cooling_device2 cooling_device3
> cooling_device4 thermal_zone0 thermal_zone1
>>
>>> Commit f7816ad, which had run for 7 days without showing the bug, failed
>>> after about 2 hours today. All my testing since Sept. 9 has been wasted.
>>> Oh well, that's the way it goes!
>>
>> Are you confident that the issue was not reproducible before 4.8-rc2?
>> In particular, what about 4.8-rc1?
>
> 4.8-rc1 is definitely bad. I am now testing commit 5539204. In the bisect
> visualization, there are a number of cpufreq commits before the test case.

Maybe it's better to try to diagnose the problem instead of spending more
time on bisection.

I'd like to know whether or not 4.7 was definitely good, though.

> If it is one of them, it may be a while before I dare call this one "good".
> In one respect, that is good as I will be traveling tomorrow and Wednesday.

What does "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_driver" say?

Thanks,
Rafael
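The directory listing quoted above shows which thermal zones exist but not what they report. A rough sketch of how the zone type, current temperature and trip points could be collected in one pass is given here. It assumes the usual sysfs attribute names (type, temp, trip_point_N_temp, with temperatures in millidegrees Celsius), and dump_thermal_zones is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch only: summarise each thermal zone under a thermal class directory.
# Assumption: zones expose "type", "temp" and "trip_point_N_temp" files,
# with temperatures in millidegrees Celsius.

dump_thermal_zones() {
    # $1 = thermal class directory (normally /sys/class/thermal)
    for zone in "$1"/thermal_zone*; do
        [ -d "$zone" ] || continue
        type=$(cat "$zone/type" 2>/dev/null)
        temp=$(cat "$zone/temp" 2>/dev/null)
        echo "${zone##*/} type=$type temp_mC=$temp"
        # List every trip point temperature that is readable.
        for trip in "$zone"/trip_point_*_temp; do
            [ -r "$trip" ] || continue
            echo "  ${trip##*/}=$(cat "$trip" 2>/dev/null)"
        done
    done
}
```

Running `dump_thermal_zones /sys/class/thermal` once while the system is healthy and once after the slowdown would give two snapshots to compare.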
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 04:46 PM, Srinivas Pandruvada wrote: On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote: On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger wrote: On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote: On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote: On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote: On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote: On 09/18/2016 09:54 PM, Larry Finger wrote: On 09/14/2016 11:00 AM, Larry Finger wrote: On 09/09/2016 12:39 PM, Larry Finger wrote: I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo -- cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens. If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug. I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug. Testing continues. And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. 
It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4,8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit. My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run. 581e0cd is not a valid mainline commit hash AFAICS. That was a typo. The correct value is 581e0c7. What cpufreq driver do you use? My "Default CPUFreq governor" is on demand. Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in CONFIG_ACPI_CPU_FREQ_PSS=y CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_GOV_ATTR_SET=y CONFIG_CPU_FREQ_GOV_COMMON=y # CONFIG_CPU_FREQ_STAT is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=m CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set CONFIG_X86_PCC_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ_CPB=y Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going wrong. Further tests have shown that commit 351a4ded is bad. Once again, by bisection seems to be converging to a set of commits that seem unlikely to cause this problem. Perhaps commit f7816ad is not really good even though it survived 7 days of heavy CPU usage. 
I have been reluctant to post my entire .config on the list. It is available at http://pastebin.com/aMZaAKwL. If the governor is ondemand, the driver is acpi-cpufreq, most likely. How do you measure the frequency? Mostly I use a KDE applet named "System load" and look at the "average clock", but the same info is also available in /proc/cpuinfo as "cpu MHz". When the bug triggers, the system gets very slow, and the cpu fan stops even though the cpu is still busy. That sounds like thermal throttling kicking in. This will help to know, if there is thermal throttle from OS. # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq # grep -r . /sys/class/thermal/thermal_zone* With the system OK, I get finger@linux-1t8h:~/wireless-drivers-next> cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq 360 360 360 360 finger@linux-1t8h:~/wireless-drivers-next> grep -r . /sys/class/thermal/thermal_zone* grep: /sys/class/thermal/thermal_zone0/k_d: Input/output error grep: /sys/class/thermal/thermal_zone0/k_i: Input/output error grep: /sys/class/thermal/thermal_zone0/k_po: Inp
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 04:37 PM, Rafael J. Wysocki wrote: On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger wrote: On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote: On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote: On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote: On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote: On 09/18/2016 09:54 PM, Larry Finger wrote: On 09/14/2016 11:00 AM, Larry Finger wrote: On 09/09/2016 12:39 PM, Larry Finger wrote: I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens. If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug. I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug. Testing continues. And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. 
I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4,8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit. My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run. 581e0cd is not a valid mainline commit hash AFAICS. That was a typo. The correct value is 581e0c7. What cpufreq driver do you use? My "Default CPUFreq governor" is on demand. Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in CONFIG_ACPI_CPU_FREQ_PSS=y CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_GOV_ATTR_SET=y CONFIG_CPU_FREQ_GOV_COMMON=y # CONFIG_CPU_FREQ_STAT is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=m CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set CONFIG_X86_PCC_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ_CPB=y Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going wrong. Further tests have shown that commit 351a4ded is bad. Once again, by bisection seems to be converging to a set of commits that seem unlikely to cause this problem. Perhaps commit f7816ad is not really good even though it survived 7 days of heavy CPU usage. I have been reluctant to post my entire .config on the list. It is available at http://pastebin.com/aMZaAKwL. If the governor is ondemand, the driver is acpi-cpufreq, most likely. 
How do you measure the frequency? Mostly I use a KDE applet named "System load" and look at the "average clock", but the same info is also available in /proc/cpuinfo as "cpu MHz". When the bug triggers, the system gets very slow, and the cpu fan stops even though the cpu is still busy. That sounds like thermal throttling kicking in. I think it is because the cpu is idling. If a thermal throttling is responsible, why would it not fail for 168 hours, and then fail in 2? What's there under /sys/class/thermal/ on your system? It contains the following directories: cooling_device0 cooling_device1 cooling_device2 cooling_device3 cooling_device4 thermal_zone0 thermal_zone1 Commit f7816ad, which had run for 7 days without showing the bug, failed after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh well, that's the way it goes! Are you confident that the issue was not reproducible before 4.8-rc2? In particular, what about 4.8-rc1? 4.8-rc1 is definitely bad. I
Re: Regression in 4.8 - CPU speed set very low
On Mon, Sep 26, 2016 at 11:41 PM, Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
>> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada wrote:
>> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>> >
>> > [...]
>> >
>> > > > I have been reluctant to post my entire .config on the list. It is
>> > > > available at
>> > > > http://pastebin.com/aMZaAKwL.
>> > >
>> > > If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>> > >
>> > > How do you measure the frequency?
>> > >
>> > Also
>> > When you get into this situation, please dump:
>> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
>> > # cat /sys/devices/system/cpu/intel_pstate/*
>>
>> The driver is not intel_pstate.
> I guessed from
> CONFIG_X86_INTEL_PSTATE=y
> and
> Frequency is not 400 but something like 396.130

Ah. Good catch!
Re: Regression in 4.8 - CPU speed set very low
On Mon, 2016-09-26 at 23:37 +0200, Rafael J. Wysocki wrote: > On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger > wrote: > > > > On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote: > > > > > > > > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote: > > > > > > > > > > > > On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote: > > > > > > > > > > > > > > > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote: > > > > > > > > > > > > > > > > > > On 09/18/2016 09:54 PM, Larry Finger wrote: > > > > > > > > > > > > > > > > > > > > > On 09/14/2016 11:00 AM, Larry Finger wrote: > > > > > > > > > > > > > > > > > > > > > > > > On 09/09/2016 12:39 PM, Larry Finger wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > I have found a regression in kernel 4.8-rc2 that > > > > > > > > > causes the speed of > > > > > > > > > my laptop > > > > > > > > > with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to > > > > > > > > > suddenly have a > > > > > > > > > maximum cpu > > > > > > > > > frequency of ~400 MHz. Unfortunately, I do not know > > > > > > > > > how to trigger > > > > > > > > > this problem, > > > > > > > > > thus a bisection is not possible. It usually happens > > > > > > > > > under heavy > > > > > > > > > load, such as a > > > > > > > > > kernel build or the RPM build of VirtualBox, but it > > > > > > > > > does not always > > > > > > > > > fail with > > > > > > > > > these loads. In my most recent failure, 'hwinfo -- > > > > > > > > > cpu' reports cpu > > > > > > > > > MHz of > > > > > > > > > 396.130 for #3. The bogomips value is 5787.73, and > > > > > > > > > the cpu clock > > > > > > > > > before the > > > > > > > > > fault is 3437 MHz. Nothing is logged when this > > > > > > > > > happens. 
> > > > > > > > > > > > > > > > > > If I were to get a patch that would show a backtrace > > > > > > > > > when the > > > > > > > > > maximum CPU > > > > > > > > > frequency is changed, perhaps it would be possible to > > > > > > > > > track this > > > > > > > > > bug. > > > > > > > > > > > > > > > > > > > > > > > > I have not yet found the bad commit, but I have reduced > > > > > > > > the range of > > > > > > > > commits a > > > > > > > > bit. This bug has been difficult to trigger. So far, it > > > > > > > > has not taken > > > > > > > > over 1/2 > > > > > > > > day to appear in bad kernels, thus I am allowing three > > > > > > > > days before > > > > > > > > deciding that > > > > > > > > a given trial is good. I never saw the problem with 4.7 > > > > > > > > kernels, but > > > > > > > > I did in > > > > > > > > 4.8-rc1. I also know that it appeared before commit > > > > > > > > 581e0cd. Commit > > > > > > > > 1b05cf6 did > > > > > > > > not show the bug. > > > > > > > > > > > > > > > > Testing continues. > > > > > > > > > > > > > > > > > > > > > And still does. My bisection seemed to be trending toward > > > > > > > an > > > > > > > improbable set of > > > > > > > commits, and I needed to do some other work with the > > > > > > > machine, thus I > > > > > > > started > > > > > > > running 4.8-rc6. It failed nearly 48 hours after the > > > > > > > reboot, which > > > > > > > indicated > > > > > > > that using 3 days to indicate a "good" trial was likely > > > > > > > too short. I > > > > > > > am > > > > > > > currently testing the first of the trial and will run it > > > > > > > for at least > > > > > > > a week. It > > > > > > > is unlikely that these tests will be complete before 4,8 > > > > > > > is released, > > > > > > > even if > > > > > > > -rc8 is needed. I will keep attempting to find the faulty > > > > > > > commit. > > > > > > > > > > > > > > > > > > My debugging continues. 
After 7 days of beating on commit > > > > > > f7816ad, I > > > > > > have > > > > > > concluded that it is likely good. Thus I think the bug lies > > > > > > between > > > > > > commit > > > > > > 581e0cd (bad) and f7816ad (good). I will need to do a long > > > > > > test on > > > > > > commit > > > > > > 1b05cf6, which did not fail with a shorter run. > > > > > > > > > > > > > > > 581e0cd is not a valid mainline commit hash AFAICS. > > > > > > > > > > > > That was a typo. The correct value is 581e0c7. > > > > > > > > > > > > > > > > > > > > What cpufreq driver do you use? > > > > > > > > > > > > My "Default CPUFreq governor" is on demand. > > > > > > > > Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' > > > > results in > > > > > > > > CONFIG_ACPI_CPU_FREQ_PSS=y > > > > CONFIG_CPU_FREQ=y > > > > CONFIG_CPU_FREQ_GOV_ATTR_SET=y > > > > CONFIG_CPU_FREQ_GOV_COMMON=y > > > > # CONFIG_CPU_FREQ_STAT is not set > > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set > > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set > > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set > > > > CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y > > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set > > > > # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set > > > > CONFIG_CPU_FREQ_GOV_PERFO
Re: Regression in 4.8 - CPU speed set very low
On Mon, 2016-09-26 at 23:30 +0200, Rafael J. Wysocki wrote:
> On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada wrote:
> > On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> > > On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> >
> > [...]
> >
> > > > I have been reluctant to post my entire .config on the list. It is
> > > > available at
> > > > http://pastebin.com/aMZaAKwL.
> > >
> > > If the governor is ondemand, the driver is acpi-cpufreq, most likely.
> > >
> > > How do you measure the frequency?
> > >
> > Also
> > When you get into this situation, please dump:
> > # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> > # cat /sys/devices/system/cpu/intel_pstate/*
>
> The driver is not intel_pstate.

I guessed from
CONFIG_X86_INTEL_PSTATE=y
and
Frequency is not 400 but something like 396.130

Thanks,
Srinivas
Re: Regression in 4.8 - CPU speed set very low
On Mon, Sep 26, 2016 at 11:28 PM, Larry Finger wrote: > On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote: >> >> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote: >>> >>> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote: On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote: > > On 09/18/2016 09:54 PM, Larry Finger wrote: >> >> On 09/14/2016 11:00 AM, Larry Finger wrote: >>> >>> On 09/09/2016 12:39 PM, Larry Finger wrote: I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens. If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug. >>> >>> >>> I have not yet found the bad commit, but I have reduced the range of >>> commits a >>> bit. This bug has been difficult to trigger. So far, it has not taken >>> over 1/2 >>> day to appear in bad kernels, thus I am allowing three days before >>> deciding that >>> a given trial is good. I never saw the problem with 4.7 kernels, but >>> I did in >>> 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit >>> 1b05cf6 did >>> not show the bug. >>> >>> Testing continues. >> >> >> And still does. My bisection seemed to be trending toward an >> improbable set of >> commits, and I needed to do some other work with the machine, thus I >> started >> running 4.8-rc6. 
It failed nearly 48 hours after the reboot, which >> indicated >> that using 3 days to indicate a "good" trial was likely too short. I >> am >> currently testing the first of the trial and will run it for at least >> a week. It >> is unlikely that these tests will be complete before 4,8 is released, >> even if >> -rc8 is needed. I will keep attempting to find the faulty commit. > > > My debugging continues. After 7 days of beating on commit f7816ad, I > have > concluded that it is likely good. Thus I think the bug lies between > commit > 581e0cd (bad) and f7816ad (good). I will need to do a long test on > commit > 1b05cf6, which did not fail with a shorter run. 581e0cd is not a valid mainline commit hash AFAICS. >>> >>> >>> That was a typo. The correct value is 581e0c7. What cpufreq driver do you use? >>> >>> >>> My "Default CPUFreq governor" is on demand. >>> >>> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in >>> >>> CONFIG_ACPI_CPU_FREQ_PSS=y >>> CONFIG_CPU_FREQ=y >>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y >>> CONFIG_CPU_FREQ_GOV_COMMON=y >>> # CONFIG_CPU_FREQ_STAT is not set >>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set >>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set >>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set >>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y >>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set >>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set >>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y >>> CONFIG_CPU_FREQ_GOV_POWERSAVE=m >>> CONFIG_CPU_FREQ_GOV_USERSPACE=m >>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y >>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m >>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set >>> CONFIG_X86_PCC_CPUFREQ=m >>> CONFIG_X86_ACPI_CPUFREQ=m >>> CONFIG_X86_ACPI_CPUFREQ_CPB=y >>> >>> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up >>> going >>> wrong. Further tests have shown that commit 351a4ded is bad. 
Once again, >>> by >>> bisection seems to be converging to a set of commits that seem unlikely >>> to cause >>> this problem. Perhaps commit f7816ad is not really good even though it >>> survived >>> 7 days of heavy CPU usage. >>> >>> I have been reluctant to post my entire .config on the list. It is >>> available at >>> http://pastebin.com/aMZaAKwL. >> >> >> If the governor is ondemand, the driver is acpi-cpufreq, most likely. >> >> How do you measure the frequency? > > > Mostly I use a KDE applet named "System load" and look at the "average > clock", but the same info is also available in /proc/cpuinfo as "cpu MHz". > When the bug triggers, the system gets very slow, and the cpu fan stops even > though the cpu is still busy. That sounds like thermal throttling kicking in. What's the
Re: Regression in 4.8 - CPU speed set very low
On Mon, Sep 26, 2016 at 11:26 PM, Srinivas Pandruvada wrote:
> On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
>> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
>
> [...]
>
>> > I have been reluctant to post my entire .config on the list. It is
>> > available at
>> > http://pastebin.com/aMZaAKwL.
>>
>> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>>
>> How do you measure the frequency?
>>
> Also
> When you get into this situation, please dump:
> # cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
> # cat /sys/devices/system/cpu/intel_pstate/*

The driver is not intel_pstate.

Thanks,
Rafael
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 04:06 PM, Rafael J. Wysocki wrote: On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote: On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote: On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote: On 09/18/2016 09:54 PM, Larry Finger wrote: On 09/14/2016 11:00 AM, Larry Finger wrote: On 09/09/2016 12:39 PM, Larry Finger wrote: I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens. If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug. I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug. Testing continues. And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. 
It is unlikely that these tests will be complete before 4,8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit. My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run. 581e0cd is not a valid mainline commit hash AFAICS. That was a typo. The correct value is 581e0c7. What cpufreq driver do you use? My "Default CPUFreq governor" is on demand. Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in CONFIG_ACPI_CPU_FREQ_PSS=y CONFIG_CPU_FREQ=y CONFIG_CPU_FREQ_GOV_ATTR_SET=y CONFIG_CPU_FREQ_GOV_COMMON=y # CONFIG_CPU_FREQ_STAT is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set CONFIG_CPU_FREQ_GOV_PERFORMANCE=y CONFIG_CPU_FREQ_GOV_POWERSAVE=m CONFIG_CPU_FREQ_GOV_USERSPACE=m CONFIG_CPU_FREQ_GOV_ONDEMAND=y CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set CONFIG_X86_PCC_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ=m CONFIG_X86_ACPI_CPUFREQ_CPB=y Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going wrong. Further tests have shown that commit 351a4ded is bad. Once again, by bisection seems to be converging to a set of commits that seem unlikely to cause this problem. Perhaps commit f7816ad is not really good even though it survived 7 days of heavy CPU usage. I have been reluctant to post my entire .config on the list. It is available at http://pastebin.com/aMZaAKwL. If the governor is ondemand, the driver is acpi-cpufreq, most likely. How do you measure the frequency? 

Mostly I use a KDE applet named "System load" and look at the "average
clock", but the same info is also available in /proc/cpuinfo as "cpu MHz".
When the bug triggers, the system gets very slow, and the cpu fan stops even
though the cpu is still busy.

Commit f7816ad, which had run for 7 days without showing the bug, failed
after about 2 hours today. All my testing since Sept. 9 has been wasted. Oh
well, that's the way it goes!

Thanks,

Larry
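Since the original report notes that nothing is logged when the slowdown happens, a small poller over the "cpu MHz" lines of /proc/cpuinfo could at least timestamp the event. This is a sketch under assumptions: the helper names max_cpu_mhz and log_if_slow are invented here, the threshold is whatever the caller picks, and whether "cpu MHz" reflects the slowdown on a given machine is not guaranteed. The highest per-CPU value is used because idle cores may legitimately report low clocks.

```shell
#!/bin/sh
# Sketch only: report when the highest "cpu MHz" value drops below a
# caller-chosen threshold. Helper names are hypothetical.

max_cpu_mhz() {
    # $1 = cpuinfo file (normally /proc/cpuinfo)
    awk -F: '/^cpu MHz/ {
                 v = $2 + 0
                 if (!seen || v > max) { max = v; seen = 1 }
             }
             END { if (seen) print max }' "$1"
}

log_if_slow() {
    # $1 = cpuinfo file, $2 = threshold in MHz
    # Returns 0 and prints a timestamped line if below the threshold,
    # 1 if not, 2 if no "cpu MHz" line was found.
    max=$(max_cpu_mhz "$1")
    [ -n "$max" ] || return 2
    if awk -v m="$max" -v t="$2" 'BEGIN { exit !(m < t) }'; then
        echo "$(date '+%F %T') highest cpu MHz is $max, below $2"
    else
        return 1
    fi
}
```

A loop such as `while :; do log_if_slow /proc/cpuinfo 1000 >> freq.log; sleep 10; done`, left running during a kernel build, would record roughly when the clock collapsed; the 1000 MHz threshold and the log file name are arbitrary choices.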
Re: Regression in 4.8 - CPU speed set very low
On Mon, 2016-09-26 at 23:06 +0200, Rafael J. Wysocki wrote:
> On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:

[...]

> > I have been reluctant to post my entire .config on the list. It is
> > available at
> > http://pastebin.com/aMZaAKwL.
>
> If the governor is ondemand, the driver is acpi-cpufreq, most likely.
>
> How do you measure the frequency?
>
Also
When you get into this situation, please dump:
# cat /sys/devices/system/cpu/cpufreq/policy?/scaling_max_freq
# cat /sys/devices/system/cpu/intel_pstate/*

Thanks,
Srinivas
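The two dumps requested above can be gathered with labels so that healthy and failing snapshots are easy to compare. The following is a sketch, not a tool anyone in the thread used: dump_cpufreq is a hypothetical name, and the set of per-policy files printed is an assumption about which attributes are worth recording.

```shell
#!/bin/sh
# Sketch only: print selected cpufreq policy attributes and, if present,
# every readable file under intel_pstate, each prefixed with its name.

dump_cpufreq() {
    # $1 = CPU sysfs root (normally /sys/devices/system/cpu)
    for policy in "$1"/cpufreq/policy*; do
        [ -d "$policy" ] || continue
        for f in scaling_driver scaling_governor scaling_cur_freq \
                 scaling_max_freq cpuinfo_min_freq; do
            [ -r "$policy/$f" ] && echo "${policy##*/}/$f: $(cat "$policy/$f")"
        done
    done
    # This directory only exists when the intel_pstate driver is loaded.
    for f in "$1"/intel_pstate/*; do
        [ -f "$f" ] && [ -r "$f" ] && echo "intel_pstate/${f##*/}: $(cat "$f")"
    done
    return 0
}
```

For example, `dump_cpufreq /sys/devices/system/cpu > cpufreq-ok.txt` before the fault and the same command with another file name afterwards would allow a plain `diff` of the two states.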
Re: Regression in 4.8 - CPU speed set very low
On Monday, September 26, 2016 11:15:45 AM Larry Finger wrote:
> On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
> > On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> >> On 09/18/2016 09:54 PM, Larry Finger wrote:
> >>> On 09/14/2016 11:00 AM, Larry Finger wrote:
> >>>> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>>>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.
> >>>>>
> >>>>> If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.
> >>>>
> >>>> I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.
> >>>>
> >>>> Testing continues.
> >>>
> >>> And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4.8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit.
> >>
> >> My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run.
> >
> > 581e0cd is not a valid mainline commit hash AFAICS.
>
> That was a typo. The correct value is 581e0c7.
> >
> > What cpufreq driver do you use?
>
> My "Default CPUFreq governor" is on demand.
>
> Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in
>
> CONFIG_ACPI_CPU_FREQ_PSS=y
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> CONFIG_CPU_FREQ_GOV_COMMON=y
> # CONFIG_CPU_FREQ_STAT is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=m
> CONFIG_CPU_FREQ_GOV_USERSPACE=m
> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
> CONFIG_X86_PCC_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>
> Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going wrong. Further tests have shown that commit 351a4ded is bad. Once again, my bisection seems to be converging to a set of commits that seem unlikely to cause this problem. Perhaps commit f7816ad is not really good even though it survived 7 days of heavy CPU usage.
>
> I have been reluctant to post my entire .config on the list. It is available at
> http://pastebin.com/aMZaAKwL.

If the governor is ondemand, the driver is acpi-cpufreq, most likely.

How do you measure the frequency?

Thanks,
Rafael
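For reference, both questions above (which cpufreq driver is in use, and what frequency the kernel currently reports) can usually be answered from the cpufreq sysfs files. The following is only a minimal sketch, assuming the standard /sys/devices/system/cpu/cpuN/cpufreq layout; the helper names read_cpufreq and khz_to_mhz are illustrative and do not come from this thread.

```python
from pathlib import Path

# Attribute files normally present in a cpufreq policy directory.
CPUFREQ_ATTRS = (
    "scaling_driver",
    "scaling_governor",
    "scaling_cur_freq",
    "scaling_max_freq",
    "cpuinfo_max_freq",
)


def read_cpufreq(cpu_dir):
    """Return the cpufreq attributes found under <cpu_dir>/cpufreq.

    Attributes that are missing or unreadable are reported as None.
    """
    base = Path(cpu_dir) / "cpufreq"
    info = {}
    for name in CPUFREQ_ATTRS:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            info[name] = None
    return info


def khz_to_mhz(value):
    """Convert a sysfs frequency string (kHz) to MHz, or None if unusable."""
    try:
        return int(value) / 1000.0
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    # Print driver, governor, and current/maximum frequency for each CPU.
    for cpu_dir in sorted(Path("/sys/devices/system/cpu").glob("cpu[0-9]*")):
        info = read_cpufreq(cpu_dir)
        print(
            cpu_dir.name,
            info["scaling_driver"],
            info["scaling_governor"],
            khz_to_mhz(info["scaling_cur_freq"]),
            khz_to_mhz(info["scaling_max_freq"]),
        )
```

Comparing scaling_max_freq with cpuinfo_max_freq may help distinguish a lowered policy limit from a CPU that is simply running slowly, although the thread does not say which of the two is happening here.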
Re: Regression in 4.8 - CPU speed set very low
On 09/26/2016 06:37 AM, Rafael J. Wysocki wrote:
On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
On 09/18/2016 09:54 PM, Larry Finger wrote:
On 09/14/2016 11:00 AM, Larry Finger wrote:
On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.

I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.

Testing continues.

And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4.8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit.

My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run.

581e0cd is not a valid mainline commit hash AFAICS.

That was a typo. The correct value is 581e0c7.

What cpufreq driver do you use?

My "Default CPUFreq governor" is on demand.

Running the command 'egrep -r "CPU_FREQ|CPUFREQ" .config' results in

CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=m
CONFIG_CPU_FREQ_GOV_USERSPACE=m
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=m
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
CONFIG_X86_PCC_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ=m
CONFIG_X86_ACPI_CPUFREQ_CPB=y

Commit 1b05cf6 did fail on longer testing, thus my bisection had ended up going wrong. Further tests have shown that commit 351a4ded is bad. Once again, my bisection seems to be converging to a set of commits that seem unlikely to cause this problem. Perhaps commit f7816ad is not really good even though it survived 7 days of heavy CPU usage.

I have been reluctant to post my entire .config on the list. It is available at http://pastebin.com/aMZaAKwL.

Larry
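The egrep output in the message above can be read mechanically. The sketch below parses lines in that format and picks out the default governor; it is only an illustration, and the function names parse_kconfig and default_governor are made up for this example rather than taken from any kernel tool.

```python
NOT_SET_SUFFIX = " is not set"


def parse_kconfig(lines):
    """Map option names to 'y', 'm', another value, or 'n' for unset options."""
    opts = {}
    for raw in lines:
        line = raw.strip()
        if line.startswith("# CONFIG_") and line.endswith(NOT_SET_SUFFIX):
            # "# CONFIG_FOO is not set" means the option is disabled.
            opts[line[2:-len(NOT_SET_SUFFIX)]] = "n"
        elif line.startswith("CONFIG_") and "=" in line:
            name, value = line.split("=", 1)
            opts[name] = value
    return opts


def default_governor(opts):
    """Return the lower-cased name of the built-in default governor, if any."""
    prefix = "CONFIG_CPU_FREQ_DEFAULT_GOV_"
    for name, value in opts.items():
        if name.startswith(prefix) and value == "y":
            return name[len(prefix):].lower()
    return None


if __name__ == "__main__":
    # A few lines copied from the listing in the message above.
    sample = [
        "# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set",
        "CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y",
        "CONFIG_X86_PCC_CPUFREQ=m",
        "CONFIG_X86_ACPI_CPUFREQ=m",
    ]
    opts = parse_kconfig(sample)
    print(default_governor(opts), opts["CONFIG_X86_ACPI_CPUFREQ"])
```

Note that the .config only shows what was built: both CONFIG_X86_ACPI_CPUFREQ and CONFIG_X86_PCC_CPUFREQ are modules here, so the listing alone does not say which driver is actually bound at run time.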
Re: Regression in 4.8 - CPU speed set very low
On Friday, September 23, 2016 09:45:09 PM Larry Finger wrote:
> On 09/18/2016 09:54 PM, Larry Finger wrote:
> > On 09/14/2016 11:00 AM, Larry Finger wrote:
> >> On 09/09/2016 12:39 PM, Larry Finger wrote:
> >>> I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.
> >>>
> >>> If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.
> >>
> >> I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.
> >>
> >> Testing continues.
> >
> > And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4.8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit.
>
> My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run.

581e0cd is not a valid mainline commit hash AFAICS.

What cpufreq driver do you use?

Thanks,
Rafael
Re: Regression in 4.8 - CPU speed set very low
On 09/18/2016 09:54 PM, Larry Finger wrote:
On 09/14/2016 11:00 AM, Larry Finger wrote:
On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.

I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.

Testing continues.

And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4.8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit.

My debugging continues. After 7 days of beating on commit f7816ad, I have concluded that it is likely good. Thus I think the bug lies between commit 581e0cd (bad) and f7816ad (good). I will need to do a long test on commit 1b05cf6, which did not fail with a shorter run.

Larry
Re: Regression in 4.8 - CPU speed set very low
On 09/14/2016 11:00 AM, Larry Finger wrote:
On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.

I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.

Testing continues.

And still does. My bisection seemed to be trending toward an improbable set of commits, and I needed to do some other work with the machine, thus I started running 4.8-rc6. It failed nearly 48 hours after the reboot, which indicated that using 3 days to indicate a "good" trial was likely too short. I am currently testing the first of the trial and will run it for at least a week. It is unlikely that these tests will be complete before 4.8 is released, even if -rc8 is needed. I will keep attempting to find the faulty commit.

Larry
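The concern that three days may be too short for a "good" verdict can be made roughly quantitative. The sketch below assumes, purely for illustration, that the time to failure on a bad kernel is exponentially distributed (a memoryless model); nothing in the thread establishes that, and the mean time to failure used in the example is a hypothetical input, not a measured value.

```python
import math


def false_good_probability(trial_hours, mean_hours_to_failure):
    """Chance that a bad kernel survives the whole trial without failing.

    Assumes an exponential time to failure with the given mean.
    """
    return math.exp(-trial_hours / mean_hours_to_failure)


def required_trial_hours(mean_hours_to_failure, max_false_good):
    """Shortest trial that keeps the false-good chance at or below the limit."""
    return -mean_hours_to_failure * math.log(max_false_good)


if __name__ == "__main__":
    mean = 48.0  # hypothetical mean time to failure, in hours
    for days in (3, 7):
        p = false_good_probability(days * 24, mean)
        print(f"{days}-day trial: false-good probability {p:.3f}")
    print(f"hours needed for 5%: {required_trial_hours(mean, 0.05):.0f}")
```

Under this model a single misjudged "good" commit is enough to send a bisection toward unrelated commits, which would be consistent with the bisection drifting toward an improbable set; the model itself remains an assumption.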
Re: Regression in 4.8 - CPU speed set very low
On 09/09/2016 12:39 PM, Larry Finger wrote:

I have found a regression in kernel 4.8-rc2 that causes the speed of my laptop with an Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz to suddenly have a maximum cpu frequency of ~400 MHz. Unfortunately, I do not know how to trigger this problem, thus a bisection is not possible. It usually happens under heavy load, such as a kernel build or the RPM build of VirtualBox, but it does not always fail with these loads. In my most recent failure, 'hwinfo --cpu' reports cpu MHz of 396.130 for #3. The bogomips value is 5787.73, and the cpu clock before the fault is 3437 MHz. Nothing is logged when this happens.

If I were to get a patch that would show a backtrace when the maximum CPU frequency is changed, perhaps it would be possible to track this bug.

I have not yet found the bad commit, but I have reduced the range of commits a bit. This bug has been difficult to trigger. So far, it has not taken over 1/2 day to appear in bad kernels, thus I am allowing three days before deciding that a given trial is good. I never saw the problem with 4.7 kernels, but I did in 4.8-rc1. I also know that it appeared before commit 581e0cd. Commit 1b05cf6 did not show the bug.

Testing continues.

Larry
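Since nothing is logged when the slowdown happens, one low-tech way to catch it is to poll the per-CPU "cpu MHz" values. The sketch below parses text in the x86 /proc/cpuinfo format and lists CPUs below a threshold. Whether 'hwinfo --cpu' takes its numbers from the same place is not stated in the thread, and the 800 MHz threshold and function names are hypothetical choices for this example.

```python
def parse_cpu_mhz(cpuinfo_text):
    """Return {processor number: reported MHz} from /proc/cpuinfo-style text."""
    speeds = {}
    current = None
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "processor":
            current = int(value)
        elif key == "cpu MHz" and current is not None:
            speeds[current] = float(value)
    return speeds


def slow_cpus(speeds, threshold_mhz):
    """Return the sorted processor numbers reporting less than threshold_mhz."""
    return sorted(cpu for cpu, mhz in speeds.items() if mhz < threshold_mhz)


if __name__ == "__main__":
    with open("/proc/cpuinfo") as f:
        speeds = parse_cpu_mhz(f.read())
    # 800 MHz is an arbitrary cut-off chosen for illustration only.
    print(speeds, slow_cpus(speeds, 800.0))
```

Running this from a periodic job and recording the first time the list is non-empty would at least timestamp the event, which could then be matched against load or thermal data.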