Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-19 Thread Borislav Petkov
On Fri, Oct 18, 2019 at 01:38:32PM -0700, Luck, Tony wrote:
> Sorry to have caused confusion.
Ditto. But us causing confusion is fine - this way we can talk about what we really wanna do! :-)))
> The thoughts behind that statement are that we currently have an issue
> with too many noisy high

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Luck, Tony
On Fri, Oct 18, 2019 at 09:45:03PM +0200, Borislav Petkov wrote: > On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote: > > So what should we do next? > > I was simply keying off this statement of yours: > > "Depending on what we end up with from Srinivas ... we may want to > reconsider

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Borislav Petkov
On Fri, Oct 18, 2019 at 11:02:57AM -0700, Luck, Tony wrote:
> So what should we do next?
I was simply keying off this statement of yours: "Depending on what we end up with from Srinivas ... we may want to reconsider the severity." and I don't think that having KERN_CRIT severity for those

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Borislav Petkov
On Fri, Oct 18, 2019 at 08:55:17AM -0700, Srinivas Pandruvada wrote:
> I assume that someone is having performance issues or occasion reboots,
> look at the logs. Is it a fair assumption?
Yes, that is a valid use case IMO.
> But if a system is running at up to 87.5% of duty cycle on top of
>

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Luck, Tony
On Fri, Oct 18, 2019 at 03:23:09PM +0200, Borislav Petkov wrote: > On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote: > > Server/desktops generally rely on the embedded controller for FAN > > control, which kernel have no control. For them this warning helps to > > either bring

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Srinivas Pandruvada
On Fri, 2019-10-18 at 15:23 +0200, Borislav Petkov wrote: > On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote: > > Server/desktops generally rely on the embedded controller for FAN > > control, which kernel have no control. For them this warning helps > > to > > either bring in

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Borislav Petkov
On Fri, Oct 18, 2019 at 05:26:36AM -0700, Srinivas Pandruvada wrote:
> Server/desktops generally rely on the embedded controller for FAN
> control, which kernel have no control. For them this warning helps to
> either bring in additional cooling or fix existing cooling.
How exactly does this

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Srinivas Pandruvada
On Thu, 2019-10-17 at 23:44 +0200, Borislav Petkov wrote: > On Thu, Oct 17, 2019 at 09:31:30PM +, Luck, Tony wrote: > > That sounds like the right short term action. > > > > Depending on what we end up with from Srinivas ... we may want > > to reconsider the severity. The basic premise of

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Peter Zijlstra
On Thu, Oct 17, 2019 at 11:44:45PM +0200, Borislav Petkov wrote: > On Thu, Oct 17, 2019 at 09:31:30PM +, Luck, Tony wrote: > > That sounds like the right short term action. > > > > Depending on what we end up with from Srinivas ... we may want > > to reconsider the severity. The basic

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-18 Thread Borislav Petkov
On Thu, Oct 17, 2019 at 11:53:18PM +, Luck, Tony wrote: > > * we throttle the machine from within the kernel - whatever that may mean > > * if that doesn't help, we stop scheduling !root tasks > > * if that doesn't help, we halt > > The silicon will do that "halt" step all by itself if the

RE: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-17 Thread Luck, Tony
> * we throttle the machine from within the kernel - whatever that may mean > * if that doesn't help, we stop scheduling !root tasks > * if that doesn't help, we halt The silicon will do that "halt" step all by itself if the temperature continues to rise and hits the highest of the temperature

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-17 Thread Borislav Petkov
On Thu, Oct 17, 2019 at 09:31:30PM +, Luck, Tony wrote: > That sounds like the right short term action. > > Depending on what we end up with from Srinivas ... we may want > to reconsider the severity. The basic premise of Srinivas' patch > is to avoid printing anything for short excursions

RE: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-17 Thread Luck, Tony
>> That all sounds like the printk should be downgraded too, it is not a >> KERN_CRIT warning. It is more a notification that we're getting warm. > > Right, and I think we should take Benjamin's patch after all - perhaps > even tag it for stable if that message is annoying people too much - and >

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-16 Thread Borislav Petkov
On Wed, Oct 16, 2019 at 10:14:05AM +0200, Peter Zijlstra wrote:
> That all sounds like the printk should be downgraded too, it is not a
> KERN_CRIT warning. It is more a notification that we're getting warm.
Right, and I think we should take Benjamin's patch after all - perhaps even tag it for

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-16 Thread Peter Zijlstra
On Tue, Oct 15, 2019 at 06:31:46AM -0700, Srinivas Pandruvada wrote: > On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote: > > On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote: > > > Some modern systems have very tight thermal tolerances. Because of > > > this > > > they

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Srinivas Pandruvada
On Tue, 2019-10-15 at 10:46 +0200, Borislav Petkov wrote: > On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote: > > So some users who had issues in their systems can try with this > > patch. > > We can get rid of this, till it becomes real issue. > > We don't add command line

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Srinivas Pandruvada
On Tue, 2019-10-15 at 10:52 +0200, Peter Zijlstra wrote: > On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote: > > On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote: > > > This description is already *begging* for this delay value to be > > > automatically set by the kernel.

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Srinivas Pandruvada
On Tue, 2019-10-15 at 10:48 +0200, Peter Zijlstra wrote: > On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote: > > Some modern systems have very tight thermal tolerances. Because of > > this > > they may cross thermal thresholds when running normal workloads > > (even > > during

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Peter Zijlstra
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote: > On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote: > > This description is already *begging* for this delay value to be > > automatically set by the kernel. Putting yet another knob in front of > > the user who doesn't

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Peter Zijlstra
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote: > Some modern systems have very tight thermal tolerances. Because of this > they may cross thermal thresholds when running normal workloads (even > during boot). The CPU hardware will react by limiting power/frequency > and using

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Borislav Petkov
On Mon, Oct 14, 2019 at 03:41:38PM -0700, Srinivas Pandruvada wrote:
> So some users who had issues in their systems can try with this patch.
> We can get rid of this, till it becomes real issue.
We don't add command line parameters which we maybe can get rid of later.
> The temperature is

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-15 Thread Borislav Petkov
On Mon, Oct 14, 2019 at 03:27:35PM -0700, Luck, Tony wrote:
> You need a plausible start point for the "when to worry the user"
> message. Maybe that is your "max value"?
Yes, that would be a good start. You need that anyway because the experimentations you guys did to get your numbers have

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-14 Thread Srinivas Pandruvada
On Mon, 2019-10-14 at 23:36 +0200, Borislav Petkov wrote: > On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote: > > Some modern systems have very tight thermal tolerances. Because of > > this > > they may cross thermal thresholds when running normal workloads > > (even > > during

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-14 Thread Luck, Tony
On Mon, Oct 14, 2019 at 11:36:18PM +0200, Borislav Petkov wrote: > This description is already *begging* for this delay value to be > automatically set by the kernel. Putting yet another knob in front of > the user who doesn't have a clue most of the time shows one more time > that we haven't done

Re: [PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-14 Thread Borislav Petkov
On Mon, Oct 14, 2019 at 02:21:00PM -0700, Srinivas Pandruvada wrote: > Some modern systems have very tight thermal tolerances. Because of this > they may cross thermal thresholds when running normal workloads (even > during boot). The CPU hardware will react by limiting power/frequency > and using

[PATCH 1/2] x86, mce, therm_throt: Optimize logging of thermal throttle messages

2019-10-14 Thread Srinivas Pandruvada
Some modern systems have very tight thermal tolerances. Because of this they may cross thermal thresholds when running normal workloads (even during boot). The CPU hardware will react by limiting power/frequency and using duty cycles to bring the temperature back into normal range. Thus users may