[Kernel-packages] [Bug 1917813] Re: HWP and C1E are incompatible - Intel processors

2023-12-06 Thread Bug Watch Updater
Launchpad has imported 45 comments from the remote bug at
https://bugzilla.kernel.org/show_bug.cgi?id=210741.

If you reply to an imported comment from within Launchpad, your comment
will be sent to the remote bug automatically. Read more about
Launchpad's inter-bugtracker facilities at
https://help.launchpad.net/InterBugTracking.


On 2020-12-17T01:13:11+00:00 dsmythies wrote:

Created attachment 294171
Graph of load sweep up and down at 347 Hertz.

Consider a steady-state periodic single-threaded workflow, with a work/sleep
frequency of 347 Hertz and a load somewhere in the ~75% range at the
steady-state operating point.
With the intel-cpufreq CPU frequency scaling driver, powersave governor, and
HWP disabled, it runs indefinitely without any issues.
With the acpi-cpufreq CPU frequency scaling driver and ondemand governor, it
runs indefinitely without any issues.
With the intel-cpufreq CPU frequency scaling driver, powersave governor, and
HWP enabled, it suffers from overruns.

Why?

For unknown reasons, HWP seems to incorrectly decide that the processor
is idle and spins the PLL down to a very low frequency. Upon exit from
the sleep portion of the periodic workflow it takes a very long time to
recover, on the order of 20 milliseconds (supporting data for that
statement will be added in a later posting), resulting in the periodic
job not being able to complete its work before the next interval,
whereas it normally has plenty of time to do its work. Typical worst
case overruns are around 12 milliseconds, or several work/sleep periods
(i.e. it takes a very long time to catch up).
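
For reference, below is a minimal sketch (not the actual test program)
of the kind of periodic load and overrun counting described above. The
constants are taken from the description (roughly 75% busy at 347
work/sleep cycles per second) and the calibration is deliberately crude.

/*
 * Minimal sketch of the periodic load described above (illustrative
 * only): a fixed amount of work per period, calibrated to roughly 75%
 * of a 347 Hertz period at full CPU frequency, then sleep until the
 * next period boundary.  If the CPU frequency drops, the fixed work
 * takes longer and the deadline is missed: an overrun.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>

#define FREQ_HZ 347.0           /* work/sleep frequency */
#define DUTY    0.75            /* target busy fraction of the period */

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void burn(uint64_t loops)
{
    volatile double x = 1.0;
    for (uint64_t i = 0; i < loops; i++)
        x = x * 1.000001 + 0.000001;
}

int main(void)
{
    const uint64_t period_ns = (uint64_t)(1.0e9 / FREQ_HZ);
    unsigned long overruns = 0, cycles = 0;

    /* crude calibration: nanoseconds per loop at full CPU frequency */
    uint64_t t0 = now_ns();
    burn(10000000);
    double ns_per_loop = (double)(now_ns() - t0) / 10000000.0;
    uint64_t loops = (uint64_t)(period_ns * DUTY / ns_per_loop);

    uint64_t next = now_ns() + period_ns;

    for (;;) {
        burn(loops);                    /* fixed work per period */

        if (now_ns() > next)            /* missed the deadline */
            overruns++;

        struct timespec ts;
        ts.tv_sec  = next / 1000000000ULL;
        ts.tv_nsec = next % 1000000000ULL;
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &ts, NULL);
        next += period_ns;

        if (++cycles % 347 == 0)        /* report roughly once per second */
            printf("cycles %lu  overruns %lu\n", cycles, overruns);
    }
}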

The probability of this occurring is about 3%, but varies significantly.
Obviously, the recovery time is also a function of EPP, but mostly this
work has been done with the default EPP of 128. I believe this to be a
sampling and anti-aliasing issue, but cannot prove it because HWP is a
black box. My best GUESS is:

If the periodic load is busy on a jiffy boundary, such that the tick is on,
and it is then sleeping at the next jiffy boundary, with a pending wake, such
that idle state 2 was used,
and the rest of the system was idle, such that HWP decides to spin down the
PLL,
then it is highly probable that upon that idle state 2 exit the PLL is too
slow to ramp up and the task will overrun as a result.
Otherwise everything will be fine.
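
As an aside, one way to check whether a given sleep actually used idle
state 2 is to sample the per-CPU cpuidle "usage" counter before and
after it. A minimal sketch is below; it is not from this report, and it
assumes the task is pinned to cpu0 and the standard Linux cpuidle sysfs
layout.

/*
 * Sketch: sample the idle state 2 entry counter for cpu0 around a
 * short sleep.  Other wakeups on cpu0 will also bump the counter, so
 * this is only indicative.
 */
#include <stdio.h>
#include <time.h>

static long read_state2_usage(void)
{
    long n = -1;
    FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpuidle/state2/usage", "r");

    if (f) {
        if (fscanf(f, "%ld", &n) != 1)
            n = -1;
        fclose(f);
    }
    return n;
}

int main(void)
{
    struct timespec ts = { 0, 700000 };     /* ~0.7 millisecond sleep */
    long before = read_state2_usage();

    nanosleep(&ts, NULL);

    long after = read_state2_usage();
    printf("idle state 2 entries during the sleep: %ld\n", after - before);
    return 0;
}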

For a 1000 Hz kernel (1 millisecond jiffy), a work/sleep frequency of 500 Hz
gives a period of exactly 2 jiffies, so the work and sleep phases are locked
to the jiffy boundaries and the above suggests the test should behave in a
binary way, either lots of overruns or none. It does.
Likewise, a work/sleep frequency of 333.333 Hz gives a period of exactly 3
jiffies and should also behave in a binary way, either lots of overruns or
none. It does.
Note: in all cases the sleep time has to be within the window of opportunity.

I cannot prove whether the idle state 2 involvement is a cause or a
consequence, but the issue never happens with idle state 2 disabled,
albeit at the cost of significant power.

Another way this issue can manifest itself is as a seemingly
extraordinary idle state exit latency, which would be rather difficult
to isolate as the cause.

processors tested:
Intel(R) Core(TM) i5-9600K CPU @ 3.70GHz (mine)
Intel(R) Core(TM) i5-6200U CPU @ 2.30GHz (not mine)

HWP has been around for years, so why am I only reporting this now?

I never owned an HWP-capable processor before. My older i7-2600K based
test computer was getting a little old, so I built a new test computer.
I noticed this issue the same day I first enabled HWP. That was months
ago (notice the dates on the graphs that will eventually be added to
this report), and I tried, repeatedly, to get help from Intel via the
linux-pm e-mail list.

Given the above system response issue, a new test was developed to
focus specifically on it, dubbed the "Inverse Impulse Response" test.
It examines in great detail the CPU frequency rise time after a brief
(less than 1 millisecond) gap in an otherwise continuous workflow.
I'll attach graphs and details in subsequent postings to this bug
report.
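
To make the idea concrete, a minimal sketch of this kind of test is
below. It is not the actual test code; the chunk size, gap length, and
loop counts are arbitrary placeholders.

/*
 * Sketch of the "Inverse Impulse Response" idea: run a continuous
 * load, insert one gap of less than 1 millisecond, then time small
 * fixed-size chunks of work to see how long it takes for per-chunk
 * times, and therefore CPU frequency, to return to normal.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdint.h>
#include <time.h>

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ULL + ts.tv_nsec;
}

static void chunk(void)                 /* one small, fixed unit of work */
{
    volatile double x = 1.0;
    for (int i = 0; i < 200000; i++)
        x = x * 1.000001 + 0.000001;
}

int main(void)
{
    struct timespec gap = { 0, 800000 };    /* the ~0.8 millisecond gap */

    for (int i = 0; i < 5000; i++)          /* continuous load first */
        chunk();

    clock_nanosleep(CLOCK_MONOTONIC, 0, &gap, NULL);

    /* After the gap, a chunk that takes much longer than it did before
     * the gap indicates the CPU frequency was dropped and is still
     * ramping back up. */
    for (int i = 0; i < 200; i++) {
        uint64_t t = now_ns();
        chunk();
        printf("%d %llu\n", i, (unsigned long long)(now_ns() - t));
    }
    return 0;
}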

While I believe this is an issue entirely within HWP, I have not been
able to prove that there was nothing sent from the kernel somehow
telling HWP to spin down.

Notes:

CPU affinity does not need to be forced, but sometimes is for data
acquisition.
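
When affinity is forced, it is done in the usual way; a minimal sketch
using the standard Linux sched_setaffinity() call is below (the CPU
number is just an example, not from this report).

/* Pin the calling task to one CPU before running the periodic load. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(3, &mask);              /* example CPU number */

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }
    /* ... run the periodic load or data acquisition here ... */
    return 0;
}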

1000 hertz kernels were tested back to kernel 5.2, all failed.

Kernel 5.10-rc7 (I have yet to compile 5.10) also fails.

A 250 hertz kernel was tested, and it did not have this issue in this
area. Perhaps elsewhere, I didn't look.

Both the teo and menu idle governors were tested, and while both suffer
from the unexpected CPU frequency drop, teo seems much worse. However,
the failure points for both governors are repeatable.

The test computers were always checked for any throttling log sticky
bits, and regardless were never anywhere even close to throttling.

Note, however, that every HWP-capable computer I was able to acquire
data from has at least one of those sticky bits set after boot, so

[Kernel-packages] [Bug 1917813] Re: HWP and C1E are incompatible - Intel processors

2021-03-04 Thread Bug Watch Updater
** Changed in: linux
   Status: Unknown => Confirmed

** Changed in: linux
   Importance: Unknown => Medium

https://bugs.launchpad.net/bugs/1917813

Title:
  HWP and C1E are incompatible - Intel processors

Status in Linux:
  Confirmed
Status in linux package in Ubuntu:
  Confirmed

Bug description:
  Modern Intel Processors (since Skylake) with HWP (HardWare Pstate)
  control enabled and Idle State 2, C1E, enabled can incorrectly drop
  the CPU frequency with an extremely slow recovery time.

  The fault is not within HWP itself, but within the internal idle
  detection logic. One difference between OS driven pstate control and
  HWP driven pstate control is that the OS knows the system was not
  actually idle, but HWP does not. Another difference is the incredibly
  sluggish recovery with HWP.

  The problem only occurs when Idle State 2, C1E, is involved. Not all
  processors have the C1E idle state. The issue is independent of C1E
  auto-promotion, which is turned off in general, as far as I know.

  With all idle states enabled the issue is rare. The issue would
  manifest itself in periodic workflows, and would be extremely
  difficult to isolate (it took me over half a year).

  The purpose of this bug report is to link to the upstream bug report,
  where readers can find tons of detail. I'll also set it to confirmed,
  as it has already been verified on 4 different processor models, and I
  do not want the bot asking me for files that are not required.

  Workarounds include:
  . don't use HWP.
  . disable idle state 2, C1E (see the sketch after this list).
  . change the C1E idle state to use MWAIT 0x03 instead of MWAIT 0x01 (still
    in test; documentation on the MWAIT least significant nibble is scant).
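
  For the second workaround, a minimal sketch is below. It assumes the
  standard Linux cpuidle sysfs layout and that state2 is C1E on the
  processor in question; it is illustrative only, not the author's
  tooling, and the same per-CPU "disable" files can also be written
  directly from a shell as root.

/*
 * Write "1" to .../cpuidle/state2/disable for every CPU, disabling
 * idle state 2 (C1E here).  Must be run as root.
 */
#include <stdio.h>

int main(void)
{
    char path[128];

    for (int cpu = 0; ; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/cpuidle/state2/disable",
                 cpu);
        FILE *f = fopen(path, "w");
        if (!f)
            break;          /* no such CPU (or no such state): stop */
        fputs("1", f);
        fclose(f);
        printf("disabled idle state 2 on cpu%d\n", cpu);
    }
    return 0;
}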
