Re: [PATCH] cpuidle: Add 'above' and 'below' idle state metrics

2018-12-07 Thread Rafael J. Wysocki
On Friday, December 7, 2018 1:02:05 PM CET Quentin Perret wrote:
> Hi Rafael,
> 
> On Friday 07 Dec 2018 at 12:57:00 (+0100), Rafael J. Wysocki wrote:
> > --- linux-pm.orig/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ linux-pm/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -145,6 +145,8 @@ What:   /sys/devices/system/cpu/cpuX/cpui
> > /sys/devices/system/cpu/cpuX/cpuidle/stateN/power
> > /sys/devices/system/cpu/cpuX/cpuidle/stateN/time
> > /sys/devices/system/cpu/cpuX/cpuidle/stateN/usage
> > +   /sys/devices/system/cpu/cpuX/cpuidle/stateN/high
> 
> That should be s/high/above I suppose.

Right, thanks for spotting this. :-)

> > +   /sys/devices/system/cpu/cpuX/cpuidle/stateN/below
> 
> Other than that this seems really useful :-)

Thanks!



Re: [RFC/RFT,v7] cpuidle: New timer events oriented governor for tickless systems

2018-12-07 Thread Rafael J. Wysocki
On Fri, Dec 7, 2018 at 9:00 AM Chen, Hu  wrote:
>
> From: Chen Hu 
>
> Hi Rafael,
>
> I run several popular Android performance benchmarks on teov7, using kernel
> 4.19.0 as my baseline because I happen to work on it. To backport teov7 to
> kernel 4.19.0, I also backport patch 5f26bdc: "cpuidle: menu: Fix wakeup
> statistics updates for polling state". The teov7 doesn't show regressions on
> such perf KPIs.
>
> Compare "4.19 + 5f26bdc" and "4.19 + 5f26bdc + teov7" on Android with Intel
> Apollo Lake SoC:
>
> Test Case   Diff after appling teov7
> Antutu_60.51%
> GFX5_openGL_Car_Chase   0.43%
> GFX5_openGL_Car_Chase_offscreen 0.25%
> GFX5_openGL_Manhattan31 0.75%
> GFX5_openGL_Manhattan31_1080_offscreen  -0.10%
> Geekbench3.3-0.13%
> H264_1080P_60FPS0.00%
> H264_2160P_60FPS0.00%
> H265_2K_10bit   0.00%
> H265_2K_8bit0.00%
> Resume_time 1.31%
> full_boot   -0.39%

Thank you!


Re: Linux 4.20-rc5: Lab setup broken by build-related x86 change

2018-12-06 Thread Rafael J. Wysocki
On Wednesday, December 5, 2018 3:33:03 PM CET Borislav Petkov wrote:
> On Wed, Dec 05, 2018 at 01:15:10PM +0100, Rafael J. Wysocki wrote:
> > After commit
> > 
> > commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48
> > Author: Zhenzhong Duan 
> > Date:   Fri Nov 2 01:45:41 2018 -0700
> > 
> > x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support
> 
> I see Ingo committed a fix for this thing:
> 
> https://git.kernel.org/tip/25896d073d8a0403b07e6dec56f58e6c33678207
> 
> Does it work with that cherry-picked ontop?

Yes, it does, thanks!



Re: [PATCH][next] drivers: base: remove need for a temporary string for the node name

2018-12-06 Thread Rafael J. Wysocki
On Thu, Dec 6, 2018 at 3:57 PM Greg Kroah-Hartman
 wrote:
>
> On Thu, Nov 29, 2018 at 06:08:44PM +, Colin King wrote:
> > From: Colin Ian King 
> >
> > Currently the node name is being formatting into a temporary string
> > node_name, however, kobject_init_and_add allows one to format up
> > a node name, so use that instead. This removes the need for the
> > node_name string and also cleans up the following warning:
> >
> > Fixes clang warning:
> > warning: format string is not a string literal (potentially
> > insecure) [-Wformat-security]
> >
> > Signed-off-by: Colin Ian King 
> > ---
> >  drivers/base/swnode.c | 5 +
>
> I don't know where this file comes from.  It's not in my
> driver-core-next branch, who is creating it?

It is in my linux-next one, comes from a series of device properties
patches by Heikki (CCed to you).

This is the patch adding it: https://patchwork.kernel.org/patch/10676063/

So Colin, please resend and CC linux-a...@vger.kernel.org.


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-06 Thread Rafael J. Wysocki
On Thu, Dec 6, 2018 at 12:06 AM Doug Smythies  wrote:
>
> On 2018.12.03 03:48 Rafael J. Wysocki wrote:
>
> >>> There is an additional issue where if idle state 0 is disabled (with the 
> >>> above suggested code patch),
> >>> idle state usage seems to fall to deeper states than idle state 1.
> >>> This is not the expected behaviour.
> >>
> >> No, it isn't.
> >>
> >>> Kernel 4.20-rc3 works as expected.
> >>> I have not figured this issue out yet, in the code.
> >>>
> >>> Example (1 minute per sample. Number of entries/exits per state):
> >>> State 0 State 1 State 2 State 3 State 4Watts
> >>>28235143, 83, 26, 17,837,  64.900
> >>> 5583238, 657079,5884941,8498552,   30986831,  62.433 << 
> >>> Transition sample, after idle state 0 disabled
> >>>   0, 793517,7186099,   10559878,   38485721,  61.900 << 
> >>> ?? should have all gone into Idle state 1
> >>>   0, 795414,7340703,   10553117,   38513456,  62.050
> >>>   0, 807028,7288195,   10574113,   38523524,  62.167
> >>>   0, 814983,7403534,   10575108,   38571228,  62.167
> >>>   0, 838302,7747127,   10552289,   38556054,  62.183
> >>> 9664999, 544473,4914512,6942037,   25295361,  63.633 << 
> >>> Transition sample, after idle state 0 enabled
> >>>27893504, 96, 40,  9,912,  66.500
> >>>26556343, 83, 29,  7,814,  66.683
> >>>27929227, 64, 20, 10,931,  66.683
> >>
> >> I see.
> >>
> >> OK, I'll look into this too, thanks!
> >
> > This probably is the artifact of the fix for the teo_find_shallower_state()
> > issue.
> >
> > Anyway, I'm not able to reproduce this with the teo_find_shallower_state() 
> > issue
> > fixed differently.
>
> I am not able to reproduce with your teo_find_shallower_state(), or teo V 7,
> either. Everything is graceful now, as states are disabled:
> (10 seconds per sample. Number of entries/exits per state):
>
> State 0 State 1 State 2 State 3 State 4Watts
>   0,  6,  4,  1,414,   3.700
>   2,  4, 30,  3,578,   3.700  << No 
> load
>  168619, 37, 39,  4,480,   5.600  << 
> Transition sample
> 4643618, 45,  8,  1,137,  61.200  << All 
> idle states enabled
> 4736227, 40,  3,  5,111,  61.800
> 1888417,4369314, 25,  2, 89,  62.000  << 
> Transition sample
>   0,7266864,  9,  0,  0,  62.200  << 
> state 0 disabled
>   0,7193372,  9,  0,  0,  62.700
>   0,5539898,1744007,  0,  0,  63.500  << 
> Transition sample
>   0,  0,8152956,  0,  0,  63.700  << 
> states 0,1 disabled
>   0,  0,8015151,  0,  0,  63.900
>   0,  0,4146806,6349619,  0,  63.000  << 
> Transition sample
>   0,  0,  0,   13252144,  0,  61.600  << 
> states 0,1,2 disabled
>   0,  0,  0,   13258313,  0,  61.800
>   0,  0,  0,   10417428,1984451,  61.200  << 
> Transition sample
>   0,  0,  0,  0,9247172,  58.500  << 
> states 0,1,2,3 disabled
>   0,  0,  0,  0,9242657,  58.500
>   0,  0,  0,  0,9233749,  58.600
>   0,  0,  0,  0,9238444,  58.700
>   0,  0,  0,  0,9236345,  58.600
>
> For reference, this is kernel 4.20-rc5 (with your other proposed patches):
>
> State 0 State 1 State 2 State 3 State 4Watts
>   0,  4,  8,  6,426,   3.700
> 1592870,279,149, 96,831,  21.800
> 5071279,154, 25,  6,105,  61.200
> 5095090, 78, 21,  1, 86,  61.800
> 5001493, 94, 30,  4,101,  62.200
>  616019,

Re: [PATCH] x86/kernel: Fix more -Wmissing-prototypes warnings

2018-12-05 Thread Rafael J. Wysocki
On Wed, Dec 5, 2018 at 11:11 AM Borislav Petkov  wrote:
>
> From: Borislav Petkov 
>
> ... with the goal of eventually enabling -Wmissing-prototypes by
> default. At least on x86.
>
> Make functions static where possible, otherwise add prototypes or make
> them visible through includes.
>
> asm/trace/ changes courtesy of Steven Rostedt .
>
> Signed-off-by: Borislav Petkov 

Acked-by: Rafael J. Wysocki 

for the ACPI and cpufreq bits.


Re: Linux 4.20-rc5: Lab setup broken by build-related x86 change

2018-12-05 Thread Rafael J. Wysocki
On Monday, December 3, 2018 12:30:24 AM CET Linus Torvalds wrote:
> Hmm.. I'd like to say it was a normal week, but I'd be lying. rc5 is
> the biggest rc so far (with the obvious exception of rc1), and it
> looks fairly unusual in the diffstat too, with almost a third being
> arch updates. Yes, another third is drivers (normal), but even there
> almost half is in sound. The rest is tooling, mm, core networking and
> some fs updates.
> 
> So it all looks a bit odd, although none of it is hugely _alarming_.
> One of the reasons the arch side is a bit bigger than usual at this
> stage is that we got the STIPB performance regression sorted out, for
> example.

One of the changes apparently related to this broke my lab setup in which
I build everything on a central machine, export the build directory over
NFS (read-only) to some other systems and run "make modules_install" on
them from there.

After commit

commit 4cd24de3a0980bf3100c9dcb08ef65ca7c31af48
Author: Zhenzhong Duan 
Date:   Fri Nov 2 01:45:41 2018 -0700

x86/retpoline: Make CONFIG_RETPOLINE depend on compiler support

Since retpoline capable compilers are widely available, make
CONFIG_RETPOLINE hard depend on the compiler capability.

Break the build when CONFIG_RETPOLINE is enabled and the compiler does not
support it. Emit an error message in that case:

 "arch/x86/Makefile:226: *** You are building kernel with non-retpoline
  compiler, please update your compiler..  Stop."

[dwmw: Fail the build with non-retpoline compiler]

which is new in -rc5, the "make modules_install" step fails with the above
message, which doesn't make sense even, because the compiler on the clients
is the same as on the central machine in the lab and it can do retpolines
just fine.

Obviously this is due to the RETPOLINE_CFLAGS check added by the commit above.
Annoying that. :-)

Cheers,
Rafael



[GIT PULL] Power management fix for v4.20-rc6

2018-12-05 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 pm-4.20-rc6

with top-most commit a72173ecfc6774cf2d55de9fb29421ce69e3428c

 Revert "exec: make de_thread() freezable"

on top of commit 2595646791c319cadfdbf271563aac97d0843dc7

 Linux 4.20-rc5

to receive a power management fix for 4.20-rc6.

This reverts a problematic recent commit that attempted to
fix a system-wide suspend issue related to the freezer.

Thanks!


-------

Rafael J. Wysocki (1):
  Revert "exec: make de_thread() freezable"

---

 fs/exec.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)


Re: [PATCH 05/12] PCI: aardvark: add suspend to RAM support

2018-12-04 Thread Rafael J. Wysocki
On Tuesday, December 4, 2018 10:45:58 AM CET Lorenzo Pieralisi wrote:
> On Mon, Dec 03, 2018 at 11:00:20PM +0100, Rafael J. Wysocki wrote:
> > On Monday, December 3, 2018 4:38:46 PM CET Miquel Raynal wrote:
> > > Hi Lorenzo,
> > > 
> > > Lorenzo Pieralisi  wrote on Mon, 3 Dec 2018
> > > 10:27:08 +:
> > > 
> > > > [+Rafael, Sudeep]
> > > > 
> > > > On Fri, Nov 23, 2018 at 03:18:24PM +0100, Miquel Raynal wrote:
> > > > > Add suspend and resume callbacks. The priority of these are
> > > > > "_noirq()", to workaround early access to the registers done by the
> > > > > PCI core through the ->read()/->write() callbacks at resume time.
> > > > > 
> > > > > Signed-off-by: Miquel Raynal 
> > > > > ---
> > > > >  drivers/pci/controller/pci-aardvark.c | 52 
> > > > > +++
> > > > >  1 file changed, 52 insertions(+)
> > > > > 
> > > > > diff --git a/drivers/pci/controller/pci-aardvark.c 
> > > > > b/drivers/pci/controller/pci-aardvark.c
> > > > > index 108b3f15c410..7ecf1ac4036b 100644
> > > > > --- a/drivers/pci/controller/pci-aardvark.c
> > > > > +++ b/drivers/pci/controller/pci-aardvark.c
> > > > > @@ -1108,6 +1108,55 @@ static int advk_pcie_setup_clk(struct 
> > > > > advk_pcie *pci
> > > > >   return ret;
> > > > >  }
> > > > >  
> > > > > +static int __maybe_unused advk_pcie_suspend(struct device *dev)
> > > > > +{
> > > > > + struct advk_pcie *pcie = dev_get_drvdata(dev);
> > > > > +
> > > > > + advk_pcie_disable_phy(pcie);
> > > > > +
> > > > > + clk_disable_unprepare(pcie->clk);  
> > > > 
> > > > I have noticed it is common practice, still, I would like to check 
> > > > whether
> > > > it is allowed to call functions that may sleep in a NOIRQ suspend/resume
> > > > callback ?
> > > 
> > > You are right this is weird. I double checked and for instance,
> > > pcie-mediatek.c, pci-tegra.c and pci-imx6.c do the exact same thing. 
> > > There are
> > > probably other cases where drivers call functions that may sleep from a 
> > > NOIRQ
> > > context. I am interested to know if this is valid and if not, what is the
> > > alternative?
> > > 
> > 
> > Yes, it is valid.  _noirq means that the high-level action handlers
> > will not be invoked for interrupts occurring during that period, but
> > that doesn't apply to timer interrupts.
> > 
> > IOW, don't expect *your* IRQ handler to be invoked then (if this is
> > not a timer IRQ), but you can sleep.
> 
> Hi Rafael, all,
> 
> I did not ask my question (that may be silly) properly apologies. I know
> that the S2R context allows sleeping the question is, in case
> clk_disable_unprepare() (and resume counterparts) sleeps,

If it just sleeps, then this is not a problem, but if it actually *waits*
for something meaningful to happen (which I guess is what you really mean),
then things may go awry.

> what is going to wake it up, given that we are in the S2R NOIRQ phase and as
> you said the action handlers (that are possibly required to wake up the eg
> clk_disable_unprepare() caller) are disabled (unless, AFAIK,
> IRQF_NO_SUSPEND is passed at IRQ request time in the respective driver).

So if it waits for an action handler to do something and wake it up, it may
very well deadlock.  I have no idea if that really happens, though.

> The clk API implementations back-ends are beyond my depth, I just wanted
> to make sure I understand how the S2R flow is expected to work in this
> specific case.

Action handlers won't run unless the IRQs are marked as IRQF_NO_SUSPEND
(well, there are a few more complications I don't recall exactly, but
that's the basic rule).  If anything depends on them to run, it will block.

Thanks,
Rafael



[RFC/RFT][PATCH v7] cpuidle: New timer events oriented governor for tickless systems

2018-12-04 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some thigns that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (becuase it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up.  Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account.  It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki 
---

I prefer the v7 mostly becuase it is simpler and more straightforward,
but if the v6 turns out to be better in practice, I'll take that. :-)

v6 -> v7:
 * Actually use idle duration measured by the governor for everything (as
   it likely is more accurate than the one measured by the core).
 * Fix teo_find_shallower_state() to work if state 0 is disabled (among
   other cases).
 * Avoid using the most recent values of observed idle duration to refine
   the state selection unless the majority of them are below the target
   residency of the candidate idle state selected so far.  [This reduces
   the likelihood of triggering the refinement computation, so it may also
   reduce the overhead slightly (on average).]
 * Do not disregard the maximum observed idle duration falling in the
   interesting range.  [This should cause slightly deeper idle state to be
   selected through the selection refinement.]

v5 -> v6:
 * Avoid applying poll_time_limit to non-polling idle states by mistake.
 * Rename SPIKE to PULSE.
 * Do not run pattern detection upfront.  Instead, use recent idle duration
   values to refine the state selection after finding a candidate idle state.
 * Do not use the expected idle duration as an extra latency constraint
   (exit latency is less than the target residency for all of the idle states
   known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
 * Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum
   sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  437 +
 3 files changed, 449 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,437 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysock

Re: [Bug] SD card reader in Acer Aspire S5 broken in 4.20-rc

2018-12-04 Thread Rafael J. Wysocki
On Tuesday, December 4, 2018 1:10:09 AM CET Bjorn Helgaas wrote:
> On Wed, Nov 28, 2018 at 02:05:21PM -0600, Bjorn Helgaas wrote:
> > On Wed, Nov 28, 2018 at 6:13 AM Rafael J. Wysocki  
> > wrote:
> > > On Tuesday, November 27, 2018 9:25:14 PM CET Bjorn Helgaas wrote:
> > > > On Mon, Nov 26, 2018 at 11:37:20PM +0100, Rafael J. Wysocki wrote:
> > > > > On Monday, November 26, 2018 7:03:58 PM CET Rafael J. Wysocki wrote:
> > > > > > Hi Bjorn,
> > > > > >
> > > > > > The SD card reader in my Acer Aspire S5 doesn't work with 4.20-rc.
> > > > > >
> > > > > > Here's what lspci -v says about it (in a bad kernel):
> > > > > >
> > > > > > 02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. 
> > > > > > RTS5209 PCI Express Card Reader
> > > > > > (rev 01)
> > > > > > Subsystem: Acer Incorporated [ALI] Device 0704
> > > > > > Flags: bus master, fast devsel, latency 0, IRQ 35
> > > > > > Memory at d9001000 (32-bit, non-prefetchable) [size=4K]
> > > > > > Capabilities: [40] Power Management version 3
> > > > > > Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > > > > Capabilities: [70] Express Endpoint, MSI 00
> > > > > > Capabilities: [100] Advanced Error Reporting
> > > > > > Capabilities: [140] Device Serial Number 
> > > > > > 00-00-00-01-00-4c-e0-00
> > > > > > Kernel driver in use: rtsx_pci
> > > > > > Kernel modules: rtsx_pci
> > > >
> > > > Thanks a lot for bisecting this!
> > > >
> > > > With a good kernel (v4.19 or v4.20-rc with 17c91487364f reverted),
> > > > would you mind collecting "lspci -vv" output, the dmesg log with
> > > > "pci=earlydump", and the FADT dump?
> > >
> > > All of the information is attached to the BZ entry at
> > >
> > > https://bugzilla.kernel.org/show_bug.cgi?id=201801
> > 
> > Thanks!  I hope Patrick has a chance to look at this.  Per the
> > bugzilla mentioned in 17c91487364f, it fixes a problem with a custom
> > proprietary PCIe device, and there's a lot of good detailed analysis
> > there, so hopefully we can figure out a way to address both
> > situations.
> 
> I queued up a revert on for-linus, since we haven't made any progress on
> this yet.  I'll be on vacation much of this week, but I want to get
> the revert (or better, a fix if we can find one) in before -rc6 comes
> out next Sunday.

Thanks!

> If we figure out a fix before then, I'll drop the revert and use the
> fix instead.



Re: [PATCH] cpufreq: nforce2: Remove meaningless return

2018-12-04 Thread Rafael J. Wysocki
On Tue, Dec 4, 2018 at 2:15 AM Frank Lee  wrote:
>
> On Mon, Dec 3, 2018 at 5:14 PM Rafael J. Wysocki  wrote:
> >
> > On Fri, Nov 30, 2018 at 3:26 PM Yangtao Li  wrote:
> > >
> > > In a function whose return type is void, returning on the last line is
> > > not required.So remove it.Also move the module declaration to the end.
> >
> > The last piece is not reflected by the subject.
> >
> > Also, why do you move the MODULE_ stuff around at all?
> When writing a driver, in most cases MODULE_  are put to the end.
> Why not modify this? it is more in line with the habits of most people.

If you write a new driver, then yes.

But why do you want to modify existing drivers this way?  What value
do you see in those changes, exactly?


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Thursday, November 29, 2018 12:20:07 AM CET Doug Smythies wrote:
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> 
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>more accurate than the one measured by the core).
> 
> -- above missing-- (see follow up e-mail from Rafael)
> 
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>(exit latency is less than the target residency for all of the idle states
>known to me anyway, so this doesn't change anything in practice).
> 
> Hi Rafael,
> 
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
> 
> Test 1: Phoronix bdench test, all options: 1, 6, 12, 48, 128, 256 clients.
> 
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
> 
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
> 
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 verses
> teov6 at 0.0641%.
> 
> For throughput, teov6 was 1.4% faster.

This may indicate that teov6 is somewhat too aggressive.

> Test 3: was an attempt to sweep through a preference for
> all idle states.
> 
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
> 
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm

And, so long as my understanding of the graphs is correct, the results
here indicate that teov6 tends to prefer relatively shallow idle states
which is good for performance (at least with some workloads), but not
necessarily for energy-efficiency.

I will send a v7 of TEO with some changes to make it a bit more
energy-efficient with respect to the v6.

Thanks,
Rafael



Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Friday, November 30, 2018 9:51:19 AM CET Rafael J. Wysocki wrote:
> Hi Doug,
> 
> On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies  wrote:
> >
> > Hi Rafael,
> >
> > On 2018.11.23 02:36 Rafael J. Wysocki wrote:
> >
> > ... [snip]...
> >
> > > +/**
> > > + * teo_find_shallower_state - Find shallower idle state matching given 
> > > duration.
> > > + * @drv: cpuidle driver containing state data.
> > > + * @dev: Target CPU.
> > > + * @state_idx: Index of the capping idle state.
> > > + * @duration_us: Idle duration value to match.
> > > + */
> > > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > > + struct cpuidle_device *dev, int 
> > > state_idx,
> > > + unsigned int duration_us)
> > > +{
> > > + int i;
> > > +
> > > + for (i = state_idx - 1; i > 0; i--) {
> > > + if (drv->states[i].disabled || dev->states_usage[i].disable)
> > > + continue;
> > > +
> > > + if (drv->states[i].target_residency <= duration_us)
> > > + break;
> > > + }
> > > + return i;
> > > +}
> >
> > I think this subroutine has a problem when idle state 0
> > is disabled.
> 
> You are right, thanks!
> 
> > Perhaps something like this might help:
> >
> > diff --git a/drivers/cpuidle/governors/teo.c 
> > b/drivers/cpuidle/governors/teo.c
> > index bc1c9a2..5b97639 100644
> > --- a/drivers/cpuidle/governors/teo.c
> > +++ b/drivers/cpuidle/governors/teo.c
> > @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, 
> > struct cpuidle_device *dev)
> >  }
> >
> >  /**
> > - * teo_find_shallower_state - Find shallower idle state matching given 
> > duration.
> > + * teo_find_shallower_state - Find shallower idle state matching given
> > + * duration, if possible.
> >   * @drv: cpuidle driver containing state data.
> >   * @dev: Target CPU.
> >   * @state_idx: Index of the capping idle state.
> > @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct 
> > cpuidle_driver *drv,
> >  {
> > int i;
> >
> > -   for (i = state_idx - 1; i > 0; i--) {
> > +   for (i = state_idx - 1; i >= 0; i--) {
> > if (drv->states[i].disabled || dev->states_usage[i].disable)
> > continue;
> >
> > if (drv->states[i].target_residency <= duration_us)
> > break;
> > }
> > +   if (i < 0)
> > +   i = state_idx;
> > return i;
> >  }
> 
> I'll do something slightly similar, but equivalent.

I actually ended up fixing it differently, as the above will cause state_idx
to be returned even if some states shallower than state_idx are enabled, but
their target residencies are higher than duration_us.  In that case, though,
it still is more correct to return the shallowest enabled state rather than
state_idx.

> >
> > @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, 
> > struct cpuidle_device *dev,
> > if (max_early_idx >= 0 &&
> > count < cpu_data->states[i].early_hits)
> > count = cpu_data->states[i].early_hits;
> > -
> > continue;
> > }
> >
> > There is an additional issue where if idle state 0 is disabled (with the 
> > above suggested code patch),
> > idle state usage seems to fall to deeper states than idle state 1.
> > This is not the expected behaviour.
> 
> No, it isn't.
> 
> > Kernel 4.20-rc3 works as expected.
> > I have not figured this issue out yet, in the code.
> >
> > Example (1 minute per sample. Number of entries/exits per state):
> > State 0 State 1 State 2 State 3 State 4Watts
> >28235143, 83, 26, 17,837,  64.900
> > 5583238, 657079,5884941,8498552,   30986831,  62.433 << 
> > Transition sample, after idle state 0 disabled
> >   0, 793517,7186099,   10559878,   38485721,  61.900 << ?? 
> > should have all gone into Idle state 1
> >   0, 795414,7340703,   10553117,   38513456,  62.050
> >   0, 807028,7288195,   10574113,   38523524,  62.167
> >   0, 814983,7403534,   10575108,   38571228,  62.167
> >   0, 838302,7747127,   10552289,   38556054,  62.183
> > 9664999, 544473,4914512,6942037,   25295361,  63.633 << 
> > Transition sample, after idle state 0 enabled
> >27893504, 96, 40,  9,912,  66.500
> >26556343, 83, 29,  7,814,  66.683
> >27929227, 64, 20, 10,931,  66.683
> 
> I see.
> 
> OK, I'll look into this too, thanks!

This probably is the artifact of the fix for the teo_find_shallower_state()
issue.

Anyway, I'm not able to reproduce this with the teo_find_shallower_state() issue
fixed differently.

Thanks,
Rafael



Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-12-03 Thread Rafael J. Wysocki
On Saturday, December 1, 2018 3:18:24 PM CET Giovanni Gherdovich wrote:
> On Fri, 2018-11-23 at 11:35 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 

[cut]

> > 
> > [snip]
> 
> [NOTE: the tables in this message are quite wide. If this doesn't get to you
> properly formatted you can read a copy of this message at the URL
> https://beta.suse.com/private/ggherdovich/teo-eval/teo-v6-eval.html ]
> 
> All performance concerns manifested in v5 are wiped out by v6. Not only v6
> improves over v5, but is even better than the baseline (menu) in most
> cases. The optimizations in v6 paid off!

This is very encouraging, thank you!

> The overview of the analysis for v5, from the message
> https://lore.kernel.org/lkml/1541877001.17878.5.ca...@suse.cz , was:
> 
> > The quick summary is:
> > 
> > ---> sockperf on loopback over UDP, mode "throughput":
> >  this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is 
> > completely
> >  recovered in v3 and v5. Good stuff.
> > 
> > ---> dbench on xfs:
> >  this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
> >  regression. Slight improvement. What's really hurting here is the 
> > single
> >  client scenario.
> > 
> > ---> netperf-udp on loopback:
> >  had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
> >  happens in v5.
> > 
> > ---> tbench on loopback:
> >  was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 
> > 12%
> >  regression. As in dbench, it's at low number of clients that the 
> > results
> >  are worst. Note that this machine is different from the one that has 
> > the
> >  dbench regression.
> 
> now the situation is overturned:
> 
> ---> sockperf on loopback over UDP, mode "throughput":
>  No new problems from 48x-HASWELL-NUMA, which stays put at the level of
>  the baseline. OTOH 80x-BROADWELL-NUMA and 8x-SKYLAKE-UMA improve over the
>  baseline of 8% and 10% respectively.

Good.

> ---> dbench on xfs:
>  48x-HASWELL-NUMA rebounds from the previous 10% degradation and it's now
>  at 0, i.e. the baseline level. The 1-client case, responsible for the
>  previous overall degradation (I average results from different number of
>  clients), went from -40% to -20% and is compensated in my table by
>  improvements with 4, 8, 16 and 32 clients (table below).
> 
> ---> netperf-udp on loopback:
>  8x-SKYLAKE-UMA now shows a 9% improvement over  baseline.
>  80x-BROADWELL-NUMA, previously similar to baseline, now improves 7%.

Good.

> ---> tbench on loopback:
>  Impressive change of color for 8x-SKYLAKE-UMA, from 12% regression in v5
>  to 7% improvement in v6. The problematic 1- and 2-clients cases went from
>  -25% and -33% to +13% and +10% respectively.

Awesome. :-)

> Details below.
> 
> Runs are compared against v4.18 with the Menu governor. I know v4.18 is a
> little old now but that's where I measured my baseline. My machine pool didn't
> change:
> 
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here 
> onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here 
> onwards)
> 

[cut]

> 
> 
> PREVIOUSLY REGRESSING BENCHMARKS: OVERVIEW
> ==
> 
> * sockperf on loopback over UDP, mode "throughput"
> * global-dhp__network-sockperf-unbound
> 48x-HASWELL-NUMA fixed since v2, the others greatly improved in v6.
> 
> teo-v1  teo-v2  teo-v3  teo-v5  teo-v6
>   
> ---
>   8x-SKYLAKE-UMA1% worse1% worse1% worse1% worse10% 
> better
>   80x-BROADWELL-NUMA3% better   2% better   5% better   3% worse8% 
> better
>   48x-HASWELL-NUMA  4% better   12% worse   no change   no change   no 
> change
> 
> * dbench on xfs
> * global-dhp__io-dbench4-async-xfs
> 48x-HASWELL-NUMA is fixed wrt v5 and earlier versions.
> 
> teo-v1  teo-v2  teo-v3 teo-v5   
> teo-v6   
>   
> ---
>   8x-SKYLAKE-UMA3% better   4% better   6% better  4% better5% 
> better
>   80x-BROADWELL-NUMAno change   no change   1% worse   3% worse 2% 
> better
>   48x-HASWELL-NUMA  6% worse16% worse   8% worse   10% worseno

Re: [PATCH 05/12] PCI: aardvark: add suspend to RAM support

2018-12-03 Thread Rafael J. Wysocki
On Monday, December 3, 2018 4:38:46 PM CET Miquel Raynal wrote:
> Hi Lorenzo,
> 
> Lorenzo Pieralisi  wrote on Mon, 3 Dec 2018
> 10:27:08 +:
> 
> > [+Rafael, Sudeep]
> > 
> > On Fri, Nov 23, 2018 at 03:18:24PM +0100, Miquel Raynal wrote:
> > > Add suspend and resume callbacks. The priority of these are
> > > "_noirq()", to workaround early access to the registers done by the
> > > PCI core through the ->read()/->write() callbacks at resume time.
> > > 
> > > Signed-off-by: Miquel Raynal 
> > > ---
> > >  drivers/pci/controller/pci-aardvark.c | 52 +++
> > >  1 file changed, 52 insertions(+)
> > > 
> > > diff --git a/drivers/pci/controller/pci-aardvark.c 
> > > b/drivers/pci/controller/pci-aardvark.c
> > > index 108b3f15c410..7ecf1ac4036b 100644
> > > --- a/drivers/pci/controller/pci-aardvark.c
> > > +++ b/drivers/pci/controller/pci-aardvark.c
> > > @@ -1108,6 +1108,55 @@ static int advk_pcie_setup_clk(struct advk_pcie 
> > > *pcie)
> > >   return ret;
> > >  }
> > >  
> > > +static int __maybe_unused advk_pcie_suspend(struct device *dev)
> > > +{
> > > + struct advk_pcie *pcie = dev_get_drvdata(dev);
> > > +
> > > + advk_pcie_disable_phy(pcie);
> > > +
> > > + clk_disable_unprepare(pcie->clk);  
> > 
> > I have noticed it is common practice, still, I would like to check whether
> > it is allowed to call functions that may sleep in a NOIRQ suspend/resume
> > callback ?
> 
> You are right this is weird. I double checked and for instance,
> pcie-mediatek.c, pci-tegra.c and pci-imx6.c do the exact same thing. There are
> probably other cases where drivers call functions that may sleep from a NOIRQ
> context. I am interested to know if this is valid and if not, what is the
> alternative?
> 

Yes, it is valid.  _noirq means that the high-level action handlers will not be
invoked for interrupts occurring during that period, but that doesn't apply to
timer interrupts.

IOW, don't expect *your* IRQ handler to be invoked then (if this is not a timer
IRQ), but you can sleep.

Thanks,
Rafael



Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Rafael J. Wysocki
On Monday, December 3, 2018 9:39:42 AM CET Michal Hocko wrote:
> On Mon 03-12-18 08:47:00, Ingo Molnar wrote:
> [...]
> > I reviewed the ->cred_guard_mutex code, and the mutex is held across all 
> > of exec() - and we always did this.
> 
> Yes, this is something that has been pointed out during the review. Oleg
> has argued that making this path freezable is really hard and that we
> should be changing de_thread to sleep withtou cred_guard_mutex long term
> anyway (http://lkml.kernel.org/r/20181114143705.gb13...@redhat.com).
> 
> Failing suspend seems like a real problem while the lockdep one doesn't
> really reflect any real deadlock, right? So while the patch is not
> perfect it shouldn't make the situation much worse. Lockdep splat is
> certainly annoying but is it any worse than a suspend failing?
> 
> Now, I wouldn't mind to revert this because the code is really old and
> we haven't seen many bug reports about failing suspend yet. But what is
> the actual plan to make this work properly? Use
> freezable_schedule_unsafe instead? Freezer code has some
> fundamental design issues which are quite hard to get over.

I agree and we just need to look deeper into this.

I had hoped that this would work since you and Oleg agreed with it, but since
it doesn't, let's do a revert for now and get back to this later.

Thanks,
Rafael



Re: [PATCH] Revert "exec: make de_thread() freezable (was: Re: Linux 4.20-rc4)

2018-12-03 Thread Rafael J. Wysocki
On Monday, December 3, 2018 8:47:00 AM CET Ingo Molnar wrote:
> 
> * Linus Torvalds  wrote:
> 
> > The patch stats this week look a little bit more normal than last tim,
> > probably simply because it's also a normal-sized rc4 rather than the
> > unusually small rc3.
> 
> So there's a new regression in v4.20-rc4, my desktop produces this 
> lockdep splat:
> 
> [ 1772.588771] WARNING: pkexec/4633 still has locks held!
> [ 1772.588773] 4.20.0-rc4-custom-00213-g93a49841322b #1 Not tainted
> [ 1772.588775] 
> [ 1772.588776] 1 lock held by pkexec/4633:
> [ 1772.588778]  #0: ed85fbf8 (>cred_guard_mutex){+.+.}, at: 
> prepare_bprm_creds+0x2a/0x70
> [ 1772.588786] stack backtrace:
> [ 1772.588789] CPU: 7 PID: 4633 Comm: pkexec Not tainted 
> 4.20.0-rc4-custom-00213-g93a49841322b #1
> [ 1772.588792] Call Trace:
> [ 1772.588800]  dump_stack+0x85/0xcb
> [ 1772.588803]  flush_old_exec+0x116/0x890
> [ 1772.588807]  ? load_elf_phdrs+0x72/0xb0
> [ 1772.588809]  load_elf_binary+0x291/0x1620
> [ 1772.588815]  ? sched_clock+0x5/0x10
> [ 1772.588817]  ? search_binary_handler+0x6d/0x240
> [ 1772.588820]  search_binary_handler+0x80/0x240
> [ 1772.588823]  load_script+0x201/0x220
> [ 1772.588825]  search_binary_handler+0x80/0x240
> [ 1772.588828]  __do_execve_file.isra.32+0x7d2/0xa60
> [ 1772.588832]  ? strncpy_from_user+0x40/0x180
> [ 1772.588835]  __x64_sys_execve+0x34/0x40
> [ 1772.588838]  do_syscall_64+0x60/0x1c0
> 
> The warning gets triggered by an ancient lockdep check in the freezer:
> 
> (gdb) list *0x812ece06
> 0x812ece06 is in flush_old_exec (./include/linux/freezer.h:57).
> 52 * DO NOT ADD ANY NEW CALLERS OF THIS FUNCTION
> 53 * If try_to_freeze causes a lockdep warning it means the caller may 
> deadlock
> 54 */
> 55static inline bool try_to_freeze_unsafe(void)
> 56{
> 57might_sleep();
> 58if (likely(!freezing(current)))
> 59return false;
> 60return __refrigerator(false);
> 61}
> 
> I reviewed the ->cred_guard_mutex code, and the mutex is held across all 
> of exec() - and we always did this.
> 
> But there's this recent -rc4 commit:
> 
> > Chanho Min (1):
> >   exec: make de_thread() freezable
> 
>   c22397888f1e: exec: make de_thread() freezable
> 
> I believe this commit is bogus, you cannot call try_to_freeze() from 
> de_thread(), because it's holding the ->cred_guard_mutex.
> 
> Also, I'm 3 times grumpy:

Well, sorry about that.

>  #1: I think this commit was never tested with lockdep turned on, as I 
>  think splat this should trigger 100% of the time with the ELF 
>  binfmt loader.

It has been there in my local builds/tests, so I must have overlooked the
splat.

>  #2: Less than 4 days passed between the commit on Nov 19 and it being 
>  upstream by Nov 23. *Please* give it more testing if you change 
>  something as central as fs/exec.c ...

It has been under review for quite a bit more time though and I had hoped
that there would be some reports from linux-next coverage if there were
problems with it, but nothig has been reported till now.

>  #3  Also, please also proof-read changelogs before committing them:

I do that as a rule and I tend to fix up such things in changelogs.

Not sure why I didn't do that this time.  Because of travel perhaps.

>  >>  The casue is that de_thread() sleeps in TASK_UNINTERRUPTIBLE waiting 
> for
>  >>  [...]
>  >>
>  >>  In our machine, it causes freeze timeout as bellows.
> 
>  That's *three* typos in a single commit:
> 
> s/casue/cause
> s/In our/On our
> s/bellows/below
> 
>  ...
> 
> Grump! :-)
> 
> Note that I haven't tested the revert yet, but the code and the breakage 
> looks pretty obvious. (I'll boot the revert, will follow up if that 
> didn't solve the problem.)

I can queue up a revert unless anyone beats me to that.

Thanks,
Rafael



Re: [PATCH] sched/cpufreq: Add the SPDX tags

2018-12-03 Thread Rafael J. Wysocki
On Mon, Dec 3, 2018 at 11:15 AM Daniel Lezcano
 wrote:
>
> On 03/12/2018 11:12, Rafael J. Wysocki wrote:
> > On Mon, Dec 3, 2018 at 11:07 AM Daniel Lezcano
> >  wrote:
> >>
> >> On 18/10/2018 21:55, Daniel Lezcano wrote:
> >>> The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c.
> >>>
> >>> Add them and remove the license descriptions
> >>>
> >>> Signed-off-by: Daniel Lezcano 
> >>
> >> Hi Rafael,
> >>
> >> I think this patch passed between the cracks.
> >
> > Well, please resend. :-)
>
> I can resend, but can't you pick this one directly?

No, I can't apply it, because it depends on a patch that was not applied.


Re: [PATCH] sched/cpufreq: Add the SPDX tags

2018-12-03 Thread Rafael J. Wysocki
On Mon, Dec 3, 2018 at 11:07 AM Daniel Lezcano
 wrote:
>
> On 18/10/2018 21:55, Daniel Lezcano wrote:
> > The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c.
> >
> > Add them and remove the license descriptions
> >
> > Signed-off-by: Daniel Lezcano 
>
> Hi Rafael,
>
> I think this patch passed between the cracks.

Well, please resend. :-)


Re: [PATCH] cpufreq: nforce2: Remove meaningless return

2018-12-03 Thread Rafael J. Wysocki
On Fri, Nov 30, 2018 at 3:26 PM Yangtao Li  wrote:
>
> In a function whose return type is void, returning on the last line is
> not required.So remove it.Also move the module declaration to the end.

The last piece is not reflected by the subject.

Also, why do you move the MODULE_ stuff around at all?

>
> Signed-off-by: Yangtao Li 
> ---
>  drivers/cpufreq/cpufreq-nforce2.c | 11 ---
>  1 file changed, 4 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/cpufreq/cpufreq-nforce2.c 
> b/drivers/cpufreq/cpufreq-nforce2.c
> index dbf82f36d270..ccff1f2a7c25 100644
> --- a/drivers/cpufreq/cpufreq-nforce2.c
> +++ b/drivers/cpufreq/cpufreq-nforce2.c
> @@ -47,10 +47,6 @@ static int fid;
>  static int min_fsb;
>  static int max_fsb;
>
> -MODULE_AUTHOR("Sebastian Witt ");
> -MODULE_DESCRIPTION("nForce2 FSB changing cpufreq driver");
> -MODULE_LICENSE("GPL");
> -
>  module_param(fid, int, 0444);
>  module_param(min_fsb, int, 0444);
>
> @@ -123,8 +119,6 @@ static void nforce2_write_pll(int pll)
> /* Now write the value in all 64 registers */
> for (temp = 0; temp <= 0x3f; temp++)
> pci_write_config_dword(nforce2_dev, NFORCE2_PLLREG, pll);
> -
> -   return;
>  }
>
>  /**
> @@ -436,6 +430,9 @@ static void __exit nforce2_exit(void)
> cpufreq_unregister_driver(_driver);
>  }
>
> +MODULE_AUTHOR("Sebastian Witt ");
> +MODULE_DESCRIPTION("nForce2 FSB changing cpufreq driver");
> +MODULE_LICENSE("GPL");
> +
>  module_init(nforce2_init);
>  module_exit(nforce2_exit);
> -
> --
> 2.17.0
>


Re: [PATCH] driver core: remove define_genpd_open_function() and define_genpd_debugfs_fops()

2018-12-03 Thread Rafael J. Wysocki
On Sat, Dec 1, 2018 at 2:51 AM Yangtao Li  wrote:
>
> We already have the DEFINE_SHOW_ATTRIBUTE,There is no need to define such
> a macro,so remove define_genpd_open_function and define_genpd_debugfs_fops.
> Also use DEFINE_SHOW_ATTRIBUTE to simplify somecode.
>
> Signed-off-by: Yangtao Li 

It would be good at least to split the genpd changes off into a separate patch.

The regmap and comonent changes appear to be independent of each other
too, so I would put them into separate patches either for easier
handling.

> ---
>  drivers/base/component.c  | 12 +
>  drivers/base/power/domain.c   | 71 +--
>  drivers/base/regmap/regcache-rbtree.c | 12 +
>  drivers/base/regmap/regmap-debugfs.c  | 12 +
>  4 files changed, 27 insertions(+), 80 deletions(-)
>
> diff --git a/drivers/base/component.c b/drivers/base/component.c
> index e8d676fad0c9..ddcea8739c12 100644
> --- a/drivers/base/component.c
> +++ b/drivers/base/component.c
> @@ -85,17 +85,7 @@ static int component_devices_show(struct seq_file *s, void 
> *data)
> return 0;
>  }
>
> -static int component_devices_open(struct inode *inode, struct file *file)
> -{
> -   return single_open(file, component_devices_show, inode->i_private);
> -}
> -
> -static const struct file_operations component_devices_fops = {
> -   .open = component_devices_open,
> -   .read = seq_read,
> -   .llseek = seq_lseek,
> -   .release = single_release,
> -};
> +DEFINE_SHOW_ATTRIBUTE(component_devices);
>
>  static int __init component_debug_init(void)
>  {
> diff --git a/drivers/base/power/domain.c b/drivers/base/power/domain.c
> index 7f38a92b444a..10a61d6147d0 100644
> --- a/drivers/base/power/domain.c
> +++ b/drivers/base/power/domain.c
> @@ -2671,7 +2671,7 @@ static int genpd_summary_one(struct seq_file *s,
> return 0;
>  }
>
> -static int genpd_summary_show(struct seq_file *s, void *data)
> +static int summary_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd;
> int ret = 0;
> @@ -2694,7 +2694,7 @@ static int genpd_summary_show(struct seq_file *s, void 
> *data)
> return ret;
>  }
>
> -static int genpd_status_show(struct seq_file *s, void *data)
> +static int status_show(struct seq_file *s, void *data)
>  {
> static const char * const status_lookup[] = {
> [GPD_STATE_ACTIVE] = "on",
> @@ -2721,7 +2721,7 @@ static int genpd_status_show(struct seq_file *s, void 
> *data)
> return ret;
>  }
>
> -static int genpd_sub_domains_show(struct seq_file *s, void *data)
> +static int sub_domains_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
> struct gpd_link *link;
> @@ -2738,7 +2738,7 @@ static int genpd_sub_domains_show(struct seq_file *s, 
> void *data)
> return ret;
>  }
>
> -static int genpd_idle_states_show(struct seq_file *s, void *data)
> +static int idle_states_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
> unsigned int i;
> @@ -2767,7 +2767,7 @@ static int genpd_idle_states_show(struct seq_file *s, 
> void *data)
> return ret;
>  }
>
> -static int genpd_active_time_show(struct seq_file *s, void *data)
> +static int active_time_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
> ktime_t delta = 0;
> @@ -2787,7 +2787,7 @@ static int genpd_active_time_show(struct seq_file *s, 
> void *data)
> return ret;
>  }
>
> -static int genpd_total_idle_time_show(struct seq_file *s, void *data)
> +static int total_idle_time_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
> ktime_t delta = 0, total = 0;
> @@ -2815,7 +2815,7 @@ static int genpd_total_idle_time_show(struct seq_file 
> *s, void *data)
>  }
>
>
> -static int genpd_devices_show(struct seq_file *s, void *data)
> +static int devices_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
> struct pm_domain_data *pm_data;
> @@ -2841,7 +2841,7 @@ static int genpd_devices_show(struct seq_file *s, void 
> *data)
> return ret;
>  }
>
> -static int genpd_perf_state_show(struct seq_file *s, void *data)
> +static int perf_state_show(struct seq_file *s, void *data)
>  {
> struct generic_pm_domain *genpd = s->private;
>
> @@ -2854,37 +2854,14 @@ static int genpd_perf_state_show(struct seq_file *s, 
> void *data)
> return 0;
>  }
>
> -#define define_genpd_open_function(name) \
> -static int genpd_##name##_open(struct inode *inode, struct file *file) \
> -{ \
> -   return single_open(file, genpd_##name##_show, inode->i_private); \
> -}
> -
> -define_genpd_open_function(summary);
> -define_genpd_open_function(status);
> -define_genpd_open_function(sub_domains);
> -define_genpd_open_function(idle_states);
> -define_genpd_open_function(active_time);
> 

Re: [PATCH RFC 06/15] cpufreq: replace **** with a hug

2018-12-03 Thread Rafael J. Wysocki
On Fri, Nov 30, 2018 at 8:28 PM Jarkko Sakkinen
 wrote:
>
> In order to comply with the CoC, replace  with a hug.
>
> Signed-off-by: Jarkko Sakkinen 
> ---
>  drivers/cpufreq/powernow-k7.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/powernow-k7.c b/drivers/cpufreq/powernow-k7.c
> index d6cb052b0a75..302315ad3786 100644
> --- a/drivers/cpufreq/powernow-k7.c
> +++ b/drivers/cpufreq/powernow-k7.c
> @@ -574,7 +574,7 @@ static int acer_cpufreq_pst(const struct dmi_system_id *d)
>  }
>
>  /*
> - * Some Athlon laptops have really fucked PST tables.
> + * Some Athlon laptops have really hugged PST tables.

Can we just say "broken"?

>   * A BIOS update is all that can save them.
>   * Mention this, and disable cpufreq.
>   */
> --
> 2.19.1
>


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-30 Thread Rafael J. Wysocki
Hi Doug,

On Fri, Nov 30, 2018 at 8:49 AM Doug Smythies  wrote:
>
> Hi Rafael,
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> ... [snip]...
>
> > +/**
> > + * teo_find_shallower_state - Find shallower idle state matching given 
> > duration.
> > + * @drv: cpuidle driver containing state data.
> > + * @dev: Target CPU.
> > + * @state_idx: Index of the capping idle state.
> > + * @duration_us: Idle duration value to match.
> > + */
> > +static int teo_find_shallower_state(struct cpuidle_driver *drv,
> > + struct cpuidle_device *dev, int state_idx,
> > + unsigned int duration_us)
> > +{
> > + int i;
> > +
> > + for (i = state_idx - 1; i > 0; i--) {
> > + if (drv->states[i].disabled || dev->states_usage[i].disable)
> > + continue;
> > +
> > + if (drv->states[i].target_residency <= duration_us)
> > + break;
> > + }
> > + return i;
> > +}
>
> I think this subroutine has a problem when idle state 0
> is disabled.

You are right, thanks!

> Perhaps something like this might help:
>
> diff --git a/drivers/cpuidle/governors/teo.c b/drivers/cpuidle/governors/teo.c
> index bc1c9a2..5b97639 100644
> --- a/drivers/cpuidle/governors/teo.c
> +++ b/drivers/cpuidle/governors/teo.c
> @@ -196,7 +196,8 @@ static void teo_update(struct cpuidle_driver *drv, struct 
> cpuidle_device *dev)
>  }
>
>  /**
> - * teo_find_shallower_state - Find shallower idle state matching given 
> duration.
> + * teo_find_shallower_state - Find shallower idle state matching given
> + * duration, if possible.
>   * @drv: cpuidle driver containing state data.
>   * @dev: Target CPU.
>   * @state_idx: Index of the capping idle state.
> @@ -208,13 +209,15 @@ static int teo_find_shallower_state(struct 
> cpuidle_driver *drv,
>  {
> int i;
>
> -   for (i = state_idx - 1; i > 0; i--) {
> +   for (i = state_idx - 1; i >= 0; i--) {
> if (drv->states[i].disabled || dev->states_usage[i].disable)
> continue;
>
> if (drv->states[i].target_residency <= duration_us)
> break;
> }
> +   if (i < 0)
> +   i = state_idx;
> return i;
>  }

I'll do something slightly similar, but equivalent.

>
> @@ -264,7 +267,6 @@ static int teo_select(struct cpuidle_driver *drv, struct 
> cpuidle_device *dev,
> if (max_early_idx >= 0 &&
> count < cpu_data->states[i].early_hits)
> count = cpu_data->states[i].early_hits;
> -
> continue;
> }
>
> There is an additional issue where if idle state 0 is disabled (with the 
> above suggested code patch),
> idle state usage seems to fall to deeper states than idle state 1.
> This is not the expected behaviour.

No, it isn't.

> Kernel 4.20-rc3 works as expected.
> I have not figured this issue out yet, in the code.
>
> Example (1 minute per sample. Number of entries/exits per state):
> State 0 State 1 State 2 State 3 State 4Watts
>28235143, 83, 26, 17,837,  64.900
> 5583238, 657079,5884941,8498552,   30986831,  62.433 << 
> Transition sample, after idle state 0 disabled
>   0, 793517,7186099,   10559878,   38485721,  61.900 << ?? 
> should have all gone into Idle state 1
>   0, 795414,7340703,   10553117,   38513456,  62.050
>   0, 807028,7288195,   10574113,   38523524,  62.167
>   0, 814983,7403534,   10575108,   38571228,  62.167
>   0, 838302,7747127,   10552289,   38556054,  62.183
> 9664999, 544473,4914512,6942037,   25295361,  63.633 << 
> Transition sample, after idle state 0 enabled
>27893504, 96, 40,  9,912,  66.500
>26556343, 83, 29,  7,814,  66.683
>27929227, 64, 20, 10,931,  66.683

I see.

OK, I'll look into this too, thanks!


Re: [PATCH v2 0/6] device property: Introducing software nodes

2018-11-29 Thread Rafael J. Wysocki
On Friday, November 9, 2018 3:21:32 PM CET Heikki Krogerus wrote:
> Hi,
> 
> This is the second version of my proposal for "software nodes". There
> was a "dereferencing freed memory" bug in patch 3/5 which is now
> fixed. device_add_properties() and device_remove_properties() no
> longer change places in the code as requested by Andy.
> 
> The original RFC can be checked from here:
> https://lkml.org/lkml/2018/10/12/518
> 
> The origin commit message:
> 
> To continue the discussion started by Dmitry [1], this is my proposal
> that I mentioned in my last mail. In short, the idea is that instead
> of trying to extend the support for the currently used struct
> property_set, I'm proposing that we introduce a completely new,
> independent type of fwnode, and replace the struct property_set with
> it. I'm calling the type "software node" here.
> 
> The reason for a complete separation of the software nodes from the
> generic property handling code is the need to be able to create the
> nodes independently from the devices that they are bind to.
> 
> The way this works is that every node that is created will have a
> kobject registered. That will take care the ref counting for us, and
> also allow us to for example display the properties in sysfs.
> 
> There are a few more details in patch 3/5 about the software nodes in
> the commit message.
> 
> [1] https://lkml.org/lkml/2018/9/17/1067
> 
> --
> heikki
> 
> 
> Heikki Krogerus (6):
>   driver core: platform: Remove duplicated device_remove_properties()
> call
>   drivers core: Prepare support for multiple platform notifications
>   ACPI / glue: Add acpi_platform_notify() function
>   drivers: base: Introducing software nodes to the firmware node
> framework
>   device property: Move device_add_properties() to swnode.c
>   device property: Remove struct property_set
> 
>  .../ABI/testing/sysfs-devices-software_node   |  10 +
>  drivers/acpi/bus.c|   1 -
>  drivers/acpi/glue.c   |  21 +-
>  drivers/acpi/internal.h   |   1 -
>  drivers/base/Makefile |   2 +-
>  drivers/base/core.c   |  34 +-
>  drivers/base/platform.c   |   1 -
>  drivers/base/property.c   | 511 +
>  drivers/base/swnode.c | 678 ++
>  include/linux/acpi.h  |  10 +
>  include/linux/property.h  |  12 +
>  11 files changed, 768 insertions(+), 513 deletions(-)
>  create mode 100644 Documentation/ABI/testing/sysfs-devices-software_node
>  create mode 100644 drivers/base/swnode.c
> 
> 

All [1-6/6] applied, thanks!




Re: [PATCH v3 1/4] PCI / ACPI: Identify untrusted PCI devices

2018-11-29 Thread Rafael J. Wysocki
On Thu, Nov 29, 2018 at 4:52 PM Mika Westerberg
 wrote:
>
> A malicious PCI device may use DMA to attack the system. An external
> Thunderbolt port is a convenient point to attach such a device. The OS
> may use IOMMU to defend against DMA attacks.
>
> Recent BIOSes with Thunderbolt ports mark these externally facing root
> ports with this ACPI _DSD [1]:
>
>   Name (_DSD, Package () {
>   ToUUID ("efcc06cc-73ac-4bc3-bff0-76143807c389"),
>   Package () {
>   Package () {"ExternalFacingPort", 1},
>   Package () {"UID", 0 }
>   }
>   })
>
> If we find such a root port, mark it and all its children as untrusted.
> The rest of the OS may use this information to enable DMA protection
> against malicious devices. For instance the device may be put behind an
> IOMMU to keep it from accessing memory outside of what the driver has
> allocated for it.
>
> While at it, add a comment on top of prp_guids array explaining the
> possible caveat resulting when these GUIDs are treated equivalent.
>
> [1] 
> https://docs.microsoft.com/en-us/windows-hardware/drivers/pci/dsd-for-pcie-root-ports#identifying-externally-exposed-pcie-root-ports
>
> Signed-off-by: Mika Westerberg 

Acked-by: Rafael J. Wysocki 

> ---
>  drivers/acpi/property.c | 11 +++
>  drivers/pci/pci-acpi.c  | 19 +++
>  drivers/pci/probe.c | 15 +++
>  include/linux/pci.h |  8 
>  4 files changed, 53 insertions(+)
>
> diff --git a/drivers/acpi/property.c b/drivers/acpi/property.c
> index 8c7c4583b52d..77abe0ec4043 100644
> --- a/drivers/acpi/property.c
> +++ b/drivers/acpi/property.c
> @@ -24,6 +24,14 @@ static int acpi_data_get_property_array(const struct 
> acpi_device_data *data,
> acpi_object_type type,
> const union acpi_object **obj);
>
> +/*
> + * The GUIDs here are made equivalent to each other in order to avoid extra
> + * complexity in the properties handling code, with the caveat that the
> + * kernel will accept certain combinations of GUID and properties that are
> + * not defined without a warning. For instance if any of the properties
> + * from different GUID appear in a property list of another, it will be
> + * accepted by the kernel. Firmware validation tools should catch these.
> + */
>  static const guid_t prp_guids[] = {
> /* ACPI _DSD device properties GUID: 
> daffd814-6eba-4d8c-8a91-bc9bbf4aa301 */
> GUID_INIT(0xdaffd814, 0x6eba, 0x4d8c,
> @@ -31,6 +39,9 @@ static const guid_t prp_guids[] = {
> /* Hotplug in D3 GUID: 6211e2c0-58a3-4af3-90e1-927a4e0c55a4 */
> GUID_INIT(0x6211e2c0, 0x58a3, 0x4af3,
>   0x90, 0xe1, 0x92, 0x7a, 0x4e, 0x0c, 0x55, 0xa4),
> +   /* External facing port GUID: efcc06cc-73ac-4bc3-bff0-76143807c389 */
> +   GUID_INIT(0xefcc06cc, 0x73ac, 0x4bc3,
> + 0xbf, 0xf0, 0x76, 0x14, 0x38, 0x07, 0xc3, 0x89),
>  };
>
>  static const guid_t ads_guid =
> diff --git a/drivers/pci/pci-acpi.c b/drivers/pci/pci-acpi.c
> index 921db6f80340..e1949f7efd9c 100644
> --- a/drivers/pci/pci-acpi.c
> +++ b/drivers/pci/pci-acpi.c
> @@ -789,6 +789,24 @@ static void pci_acpi_optimize_delay(struct pci_dev *pdev,
> ACPI_FREE(obj);
>  }
>
> +static void pci_acpi_set_untrusted(struct pci_dev *dev)
> +{
> +   u8 val;
> +
> +   if (pci_pcie_type(dev) != PCI_EXP_TYPE_ROOT_PORT)
> +   return;
> +   if (device_property_read_u8(>dev, "ExternalFacingPort", ))
> +   return;
> +
> +   /*
> +* These root ports expose PCIe (including DMA) outside of the
> +* system so make sure we treat them and everything behind as
> +* untrusted.
> +*/
> +   if (val)
> +   dev->untrusted = 1;
> +}
> +
>  static void pci_acpi_setup(struct device *dev)
>  {
> struct pci_dev *pci_dev = to_pci_dev(dev);
> @@ -798,6 +816,7 @@ static void pci_acpi_setup(struct device *dev)
> return;
>
> pci_acpi_optimize_delay(pci_dev, adev->handle);
> +   pci_acpi_set_untrusted(pci_dev);
>
> pci_acpi_add_pm_notifier(adev, pci_dev);
> if (!adev->wakeup.flags.valid)
> diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
> index b1c05b5054a0..257b9f6f2ebb 100644
> --- a/drivers/pci/probe.c
> +++ b/drivers/pci/probe.c
> @@ -1378,6 +1378,19 @@ static void set_pcie_thunderbolt(struct pci_dev *dev)
> }
>  }
>
> +static void set_pcie_untrusted(struct pci_dev *dev)
> +{
> +   struct pci_dev *parent;
> +
> +   /*
> 

Re: [PATCH v5 14/15] ACPI / scan: Create platform device for INT3515 ACPI nodes

2018-11-29 Thread Rafael J. Wysocki
On Wed, Nov 28, 2018 at 12:48 PM Andy Shevchenko
 wrote:
>
> The ACPI device with INT3515 _HID is representing a complex USB PD
> hardware infrastructure which includes several I2C slave ICs.
>
> We add an ID to the I2C multi instantiate list to enumerate
> all I2C slaves correctly.
>
> Signed-off-by: Andy Shevchenko 
> Reviewed-by: Hans de Goede 

Acked-by: Rafael J. Wysocki 

and I'm assuming that you will route the whole series through the
platform/x86 drivers tree.


[GIT PULL] ACPI fix for v4.20-rc5

2018-11-29 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.20-rc5

with top-most commit c4f784268210ae5e6749d4ba30d117bd301a70a6

 Merge branch 'acpica-fixes'

on top of commit 2bbb5fa37475d7aa5fa62f34db1623f3da2dfdfa

 ACPI / platform: Add SMB0001 HID to forbidden_id_list

to receive an ACPI fix for 4.20-rc5.

This fixes a recent regression in ACPICA releted to the Generic
Serial Bus protocol handling and causing it to read or write too
little or too much data in some cases, so incorrect data may be
written to hardware as a result (Hans de Goede).

Thanks!


---

Hans de Goede (1):
  ACPICA: Fix handling of buffer-size in acpi_ex_write_data_to_field()

---

 drivers/acpi/acpica/exserial.c | 21 ++---
 1 file changed, 2 insertions(+), 19 deletions(-)


[GIT PULL] Power management fixes for v4.20-rc4

2018-11-29 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 pm-4.20-rc5

with top-most commit 36c3aeb4b48d5b058526d606fde14db4fd7e5e6d

 Merge branch 'opp/fixes-for-4.20' of
git://git.kernel.org/pub/scm/linux/kernel/git/vireshk/pm

on top of commit 2e6e902d185027f8e3cb8b7305238f7e35d6a436

 Linux 4.20-rc4

to receive power management fixes for 4.20-rc5.

These fix two issues in the operating performance points (OPP)
framework.

Specifics:

 - Fix the handling of the "operating-points-v2" property to
   avoid failures if multiple phandles are present in it which
   is legitimate (Viresh Kumar).

 - Drop the unnecessary static initialization of the .owner field in
   the ti_opp_supply_driver structure (YueHaibing).

Thanks!


---

Viresh Kumar (1):
  OPP: Fix parsing of multiple phandles in "operating-points-v2" property

YueHaibing (1):
  opp: ti-opp-supply: Fix platform_no_drv_owner.cocci warnings

---

 drivers/opp/of.c| 6 ++
 drivers/opp/ti-opp-supply.c | 1 -
 2 files changed, 2 insertions(+), 5 deletions(-)


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-29 Thread Rafael J. Wysocki
Hi Doug,

On Thu, Nov 29, 2018 at 12:20 AM Doug Smythies  wrote:
>
> On 2018.11.23 02:36 Rafael J. Wysocki wrote:
>
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>more accurate than the one measured by the core).
>
> -- above missing-- (see follow up e-mail from Rafael)
>
>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>(exit latency is less than the target residency for all of the idle states
>known to me anyway, so this doesn't change anything in practice).
>
> Hi Rafael,
>
> I did some minimal testing on teov6, using kernel 4.20-rc3 as my baseline
> reference kernel.
>
> Test 1: Phoronix bdench test, all options: 1, 6, 12, 48, 128, 256 clients.
>
> Note: because it uses the disk, the dbench test is somewhat non-repeatable.
> However, if particular attention is paid to not doing anything else with
> the disk between tests, then it seems to be repeatable to within about 6%.
>
> Anyway no significant difference observed between kernel 4.20-rc3 and the
> same with the teov6 patch.
>
> Test 2: Pipe test, non cross core. (And idle state 0 test, really)
> I ran 4 pipe tests, 1 for each of my 4 cores, @2 CPUs per core.
> Thus, pretty much only idle state 0 was ever used.
> Processor package power was similar for both kernels.
> teov6 entered/exited idle state 0 about 60,984 times/second/cpu.
> -rc3 entered/exited idle state 0 about 62,806 times/second/cpu.
> There was a difference in percentage time spent in idle state 0,
> with kernel 4.20-rc3 spending 0.2441% in idle state 0 verses
> teov6 at 0.0641%.
>
> For throughput, teov6 was 1.4% faster.
>
> Test 3: was an attempt to sweep through a preference for
> all idle states.
>
> 40 threads were launched with nothing to do except sleep
> for a variable duration of 1 to 500 uSec, each step was
> run for 1 minute. With 1 minute idle before the test and a few
> minutes idle after, the total test duration was about 505 minutes.
> Recall that when one asks for a short sleep of 1 uSec, they actually
> get about 50 uSec, due to overheads. So I use 40 threads in an attempt
> to get the average time between wakeup events per CPU down somewhat.
>
> The results are here:
> http://fast.smythies.com/linux-pm/k420/k420-pn-sweep-teo6-2.htm
>
> I might try to get some histogram information at a later date.

Thank you for the results, much appreciated!


Re: [Bug] SD card reader in Acer Aspire S5 broken in 4.20-rc

2018-11-28 Thread Rafael J. Wysocki
On Tuesday, November 27, 2018 9:25:14 PM CET Bjorn Helgaas wrote:
> On Mon, Nov 26, 2018 at 11:37:20PM +0100, Rafael J. Wysocki wrote:
> > On Monday, November 26, 2018 7:03:58 PM CET Rafael J. Wysocki wrote:
> > > Hi Bjorn,
> > > 
> > > The SD card reader in my Acer Aspire S5 doesn't work with 4.20-rc.
> > > 
> > > Here's what lspci -v says about it (in a bad kernel):
> > > 
> > > 02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 
> > > PCI Express Card Reader 
> > > (rev 01)
> > > Subsystem: Acer Incorporated [ALI] Device 0704
> > > Flags: bus master, fast devsel, latency 0, IRQ 35
> > > Memory at d9001000 (32-bit, non-prefetchable) [size=4K]
> > > Capabilities: [40] Power Management version 3
> > > Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > Capabilities: [70] Express Endpoint, MSI 00
> > > Capabilities: [100] Advanced Error Reporting
> > > Capabilities: [140] Device Serial Number 00-00-00-01-00-4c-e0-00
> > > Kernel driver in use: rtsx_pci
> > > Kernel modules: rtsx_pci
> 
> Thanks a lot for bisecting this!
> 
> With a good kernel (v4.19 or v4.20-rc with 17c91487364f reverted),
> would you mind collecting "lspci -vv" output, the dmesg log with
> "pci=earlydump", and the FADT dump?

All of the information is attached to the BZ entry at

https://bugzilla.kernel.org/show_bug.cgi?id=201801

I have created for the tracking of this issue.  Let's continue in the BZ.

Cheers,
Rafael



Re: [Bug] SD card reader in Acer Aspire S5 broken in 4.20-rc

2018-11-27 Thread Rafael J. Wysocki
On Tuesday, November 27, 2018 9:25:14 PM CET Bjorn Helgaas wrote:
> On Mon, Nov 26, 2018 at 11:37:20PM +0100, Rafael J. Wysocki wrote:
> > On Monday, November 26, 2018 7:03:58 PM CET Rafael J. Wysocki wrote:
> > > Hi Bjorn,
> > > 
> > > The SD card reader in my Acer Aspire S5 doesn't work with 4.20-rc.
> > > 
> > > Here's what lspci -v says about it (in a bad kernel):
> > > 
> > > 02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 
> > > PCI Express Card Reader 
> > > (rev 01)
> > > Subsystem: Acer Incorporated [ALI] Device 0704
> > > Flags: bus master, fast devsel, latency 0, IRQ 35
> > > Memory at d9001000 (32-bit, non-prefetchable) [size=4K]
> > > Capabilities: [40] Power Management version 3
> > > Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
> > > Capabilities: [70] Express Endpoint, MSI 00
> > > Capabilities: [100] Advanced Error Reporting
> > > Capabilities: [140] Device Serial Number 00-00-00-01-00-4c-e0-00
> > > Kernel driver in use: rtsx_pci
> > > Kernel modules: rtsx_pci
> 
> Thanks a lot for bisecting this!
> 
> With a good kernel (v4.19 or v4.20-rc with 17c91487364f reverted),
> would you mind collecting "lspci -vv" output, the dmesg log with
> "pci=earlydump", and the FADT dump?

I'll do that tomorrow.

> I'm interested in the initial state of the device at handoff from
> BIOS, and what Linux changes even when aspm_disabled is set.

OK

> If we can't figure out a way to fix both this issue and the one fixed
> by 17c91487364f, I guess the fallback will be to revert 17c91487364f
> since it's better to allow a system that was previously broken to
> remain broken than it is to break a system that previously worked.

Well, depending on how many systems are affected by the issues fixed by
17c91487364f IMO.

Arguably, if the FADT says "NO_ASPM", then the platform should not need the
OS to initialize ASPM to work.  I guess a manual workaround (like an extra
kernel command line option or similar) should suffice in this particular
case.

> But obviously I hope we can figure out a solution that fixes both
> cases.

Of course.



Re: [Bug] SD card reader in Acer Aspire S5 broken in 4.20-rc

2018-11-26 Thread Rafael J. Wysocki
On Monday, November 26, 2018 7:03:58 PM CET Rafael J. Wysocki wrote:
> Hi Bjorn,
> 
> The SD card reader in my Acer Aspire S5 doesn't work with 4.20-rc.
> 
> Here's what lspci -v says about it (in a bad kernel):
> 
> 02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 PCI 
> Express Card Reader 
> (rev 01)
> Subsystem: Acer Incorporated [ALI] Device 0704
> Flags: bus master, fast devsel, latency 0, IRQ 35
> Memory at d9001000 (32-bit, non-prefetchable) [size=4K]
> Capabilities: [40] Power Management version 3
> Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
> Capabilities: [70] Express Endpoint, MSI 00
> Capabilities: [100] Advanced Error Reporting
> Capabilities: [140] Device Serial Number 00-00-00-01-00-4c-e0-00
> Kernel driver in use: rtsx_pci
> Kernel modules: rtsx_pci
> 
> When it doesn't work, it doesn't generate any interrupts on device insertion
> and removal and this seems to be reproducible 100% of the time.
> 
> Bisection turned up
> 
> commit de468b755464426c276df2daf1e54bcd64186020
> Merge: b1801bf05964 d6112f8def51
> Author: Bjorn Helgaas 
> Date:   Sat Oct 20 11:45:28 2018 -0500
> 
> Merge branch 'pci/enumeration'
> 
>   - Remove x86 and arm64 node-local allocation for host bridge structures
> (Punit Agrawal)
> 
>   - Pay attention to device-specific _PXM node values (Jonathan Cameron)
> 
>   - Support new Immediate Readiness bit (Felipe Balbi)
> 
> * pci/enumeration:
>   PCI: Add support for Immediate Readiness
>   ACPI/PCI: Pay attention to device-specific _PXM node values
>   x86/PCI: Remove node-local allocation when initialising host controller
>   arm64: PCI: Remove node-local allocations when initialising host 
> controller
> 
> as the first bad commit, but the "PCI: Add support for Immediate Readiness"
> one was tested as "good" (full bisect log is attached).
> 
> I wonder if you have any ideas on what to check?

So reverting 17c91487364f (PCI/ASPM: Do not initialize link state when
aspm_disabled is set) on top of 4.20-rc4 makes the problem go away.

I guess that the device in question needs pcie_aspm_cap_init() to be
called for it even though the FADT says "NO_ASPM".

Thanks,
Rafael





[Bug] SD card reader in Acer Aspire S5 broken in 4.20-rc

2018-11-26 Thread Rafael J. Wysocki
Hi Bjorn,

The SD card reader in my Acer Aspire S5 doesn't work with 4.20-rc.

Here's what lspci -v says about it (in a bad kernel):

02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 PCI 
Express Card Reader 
(rev 01)
Subsystem: Acer Incorporated [ALI] Device 0704
Flags: bus master, fast devsel, latency 0, IRQ 35
Memory at d9001000 (32-bit, non-prefetchable) [size=4K]
Capabilities: [40] Power Management version 3
Capabilities: [50] MSI: Enable+ Count=1/1 Maskable- 64bit+
Capabilities: [70] Express Endpoint, MSI 00
Capabilities: [100] Advanced Error Reporting
Capabilities: [140] Device Serial Number 00-00-00-01-00-4c-e0-00
Kernel driver in use: rtsx_pci
Kernel modules: rtsx_pci

When it doesn't work, it doesn't generate any interrupts on device insertion
and removal and this seems to be reproducible 100% of the time.

Bisection turned up

commit de468b755464426c276df2daf1e54bcd64186020
Merge: b1801bf05964 d6112f8def51
Author: Bjorn Helgaas 
Date:   Sat Oct 20 11:45:28 2018 -0500

Merge branch 'pci/enumeration'

  - Remove x86 and arm64 node-local allocation for host bridge structures
(Punit Agrawal)

  - Pay attention to device-specific _PXM node values (Jonathan Cameron)

  - Support new Immediate Readiness bit (Felipe Balbi)

* pci/enumeration:
  PCI: Add support for Immediate Readiness
  ACPI/PCI: Pay attention to device-specific _PXM node values
  x86/PCI: Remove node-local allocation when initialising host controller
  arm64: PCI: Remove node-local allocations when initialising host 
controller

as the first bad commit, but the "PCI: Add support for Immediate Readiness"
one was tested as "good" (full bisect log is attached).

I wonder if you have any ideas on what to check?

Cheers,
Rafael
git bisect start
# good: [84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d] Linux 4.19
git bisect good 84df9525b0c27f3ebc2ebb1864fa62a97fdedb7d
# bad: [2e6e902d185027f8e3cb8b7305238f7e35d6a436] Linux 4.20-rc4
git bisect bad 2e6e902d185027f8e3cb8b7305238f7e35d6a436
# bad: [18d0eae30e6a4f8644d589243d7ac1d70d29203d] Merge tag 'char-misc-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc
git bisect bad 18d0eae30e6a4f8644d589243d7ac1d70d29203d
# good: [50b825d7e87f4cff7070df6eb26390152bb29537] Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
git bisect good 50b825d7e87f4cff7070df6eb26390152bb29537
# bad: [3acbd2de6bc3af215c6ed7732dfc097d1e238503] Merge tag 'sound-4.20-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tiwai/sound
git bisect bad 3acbd2de6bc3af215c6ed7732dfc097d1e238503
# good: [a41efc2a0f68cea26665ab9e6d991c9bf33b3f59] Merge tag 'dmaengine-4.20-rc1' of git://git.infradead.org/users/vkoul/slave-dma
git bisect good a41efc2a0f68cea26665ab9e6d991c9bf33b3f59
# bad: [d49f8a52b15bf35db778035340d8a673149f9f93] Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
git bisect bad d49f8a52b15bf35db778035340d8a673149f9f93
# good: [6c404a68bf83b4135a8a9aa1c388ebdf98e8ba7f] scsi: dc395x: fix DMA API usage in sg_update_list
git bisect good 6c404a68bf83b4135a8a9aa1c388ebdf98e8ba7f
# bad: [61ce5809570b38710f08ebfcd44a5b06777d] Merge branch 'remotes/lorenzo/pci/cadence'
git bisect bad 61ce5809570b38710f08ebfcd44a5b06777d
# good: [e51cd9ce5dd3b10f9e67a30a4dc00fc1fa80c673] PCI/AER: Refactor error injection fallbacks
git bisect good e51cd9ce5dd3b10f9e67a30a4dc00fc1fa80c673
# bad: [ee8360fdafac54eefd0df69fbd99338896cf806b] Merge branch 'pci/misc'
git bisect bad ee8360fdafac54eefd0df69fbd99338896cf806b
# good: [fb513f60ea58f096be7006f899e2181762af37cb] NTB: switchtec_ntb: Update switchtec documentation with prerequisites for NTB
git bisect good fb513f60ea58f096be7006f899e2181762af37cb
# good: [b1801bf05964321d79fbbeae96c8ab09da2e2e49] Merge branch 'pci/aspm'
git bisect good b1801bf05964321d79fbbeae96c8ab09da2e2e49
# good: [d6112f8def514e019658bcc9b57d53acdb71ca3f] PCI: Add support for Immediate Readiness
git bisect good d6112f8def514e019658bcc9b57d53acdb71ca3f
# bad: [20634dc361e1c5fe2dae380a7d0a21ca7f32c4f7] Merge branch 'pci/hotplug'
git bisect bad 20634dc361e1c5fe2dae380a7d0a21ca7f32c4f7
# bad: [de468b755464426c276df2daf1e54bcd64186020] Merge branch 'pci/enumeration'
git bisect bad de468b755464426c276df2daf1e54bcd64186020
# first bad commit: [de468b755464426c276df2daf1e54bcd64186020] Merge branch 'pci/enumeration'


Re: [RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-23 Thread Rafael J. Wysocki
On Friday, November 23, 2018 11:35:38 AM CET Rafael J. Wysocki wrote:
> From: Rafael J. Wysocki 
> 
> The venerable menu governor does some thigns that are quite
> questionable in my view.
> 
> First, it includes timer wakeups in the pattern detection data and
> mixes them up with wakeups from other sources which in some cases
> causes it to expect what essentially would be a timer wakeup in a
> time frame in which no timer wakeups are possible (becuase it knows
> the time until the next timer event and that is later than the
> expected wakeup time).
> 
> Second, it uses the extra exit latency limit based on the predicted
> idle duration and depending on the number of tasks waiting on I/O,
> even though those tasks may run on a different CPU when they are
> woken up.  Moreover, the time ranges used by it for the sleep length
> correction factors depend on whether or not there are tasks waiting
> on I/O, which again doesn't imply anything in particular, and they
> are not correlated to the list of available idle states in any way
> whatever.
> 
> Also, the pattern detection code in menu may end up considering
> values that are too large to matter at all, in which cases running
> it is a waste of time.
> 
> A major rework of the menu governor would be required to address
> these issues and the performance of at least some workloads (tuned
> specifically to the current behavior of the menu governor) is likely
> to suffer from that.  It is thus better to introduce an entirely new
> governor without them and let everybody use the governor that works
> better with their actual workloads.
> 
> The new governor introduced here, the timer events oriented (TEO)
> governor, uses the same basic strategy as menu: it always tries to
> find the deepest idle state that can be used in the given conditions.
> However, it applies a different approach to that problem.
> 
> First, it doesn't use "correction factors" for the time till the
> closest timer, but instead it tries to correlate the measured idle
> duration values with the available idle states and use that
> information to pick up the idle state that is most likely to "match"
> the upcoming CPU idle interval.
> 
> Second, it doesn't take the number of "I/O waiters" into account at
> all and the pattern detection code in it avoids taking timer wakeups
> into account.  It also only uses idle duration values less than the
> current time till the closest timer (with the tick excluded) for that
> purpose.
> 
> Signed-off-by: Rafael J. Wysocki 
> ---
> 
> v5 -> v6:
>  * Avoid applying poll_time_limit to non-polling idle states by mistake.
>  * Use idle duration measured by the governor for everything (as it likely is
>more accurate than the one measured by the core).

This particular change is actually missing, sorry about that.  It is not
essential, however, so the v6 should be good enough as is for evaluation
and review purposes.

>  * Rename SPIKE to PULSE.
>  * Do not run pattern detection upfront.  Instead, use recent idle duration
>values to refine the state selection after finding a candidate idle state.
>  * Do not use the expected idle duration as an extra latency constraint
>(exit latency is less than the target residency for all of the idle states
>known to me anyway, so this doesn't change anything in practice).

Thanks,
Rafael



[RFC/RFT][PATCH v6] cpuidle: New timer events oriented governor for tickless systems

2018-11-23 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some thigns that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (becuase it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up.  Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account.  It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki 
---

v5 -> v6:
 * Avoid applying poll_time_limit to non-polling idle states by mistake.
 * Use idle duration measured by the governor for everything (as it likely is
   more accurate than the one measured by the core).
 * Rename SPIKE to PULSE.
 * Do not run pattern detection upfront.  Instead, use recent idle duration
   values to refine the state selection after finding a candidate idle state.
 * Do not use the expected idle duration as an extra latency constraint
   (exit latency is less than the target residency for all of the idle states
   known to me anyway, so this doesn't change anything in practice).

v4 -> v5:
 * Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum
   sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  450 +
 3 files changed, 462 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,450 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by taking a few most recent idle time intervals of the
+ * CPU

[GIT PULL] ACPI fix for v4.20-rc4

2018-11-23 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.20-rc4

with top-most commit 2bbb5fa37475d7aa5fa62f34db1623f3da2dfdfa

 ACPI / platform: Add SMB0001 HID to forbidden_id_list

on top of commit 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6

 Linux 4.20-rc3

to receive an ACPI fix for 4.20-rc4.

This prevents the ACPI core from registering a platform device for
the SMB0001 HID to avoid IRQ allocation issues (Hans de Goede).

Thanks!


---

Hans de Goede (1):
  ACPI / platform: Add SMB0001 HID to forbidden_id_list

---

 drivers/acpi/acpi_platform.c | 1 +
 1 file changed, 1 insertion(+)


[GIT PULL] Power management fixes for v4.20-rc4

2018-11-23 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 pm-4.20-rc4

with top-most commit 1d50088ca3956e5dcd2751a658e7869b9af10bb4

 Merge branches 'pm-cpufreq' and 'pm-sleep'

on top of commit 9ff01193a20d391e8dbce4403dd5ef87c7eaaca6

 Linux 4.20-rc3

to receive power management fixes for 4.20-rc4.

These fix two issues in the Operating Performance Points (OPP)
framework, one cpufreq driver issue, one problem related to the
tasks freezer and a few build-related issues in the cpupower
utility.

Specifics:

 - Fix tasks freezer deadlock in de_thread() that occurs if one
   of its sub-threads has been frozen already (Chanho Min).

 - Avoid registering a platform device by the ti-cpufreq driver
   on platforms that cannot use it (Dave Gerlach).

 - Fix a mistake in the ti-opp-supply operating performance points
   (OPP) driver that caused an incorrect reference voltage to be
   used and make it adjust the minimum voltage dynamically to avoid
   hangs or crashes in some cases (Keerthy).

 - Fix issues related to compiler flags in the cpupower utility
   and correct a linking problem in it by renaming a file with
   a duplicate name (Jiri Olsa, Konstantin Khlebnikov).

Thanks!


---

Chanho Min (1):
  exec: make de_thread() freezable

Dave Gerlach (1):
  cpufreq: ti-cpufreq: Only register platform_device when supported

Jiri Olsa (2):
  tools cpupower debug: Allow to use outside build flags
  tools cpupower: Override CFLAGS assignments

Keerthy (2):
  opp: ti-opp-supply: Dynamically update u_volt_min
  opp: ti-opp-supply: Correct the supply in _get_optimal_vdd_voltage call

Konstantin Khlebnikov (1):
  tools/power/cpupower: fix compilation with STATIC=true

---

 drivers/cpufreq/ti-cpufreq.c   | 26 +-
 drivers/opp/ti-opp-supply.c|  5 -
 fs/exec.c  |  5 +++--
 tools/power/cpupower/Makefile  | 12 ++--
 tools/power/cpupower/bench/Makefile|  2 +-
 tools/power/cpupower/debug/x86_64/Makefile |  4 ++--
 tools/power/cpupower/lib/cpufreq.c |  2 +-
 tools/power/cpupower/lib/cpuidle.c |  2 +-
 tools/power/cpupower/lib/cpupower.c|  4 ++--
 tools/power/cpupower/lib/cpupower_intern.h |  2 +-
 10 files changed, 42 insertions(+), 22 deletions(-)


Re: [PATCH v9 09/15] sched: Introduce sched_energy_present static key

2018-11-22 Thread Rafael J. Wysocki
On Thu, Nov 22, 2018 at 4:25 PM Quentin Perret  wrote:
>
> On Thursday 22 Nov 2018 at 11:25:45 (+0100), Peter Zijlstra wrote:
> > On Thu, Nov 22, 2018 at 09:32:39AM +, Quentin Perret wrote:
> > > Hmm, I went too fast, that's totally broken. But there's still something
> > > we can do with static_branch_{inc,dec} I think. I'll come back later
> > > with a better solution.
> >
> > Right; if you count the rd's that have pd set, it should work-ish. Yes,
> > much cleaner if you can get it to work.
>
> So, I came up with the following code which seems to work OK. It's not
> as I clean as I'd like, though. The fact that free_pd() can be called in
> softirq context is annoying to manipulate the static key ...
>
> An alternative to this work item workaround is to do static_branch_dec()
> from build_perf_domains() and next to the three call sites of
> free_rootdomain() in order to avoid the call_rcu() context. Not very
> pretty either.
>
> Or we can just stick with your original suggestion to carry a boolean
> around.

What's problematic with carrying a boolean around?


Re: [PATCH v9 08/15] sched/topology: Make Energy Aware Scheduling depend on schedutil

2018-11-22 Thread Rafael J. Wysocki
On Thu, Nov 22, 2018 at 3:05 PM Peter Zijlstra  wrote:
>
> On Mon, Nov 19, 2018 at 02:18:50PM +, Quentin Perret wrote:
> > + if (rd->pd)
> > + pr_warn("rd %*pbl: Disabling EAS, schedutil 
> > is mandatory\n",
> > + cpumask_pr_args(cpu_map));
> > + goto free;
>
> I would have preferred just not allowing the governor to change at all,
> but if you want to allow that, this is, I suppose, the best you can do.
>
> Rafael, are you otherwise OK with the grubbing inside cpufreq?

Sort of. :-)

I would put the sched_cpufreq_governor_change() header (and the static
inline stub of it) into linux/cpufreq.h (instead of
linux/sched/cpufreq.h) so that cpufreq.c doesn't have to include
linux/sched/cpufreq.h which is kind of confusing.

Apart from that the changes are fine by me.


Re: [RFC/RFT][PATCH v5] cpuidle: New timer events oriented governor for tickless systems

2018-11-21 Thread Rafael J. Wysocki
On Thursday, November 15, 2018 4:15:59 AM CET Rafael J. Wysocki wrote:
> On Sunday, November 11, 2018 4:40:17 PM CET Peter Zijlstra wrote:

[cut]

> 
> > Anyway; a fair while ago I proposed a different estimator. I've not had
> > time to dig through the 4 prior versions so I cannot tell if you've
> > already tried this, but the idea was simple:
> > 
> >   - track the last @n wakeup distances in the @idle-states buckets;
> >   - sum the buckets in increasing idle state and pick the state before
> > you reach 50% of @n.
> > 
> > That is computationally cheaper than what you have; and should allow you
> > to increase @n without making the computation more expensive.
> 
> I can do something similar (actually, I have a prototype ready already),
> but do it after walking the idle states once (which will give us a preliminary
> idle state choice based on the time to the closest timer and the "history")
> and then sum the buckets below the idle state selected so far in the 
> decreasing
> order (that will tend to select a shallower state in "tie" situations).

I did that, but the result was not encouraging as it caused state 0 (polling)
to be selected way too often (and the total time spent in it was significantly
greater too).

I have gone back to averaging over the most recent observed idle duration
values, but now I'm doing that after selecting a candidate idle state
(based on the time to the closest timer and the "history") and only taking
values below the target residency of that state into account.  Seems to work
reasonably well.

I'll send a v6 tomorrow if all goes well.



Re: [PATCH v2] exec: make de_thread() freezable

2018-11-21 Thread Rafael J. Wysocki
On Monday, November 12, 2018 9:15:18 AM CET Oleg Nesterov wrote:
> On 11/12, Chanho Min wrote:
> >
> > @@ -1083,7 +1084,7 @@ static int de_thread(struct task_struct *tsk)
> > while (sig->notify_count) {
> > __set_current_state(TASK_KILLABLE);
> > spin_unlock_irq(lock);
> > -   schedule();
> > +   freezable_schedule();
> > if (unlikely(__fatal_signal_pending(tsk)))
> > goto killed;
> > spin_lock_irq(lock);
> > @@ -,7 +1112,7 @@ static int de_thread(struct task_struct *tsk)
> > __set_current_state(TASK_KILLABLE);
> > write_unlock_irq(_lock);
> > cgroup_threadgroup_change_end(tsk);
> > -   schedule();
> > +   freezable_schedule();
> > if (unlikely(__fatal_signal_pending(tsk)))
> > goto killed;
> > }
> 
> Thanks, looks good to me.
> 
> Acked-by: Oleg Nesterov 

Patch applied, thanks!



Re: [GIT PULL] cpupower update for Linux 4.20-rc4

2018-11-21 Thread Rafael J. Wysocki
On Tue, Nov 20, 2018 at 10:16 PM shuah  wrote:
>
> Hi Rafael,
>
> Please pull the following update for Linux 4.20-rc4 or rc5 depending on
> your pm pull schedule to Linus.
>
> This cpupower update for Linux 4.20-rc4 consists of compile fixes to
> allow use of outside build flags and override of CFLAGS from Jiri Olsa,
> and fix to compilation with STATIC=true from Konstantin Khlebnikov.
>
> diff is attached.

Pulled, thanks!


Re: [PATCH v2 0/4] device property: Add fwnode_get_name() helper

2018-11-20 Thread Rafael J. Wysocki
On Thursday, November 8, 2018 5:51:52 PM CET Heikki Krogerus wrote:
> Hi,
> 
> This is the second version of my proposal for this helper. The
> first version can be checked here:
> https://lkml.org/lkml/2018/11/5/326
> 
> In order to support also ACPI properly, I decided to change the API.
> The function fwnode_name() is now fwnode_get_name(), and instead of
> returning pointer to the name, the function copies it to a buffer. I
> did that because acpica does not offer a way to get a pointer to the
> node name, and the name is clearly expected to be accessed only with
> the namespace lock held.
> 
> I think this is better approach in any case. It will also solve the
> problem of getting rid of the unit-address part from DT node names.
> 
> Let me know what you guys think.
> 
> --
> heikki
> 
> 
> Heikki Krogerus (4):
>   device property: Introduce fwnode_get_name()
>   ACPI: property: Add acpi_fwnode_name()
>   of/property: Add of_fwnode_name()
>   device property: Drop get_named_child_node callback
> 
>  drivers/acpi/property.c  | 43 ++--
>  drivers/base/property.c  | 34 ++-
>  drivers/of/property.c| 26 ++--
>  include/linux/fwnode.h   |  6 +++---
>  include/linux/property.h |  2 ++
>  5 files changed, 73 insertions(+), 38 deletions(-)
> 
> 

I will be expecting at least one more iteration of this series to address the
Rob's comments.

Thanks,
Rafael




Re: [PATCH v2] cpuidle: big.LITTLE: add of_node_put()

2018-11-20 Thread Rafael J. Wysocki
On Tue, Nov 20, 2018 at 4:46 PM Yangtao Li  wrote:
>
> The of_node_put() is missing.So we call the of_node_put() to release
> the refcount.

I would say "of_find_node_by_path() acquires a reference to the node
returned by it and that reference needs to be dropped by its caller.
bl_idle_init() doesn't do that, so fix it."

Thanks,
Rafael


Re: [PATCH] cpuidle: big.LITTLE: add of_node_put()

2018-11-20 Thread Rafael J. Wysocki
On Tue, Nov 20, 2018 at 2:00 PM Yangtao Li  wrote:
>
> use of_node_put() to release the refcount.

I gather that the of_node_put() is missing?

If so, that should be stated in the changelog too.

> Signed-off-by: Yangtao Li 
> ---
>  drivers/cpuidle/cpuidle-big_little.c | 8 ++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/cpuidle/cpuidle-big_little.c 
> b/drivers/cpuidle/cpuidle-big_little.c
> index db2ede565f1a..a3b8bc09bac8 100644
> --- a/drivers/cpuidle/cpuidle-big_little.c
> +++ b/drivers/cpuidle/cpuidle-big_little.c
> @@ -174,8 +174,10 @@ static int __init bl_idle_init(void)
> /*
>  * Initialize the driver just for a compliant set of machines
>  */
> -   if (!of_match_node(compatible_machine_match, root))
> -   return -ENODEV;
> +   if (!of_match_node(compatible_machine_match, root)){
> +   ret = -ENODEV;
> +   goto out_release_node;
> +   }
>
> if (!mcpm_is_available())
> return -EUNATCH;
> @@ -222,6 +224,8 @@ static int __init bl_idle_init(void)
> kfree(bl_idle_big_driver.cpumask);
>  out_uninit_little:
> kfree(bl_idle_little_driver.cpumask);
> +out_release_node:
> +   of_node_put(root);
>
> return ret;
>  }
> --
> 2.17.0
>


Re: [PATCH 6/7] acpi: Create subtable parsing infrastructure

2018-11-19 Thread Rafael J. Wysocki
On Wed, Nov 14, 2018 at 11:53 PM Keith Busch  wrote:
>
> Parsing entries in an ACPI table had assumed a generic header structure
> that is most common. There is no standard ACPI header, though, so less
> common types would need custom parsers if they want go walk their
> subtable entry list.
>
> Create the infrastructure for adding different table types so parsing
> the entries array may be more reused for all ACPI system tables.
>
> Signed-off-by: Keith Busch 
> ---
>  drivers/acpi/tables.c | 75 
> ---
>  1 file changed, 65 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
> index 61203eebf3a1..15ee77780f68 100644
> --- a/drivers/acpi/tables.c
> +++ b/drivers/acpi/tables.c
> @@ -49,6 +49,19 @@ static struct acpi_table_desc 
> initial_tables[ACPI_MAX_TABLES] __initdata;
>
>  static int acpi_apic_instance __initdata;
>
> +enum acpi_subtable_type {
> +   ACPI_SUBTABLE_COMMON,
> +};
> +
> +union acpi_subtable_headers {
> +   struct acpi_subtable_header common;
> +};
> +
> +struct acpi_subtable_entry {
> +   union acpi_subtable_headers *hdr;
> +   enum acpi_subtable_type type;
> +};
> +
>  /*
>   * Disable table checksum verification for the early stage due to the size
>   * limitation of the current x86 early mapping implementation.
> @@ -217,6 +230,45 @@ void acpi_table_print_madt_entry(struct 
> acpi_subtable_header *header)
> }
>  }
>
> +static unsigned long __init
> +acpi_get_entry_type(struct acpi_subtable_entry *entry)
> +{
> +   switch (entry->type) {
> +   case ACPI_SUBTABLE_COMMON:
> +   return entry->hdr->common.type;
> +   }
> +   WARN_ONCE(1, "invalid acpi type\n");
> +   return 0;
> +}
> +
> +static unsigned long __init
> +acpi_get_entry_length(struct acpi_subtable_entry *entry)
> +{
> +   switch (entry->type) {
> +   case ACPI_SUBTABLE_COMMON:
> +   return entry->hdr->common.length;
> +   }
> +   WARN_ONCE(1, "invalid acpi type\n");

AFAICS this does a WARN_ONCE() on information obtained from firmware.

That is not a kernel problem, so generating traces in that case is not
a good idea IMO.  Moreover, users can't really do much about this in
the majority of cases, so a pr_info() message should be sufficient.

And similarly below.


Re: [RFC/RFT][PATCH v5] cpuidle: New timer events oriented governor for tickless systems

2018-11-14 Thread Rafael J. Wysocki
On Sunday, November 11, 2018 4:40:17 PM CET Peter Zijlstra wrote:
> On Thu, Nov 08, 2018 at 06:25:07PM +0100, Rafael J. Wysocki wrote:
> > +unsigned int teo_idle_duration(struct cpuidle_driver *drv,
> > +  struct teo_cpu *cpu_data,
> > +  unsigned int sleep_length_us)
> > +{
> > +   u64 range, max_spread, sum, max, min;
> > +   unsigned int i, count;
> > +
> > +   /*
> > +* If the sleep length is below the target residency of idle state 1,
> > +* the only viable choice is to select the first available (enabled)
> > +* idle state, so return immediately in that case.
> > +*/
> > +   if (sleep_length_us < drv->states[1].target_residency)
> > +   return sleep_length_us;
> > +
> > +   /*
> > +* The purpose of this function is to check if there is a pattern of
> > +* wakeups indicating that it would be better to select a state
> > +* shallower than the deepest one matching the sleep length or the
> > +* deepest one at all if the sleep lenght is long.  Larger idle duration
> > +* values are beyond the interesting range.
> > +*/
> > +   range = drv->states[drv->state_count-1].target_residency;
> > +   range = min_t(u64, sleep_length_us, range + (range >> 2));
> > +
> > +   /*
> > +* This is the value to compare with the distance between the average
> > +* and the greatest sample to decide whether or not it is small enough.
> > +* Take 10 us as the total cap of it.
> > +*/
> > +   max_spread = max_t(u64, range >> MAX_SPREAD_SHIFT, 10);
> > +
> > +   /*
> > +* First pass: compute the sum of interesting samples, the minimum and
> > +* maximum of them and count them.
> > +*/
> > +   count = 0;
> > +   sum = 0;
> > +   max = 0;
> > +   min = UINT_MAX;
> > +
> > +   for (i = 0; i < INTERVALS; i++) {
> > +   u64 val = cpu_data->intervals[i];
> > +
> > +   if (val >= range)
> > +   continue;
> > +
> > +   count++;
> > +   sum += val;
> > +   if (max < val)
> > +   max = val;
> > +
> > +   if (min > val)
> > +   min = val;
> > +   }
> > +
> > +   /* Give up if the number of interesting samples is too small. */
> > +   if (count <= INTERVALS / 2)
> > +   return sleep_length_us;
> > +
> > +   /*
> > +* If the distance between the max or min and the average is too large,
> > +* try to refine by discarding the max, as long as the count is above 3.
> > +*/
> > +   while (count > 3 && max > max_spread &&
> > +  ((max - max_spread) * count > sum ||
> > +   (min + max_spread) * count < sum)) {
> > +
> > +   range = max;
> > +
> > +   /*
> > +* Compute the sum of samples in the interesting range.  Count
> > +* them and find the maximum of them.
> > +*/
> > +   count = 0;
> > +   sum = 0;
> > +   max = 0;
> > +
> > +   for (i = 0; i < INTERVALS; i++) {
> > +   u64 val = cpu_data->intervals[i];
> > +
> > +   if (val >= range)
> > +   continue;
> > +
> > +   count++;
> > +   sum += val;
> > +   if (max < val)
> > +   max = val;
> > +   }
> > +   }
> > +
> > +   return div64_u64(sum, count);
> > +}
> 
> By always discarding the larger value; you're searching for the first or
> shortest peak, right?

Yes, basically.

> While that is always a safe value; it might not be the best value.

Sure.

> Also; I think you can write the whole thing shorter; maybe like:
> 
> 
>   do {
>   count = sum = max = 0;
>   min = UINT_MAX;
> 
>   for (i = 0; i < INTERVALS; i++) {
>   u64 val = cpu_data->intervals[i];
> 
>   if (val >= range)
>   continue;
> 
>   count++;
>   sum += val;
>   max = max(max, val);
>   min = min(min, val);
>   }
> 
>   range = max;
> 
>   } while (count > 3 && max > max_spread &&
>((ma

Re: [RFC/RFT][PATCH v5] cpuidle: New timer events oriented governor for tickless systems

2018-11-14 Thread Rafael J. Wysocki
On Sunday, November 11, 2018 4:20:34 PM CET Peter Zijlstra wrote:
> On Thu, Nov 08, 2018 at 06:25:07PM +0100, Rafael J. Wysocki wrote:
> > +/*
> > + * The SPIKE value is added to metrics when they grow and the DECAY_SHIFT 
> > value
> > + * is used for decreasing metrics on a regular basis.
> > + */
> > +#define SPIKE  1024
> > +#define DECAY_SHIFT3
> 
> > +   if (idx_timer >= 0) {
> > +   unsigned int hits = cpu_data->states[idx_timer].hits;
> > +   unsigned int misses = cpu_data->states[idx_timer].misses;
> > +
> > +   hits -= hits >> DECAY_SHIFT;
> > +   misses -= misses >> DECAY_SHIFT;
> > +
> > +   if (idx_timer > idx_hit) {
> > +   misses += SPIKE;
> > +   if (idx_hit >= 0)
> > +   cpu_data->states[idx_hit].early_hits += SPIKE;
> > +   } else {
> > +   hits += SPIKE;
> > +   }
> > +
> > +   cpu_data->states[idx_timer].misses = misses;
> > +   cpu_data->states[idx_timer].hits = hits;
> > +   }
> 
> That's a pulse-density-modulator, with a signal bound of:
> 
>   1024 == x/8 -> x := 8192
> 
> Anyway; I would suggest s/SPIKE/PULSE/ because of that.

OK




Re: [RFC/RFT][PATCH v5] cpuidle: New timer events oriented governor for tickless systems

2018-11-14 Thread Rafael J. Wysocki
On Saturday, November 10, 2018 8:10:01 PM CET Giovanni Gherdovich wrote:
> On Thu, 2018-11-08 at 18:25 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > Subject: [PATCH] cpuidle: New timer events oriented governor for tickless 
> > systems
> > 

[cut]

> [NOTE: the tables in this message are quite wide, ~130 columns. If this
> doesn't get to you properly formatted you can read a copy of this message at
> the URL https://beta.suse.com/private/ggherdovich/teo-eval/teo-eval.html ]
> 
> 
> Hello Rafael,
> 
> I have results for v3 and v5. Regarding v4, I made a mistake and didn't get
> valid data; as I saw v5 coming shortly after, I didn't rerun v4.
> 
> I'm replying to the v5 thread because that's where these results belong, but
> I'm quoting your text from the v2 email at
> https://lore.kernel.org/lkml/4168371.zz0pvzt...@aspire.rjw.lan so that's
> easier to follow along.

Thanks for the results, much appreciated!

> The quick summary is:
> 
> ---> sockperf on loopback over UDP, mode "throughput":
>  this had a 12% regression in v2 on 48x-HASWELL-NUMA, which is completely
>  recovered in v3 and v5. Good stuff.

That's good news, thanks!

> ---> dbench on xfs:
>  this was down 16% in v2 on 48x-HASWELL-NUMA. On v5 we're at a 10%
>  regression. Slight improvement. What's really hurting here is the single
>  client scenario.
> 
> ---> netperf-udp on loopback:
>  had 6% regression on v2 on 8x-SKYLAKE-UMA, which is the same as what
>  happens in v5.
> 
> ---> tbench on loopback:
>  was down 10% in v2 on 8x-SKYLAKE-UMA, now slightly worse in v5 with a 12%
>  regression. As in dbench, it's at low number of clients that the results
>  are worst. Note that this machine is different from the one that has the
>  dbench regression.

Clearly, playing with the pattern detection part of the governor alone is not
sufficient to make all of the workloads happy at the same time.

> A more detailed report follows below.
> 
> I maintain my original opinion from v2 that this governor is largely
> performance-neutral and I'm not overly worried about the numbers above:

OK, fair enough.

Cheers,
Rafael



Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-14 Thread Rafael J. Wysocki
On Wed, Nov 14, 2018 at 7:26 AM Doug Smythies  wrote:
>
> On 2018.11.08 00:00 Rafael J. Wysocki wrote:
> > On Wednesday, November 7, 2018 6:04:12 PM CET Doug Smythies wrote:
> >> On 2018.11.04 08:31 Rafael J. Wysocki wrote:
>
> ...[snip]...
> >> The results are:
> >> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
> >> http://fast.smythies.com/linux-pm/k420/histo_compare.htm
>
> ...[snip]...
>
> >> There are some odd long idle durations with TEOv3 for idle
> >> states 1, 2, and 3 that I'll watch for with v4 testing.
> >
> > That unfortunately is a result of bugs in the v4 (and v2 - v3 too).
> >
> > Namely, it doesn't take the cases when the tick has been stopped already
> > into account correctly.  IOW, all of the data points beyond the tick 
> > boundary
> > should go into the "final" peak.
> >
> > I'll send a v5.
>
> With v5 there were no long idle durations for idle states 1, 2, and 3 for
> this same Phoronix dbench test.

That's good news, thank you!


[GIT PULL] ACPI fix for v4.20-rc3

2018-11-14 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.20-rc3

with top-most commit 017ce359a7187192df5222a00fa3c9055eb3736d

 ACPI / PMIC: xpower: fix IOSF_MBI dependency

on top of commit 651022382c7f8da46cb4872a545ee1da6d097d2a

 Linux 4.20-rc1

to receive an ACPI fix for 4.20-rc3.

This fixes a recently introduced build issue in the xpower PMIC
driver (Arnd Bergmann).

Thanks!


---

Arnd Bergmann (1):
  ACPI / PMIC: xpower: fix IOSF_MBI dependency

---

 drivers/acpi/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)


[RFC/RFT][PATCH v5] cpuidle: New timer events oriented governor for tickless systems

2018-11-08 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 
Subject: [PATCH] cpuidle: New timer events oriented governor for tickless 
systems

The venerable menu governor does some thigns that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (becuase it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up.  Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account.  It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki 
---

v4 -> v5:
 * Avoid using shallow idle states when the tick has been stopped already.

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum
   sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  521 +
 3 files changed, 533 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,521 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by detecting patterns among recent idle time intervals
+ * of the CPU.  However, even in that case it is not necessary to take idle
+ * duration values greater than the time till the closest timer into account, 
as
+ * the patterns that they may belong to produce average values close enough to
+ * the time till the closest timer (sleep length) anyway.
+ *
+ * Thus this governor estimates whether or not the upcoming idle time of the 
CPU
+ * is likely to be significantly shorter than the sleep length and selects an
+ * idle state for it in accordance with that, as follows:
+ *
+ * - If there is 

Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-08 Thread Rafael J. Wysocki
On Wednesday, November 7, 2018 6:04:12 PM CET Doug Smythies wrote:
> On 2018.11.04 08:31 Rafael J. Wysocki wrote:
> 
> > v2 -> v3:
> > * Simplify the pattern detection code and make it return a value
> > lower than the time to the closest timer if the majority of recent
> > idle intervals are below it regardless of their variance (that should
> > cause it to be slightly more aggressive).
> > * Do not count wakeups from state 0 due to the time limit in poll_idle()
> >as non-timer.
> >
> > Note: I will be mostly offline tomorrow, so this goes slightly early.
> > I have tested it only very lightly, but it is not so much different from
> > the previous one.
> > 
> > It requires the same additional patches to apply as the previous one too.
> 
> Even though this v3 has now been superseded by v4, I completed some test
> work in progress for v3 anyhow.

That's useful anyway, thanks for doing that!

> The main reason to complete the work, and write up, was because, and for my
> own interest as much as anything, I wanted to specifically test for the
> influence of running trace on the system under test.
> Reference: https://marc.info/?l=linux-kernel=154145580925439=2
> 
> The Phoronix dbench test was run under the option to run all
> the tests, instead of just one number of clients. This was done
> with a reference/baseline kernel of 4.20-rc1, and also with this
> TEO version 3 patch. The tests were also repeated with trace
> enabled for 5000 seconds. Idle information and processor
> package power were sampled once per minute in all test runs.
> 
> The results are:
> http://fast.smythies.com/linux-pm/k420/k420-dbench-teo3.htm
> http://fast.smythies.com/linux-pm/k420/histo_compare.htm

Thanks a bunch for these!

> Conclusion: trace has negligible effect, until the system gets
> severely overloaded.
> 
> There are some odd long idle durations with TEOv3 for idle
> states 1, 2, and 3 that I'll watch for with v4 testing.

That unfortunately is a result of bugs in the v4 (and v2 - v3 too).

Namely, it doesn't take the cases when the tick has been stopped already
into account correctly.  IOW, all of the data points beyond the tick boundary
should go into the "final" peak.

I'll send a v5.

> Other information:
> Processor: Intel(R) Core(TM) i7-2600K CPU @ 3.40GHz
> The kernels were 1000 Hz.
> Idle latency/residency info:
> STATE: state0   DESC: CPUIDLE CORE POLL IDLENAME: POLL  LATENCY: 0
>   RESIDENCY: 0
> STATE: state1   DESC: MWAIT 0x00NAME: C1LATENCY: 2  
> RESIDENCY: 2
> STATE: state2   DESC: MWAIT 0x01NAME: C1E   LATENCY: 10 
> RESIDENCY: 20
> STATE: state3   DESC: MWAIT 0x10NAME: C3LATENCY: 80 
> RESIDENCY: 211
> STATE: state4   DESC: MWAIT 0x20NAME: C6LATENCY: 104
> RESIDENCY: 345

Thanks,
Rafael



Re: [PATCH] ACPI / PMIC: xpower: fix IOSF_MBI dependency

2018-11-07 Thread Rafael J. Wysocki
On Fri, Nov 2, 2018 at 12:07 PM Arnd Bergmann  wrote:
>
> We still get a link failure with IOSF_MBI=m when the xpower driver
> is built-in:
>
> drivers/acpi/pmic/intel_pmic_xpower.o: In function 
> `intel_xpower_pmic_update_power':
> intel_pmic_xpower.c:(.text+0x4f2): undefined reference to 
> `iosf_mbi_block_punit_i2c_access'
> intel_pmic_xpower.c:(.text+0x5e2): undefined reference to 
> `iosf_mbi_unblock_punit_i2c_access'
>
> This makes the dependency stronger, so we can only build when IOSF_MBI
> is built-in.
>
> Fixes: 6a9b593d4b6f ("ACPI / PMIC: xpower: Add depends on IOSF_MBI to Kconfig 
> entry")
> Signed-off-by: Arnd Bergmann 
> ---
>  drivers/acpi/Kconfig | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
> index 18851e7eedd5..31a3c4a03f61 100644
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -514,7 +514,7 @@ config CRC_PMIC_OPREGION
>
>  config XPOWER_PMIC_OPREGION
> bool "ACPI operation region support for XPower AXP288 PMIC"
> -   depends on MFD_AXP20X_I2C && IOSF_MBI
> +   depends on MFD_AXP20X_I2C && IOSF_MBI=y
> help
>   This config adds ACPI operation region support for XPower AXP288 
> PMIC.
>
> --

At this point I'm inclined to apply the patch as is as a short-term
fix and improvements can be made on top of it.

Any objections?


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-07 Thread Rafael J. Wysocki
On Wed, Nov 7, 2018 at 9:59 AM Peter Zijlstra  wrote:
>
> On Wed, Nov 07, 2018 at 12:39:31AM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
> > >
> > > On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> > > > On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  
> > > > wrote:
> > >
> > > > > Instead of this detector; why haven't you used the code from
> > > > > kernel/irq/timings.c ?
> > > >
> > > > Because it doesn't help much AFAICS.
> > > >
> > > > Wakeups need not be interrupts in particular
> > >
> > > You're alluding to the MWAIT wakeup through the MONITOR address ?
> >
> > Yes.
>
> Right, those will not be accounted for and will need something else.

Precisely. :-)

> > > > and interrupt patterns that show up when the CPU is busy may not be
> > > > relevant for when it is idle.
> > >
> > > I think that is not always true; consider things like the periodic
> > > interrupt from frame rendering or audio; if there is nothing more going
> > > on in the system than say playing your favourite tune, it gets the
> > > 'need more data soon' interrupt from the audio card, wakes up, does a 
> > > little
> > > mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
> > > sleep. Same for video playback I assume, the vsync interrupt for buffer
> > > flips is fairly predictable.
> > >
> > > The interrupt predictor we have in kernel/irq/timings.c should be very
> > > accurate in predicting those interrupts.
> >
> > In the above case the interrupts should produce a detectable pattern
> > of wakeups anyway.
>
> Ah, not so. Suppose you have both the audio and video interrupt going at
> a steady rate but different rate, then the combined pattern isn't
> trivial at all.

It isn't trivial, but it will appear as a "pattern" to the pattern
detector in v4 of the patch.

Basically, all of that is about looking for a reason to select a
shallower idle state than indicated by the time till the closest timer
alone.

>From that standpoint it makes sense to look at several most recent
wakeups and see if a pattern is forming in there, which is what the
code in v4 of the patch does.  It also makes sense to look at
interrupt patters, but that is costly, so the overhead needs to be
justified.  The question to me is whether or not there is any
additional value in that given that the most recent wakeups are (and
IMO need to be) taken into consideration anyway.

> > In general, however, I need to be convinced that interrupts that
> > didn't wake up the CPU from idle are relevant for next wakeup
> > prediction.  I see that this may be the case, but to what extent is
> > rather unclear to me and it looks like calling
> > irq_timings_next_event() would add considerable overhead.
>
> How about we add a (debug) knob so that people can play with it for now?
> If it turns out to be useful, we'll learn.

So I'm inclined to try to invoke irq_timings_next_event() in addition
to looking at the most recent wakeups and see how far that can get us.

With some extra instrumentation in place it should be possible to
verify how much value that brings to the table.


[RFC/RFT][PATCH v4] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some thigns that are quite
questionable in my view.

First, it includes timer wakeups in the pattern detection data and
mixes them up with wakeups from other sources which in some cases
causes it to expect what essentially would be a timer wakeup in a
time frame in which no timer wakeups are possible (becuase it knows
the time until the next timer event and that is later than the
expected wakeup time).

Second, it uses the extra exit latency limit based on the predicted
idle duration and depending on the number of tasks waiting on I/O,
even though those tasks may run on a different CPU when they are
woken up.  Moreover, the time ranges used by it for the sleep length
correction factors depend on whether or not there are tasks waiting
on I/O, which again doesn't imply anything in particular, and they
are not correlated to the list of available idle states in any way
whatever.

Also, the pattern detection code in menu may end up considering
values that are too large to matter at all, in which cases running
it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.

First, it doesn't use "correction factors" for the time till the
closest timer, but instead it tries to correlate the measured idle
duration values with the available idle states and use that
information to pick up the idle state that is most likely to "match"
the upcoming CPU idle interval.

Second, it doesn't take the number of "I/O waiters" into account at
all and the pattern detection code in it avoids taking timer wakeups
into account.  It also only uses idle duration values less than the
current time till the closest timer (with the tick excluded) for that
purpose.

Signed-off-by: Rafael J. Wysocki 
---

v3 -> v4:
 * Make the pattern detection avoid returning too early if the minimum
   sample is too far from the average.
 * Reformat the changelog (as requested by Peter).

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  507 +
 3 files changed, 519 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,507 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by detecting patterns among recent idle time intervals
+ * of the CPU.  However, even in that case it is not necessary to take idle
+ * duration values greater than the time till the closest timer into account, 
as
+ * the patterns that they may belong to produce average values close enough to
+ * the time till the closest timer (sleep length) anyway.
+ *
+ * Thus this governor estimates whether or not the upcoming idle time of the 
CPU
+ * is likely to be significantly shorter than the sleep length and selects an
+ * idle state for it in accordance with that, as follows:
+ *
+ * - If there is a pattern of 5 or more recent non-timer wakeups earlier than
+ *   the closest timer event, expect one more of them to occur and use the
+ *   average of the idle duration va

Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 8:51 PM Peter Zijlstra  wrote:
>
> On Tue, Nov 06, 2018 at 07:19:24PM +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
>
> > > Instead of this detector; why haven't you used the code from
> > > kernel/irq/timings.c ?
> >
> > Because it doesn't help much AFAICS.
> >
> > Wakeups need not be interrupts in particular
>
> You're alluding to the MWAIT wakeup through the MONITOR address ?

Yes.

> > and interrupt patterns that show up when the CPU is busy may not be
> > relevant for when it is idle.
>
> I think that is not always true; consider things like the periodic
> interrupt from frame rendering or audio; if there is nothing more going
> on in the system than say playing your favourite tune, it gets the
> 'need more data soon' interrupt from the audio card, wakes up, does a little
> mp3/flac/ogg/whatever decode to fill up the buffer and goes back to
> sleep. Same for video playback I assume, the vsync interrupt for buffer
> flips is fairly predictable.
>
> The interrupt predictor we have in kernel/irq/timings.c should be very
> accurate in predicting those interrupts.

In the above case the interrupts should produce a detectable pattern
of wakeups anyway.

In general, however, I need to be convinced that interrupts that
didn't wake up the CPU from idle are relevant for next wakeup
prediction.  I see that this may be the case, but to what extent is
rather unclear to me and it looks like calling
irq_timings_next_event() would add considerable overhead.


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 6:04 PM Peter Zijlstra  wrote:
>
> On Sun, Nov 04, 2018 at 05:31:20PM +0100, Rafael J. Wysocki wrote:
> > + * - If there is a pattern of 5 or more recent non-timer wakeups earlier 
> > than
> > + *   the closest timer event, expect one more of them to occur and use the
> > + *   average of the idle duration values corresponding to them to select an
> > + *   idle state for the CPU.
>
>
> > +/**
> > + * teo_idle_duration - Estimate the duration of the upcoming CPU idle time.
> > + * @drv: cpuidle driver containing state data.
> > + * @cpu_data: Governor data for the target CPU.
> > + * @sleep_length_us: Time till the closest timer event in microseconds.
> > + */
> > +unsigned int teo_idle_duration(struct cpuidle_driver *drv,
> > +struct teo_cpu *cpu_data,
> > +unsigned int sleep_length_us)
> > +{
> > + u64 range, max_spread, max, sum;
> > + unsigned int count;
> > +
> > + /*
> > +  * If the sleep length is below the target residency of idle state 1,
> > +  * the only viable choice is to select the first available (enabled)
> > +  * idle state, so return immediately in that case.
> > +  */
> > + if (sleep_length_us < drv->states[1].target_residency)
> > + return sleep_length_us;
> > +
> > + /*
> > +  * The purpose of this function is to check if there is a pattern of
> > +  * wakeups indicating that it would be better to select a state
> > +  * shallower than the deepest one matching the sleep length or the
> > +  * deepest one at all if the sleep lenght is long.  Larger idle 
> > duration
> > +  * values are beyond the interesting range.
> > +  *
> > +  * Narrowing the range of interesting values down upfront also helps 
> > to
> > +  * avoid overflows during the computation below.
> > +  */
> > + range = drv->states[drv->state_count-1].target_residency;
> > + range = min_t(u64, sleep_length_us, range + (range >> 2));
> > +
> > + /*
> > +  * This is the value to compare with the distance between the average
> > +  * and the greatest sample to decide whether or not it is small 
> > enough.
> > +  * Take 10 us as the total cap of it.
> > +  */
> > + max_spread = max_t(u64, range >> MAX_SPREAD_SHIFT, 10);
> > +
> > + max = range;
> > +
> > + do {
> > + u64 cap = max;
> > + int i;
> > +
> > + /*
> > +  * Compute the sum of the saved intervals below the cap and 
> > the
> > +  * sum of of their squares.  Count them and find the maximum
> > +  * interval below the cap.
> > +  */
> > + count = 0;
> > + sum = 0;
> > + max = 0;
> > +
> > + for (i = 0; i < INTERVALS; i++) {
> > + u64 val = cpu_data->intervals[i];
> > +
> > + if (val >= cap)
> > + continue;
> > +
> > + count++;
> > + sum += val;
> > + if (max < val)
> > + max = val;
> > + }
> > +
> > + /*
> > +  * Give up if the total number of interesting samples is too
> > +  * small.
> > +  */
> > + if (cap == range && count <= INTERVALS / 2)
> > + return sleep_length_us;
> > +
> > + /*
> > +  * If the distance between the max and the average is too 
> > large,
> > +  * discard the max an repeat.
> > +  */
> > + } while (count > 3 && max > max_spread && (max - max_spread) * count 
> > > sum);
> > +
> > + return div64_u64(sum, count);
> > +}
>
> Instead of this detector; why haven't you used the code from
> kernel/irq/timings.c ?

Because it doesn't help much AFAICS.

Wakeups need not be interrupts in particular and interrupt patterns
that show up when the CPU is busy may not be relevant for when it is
idle.


Re: [PATCH] driver core: Add branch prediction hints in really_probe()

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 3:43 PM Muchun Song  wrote:
>
> Hi Rafael,
>
> If we want the driver core to test driver remove functions, we can
> enable CONFIG_DEBUG_TEST_DRIVER_REMOVE. This option is
> just for testing it. So, in most cases, the option is disabled and the if
> condition is false. So I think we can add an unlikely() to it.

Yes, it can be added there, but does it really need to be added?

If the conditions are false all the time, the branch predictor in the
processor should be able to deal with it just fine.

And if they are false already at build time, the compiler should just
optimize them away.


Re: [RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-06 Thread Rafael J. Wysocki
On Monday, November 5, 2018 8:32:51 PM CET Giovanni Gherdovich wrote:
> On Sun, 2018-11-04 at 17:31 +0100, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> > 

[cut]

> > +
> > +/**
> > + * teo_update - Update CPU data after wakeup.
> > + * @drv: cpuidle driver containing state data.
> > + * @dev: Target CPU.
> > + */
> > +static void teo_update(struct cpuidle_driver *drv, struct cpuidle_device 
> > *dev)
> > +{
> > +   struct teo_cpu *cpu_data = per_cpu_ptr(_cpus, dev->cpu);
> > +   unsigned int sleep_length_us = 
> > ktime_to_us(cpu_data->sleep_length_ns);
> > +   int i, idx_hit = -1, idx_timer = -1;
> > +   unsigned int measured_us;
> > +
> > +   if (cpu_data->time_span_ns == cpu_data->sleep_length_ns) {
> > +   /* One of the safety nets has triggered (most likely). */
> > +   measured_us = sleep_length_us;
> > +   } else {
> > +   measured_us = dev->last_residency;
> > +   i = cpu_data->last_state;
> > +   if (measured_us >= 2 * drv->states[i].exit_latency)
> > +   measured_us -= drv->states[i].exit_latency;
> > +   else
> > +   measured_us /= 2;
> > +   }
> > +
> 
> I haven't read this v3 yet,

The pattern detection code in it has a minor issue that needs addressing, so
I'm going to send a v4 later today (after testing it somewhat).

> so just a little comment on the bit above (which
> is there since v1).

To be precise, this piece is there in the menu governor too and I thought that
it made sense, so I copied it from there. :-)

> When you check for measured_us >= 2 * exit_latency, is that because
> dev->last_residency is composed by an "entry" latency, then the actual
> residency, and finally the exit_latency? I'm asking about the 2x factor
> there.

If measured_us is less than twice the state's exit latency, the difference
between it and the exit latency may be very small, so it is better to take a
half of it then as an upper bound estimate of it in case there are some
inaccuracies in the measurement.

> If that succeeds, you proceed to remove the exit_latency from
> measured_us... just once. Given how the condition is formulated, I expected
> measured_us -= 2 * exit_latency there.

The exit latency is subtracted once, because we are interested in the time
between asking the hardware to enter the idle state and the wakeup event.

The exit latency is the time between the wakeup event and the first instruction,
so it shouldn't be counted, roughly speaking.

> More: you acknowledge, in that snippet of code, that there can be
> dev->last_residency's smaller than twice the exit_latency, i.e. not even the
> time to entry/exit the state. Am I reading this right? Is that because both
> exit_latency and dev->last_residency are only approximations?

Yes, they are just approximations.

> I actually see quite a few of those extra-short residencies in my traces, even
> with dev->last_residency < exit_latency.

Well, they are expected to occur at least occasionally.

Cheers,
Rafael



Re: [PATCH] driver core: Add branch prediction hints in really_probe()

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 2:47 PM Muchun Song  wrote:
>
> If condition is false in most cases. So, add an unlikely() to the if
> condition, so that the optimizer assumes that the condition is false.
>
> Signed-off-by: Muchun Song 

Have you measured the practical impact of this patch in any way?

> ---
>  drivers/base/dd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 169412ee4ae8..8eba453c4e5d 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -450,7 +450,7 @@ static int really_probe(struct device *dev, struct 
> device_driver *drv)
> bool test_remove = IS_ENABLED(CONFIG_DEBUG_TEST_DRIVER_REMOVE) &&
>!drv->suppress_bind_attrs;
>
> -   if (defer_all_probes) {
> +   if (unlikely(defer_all_probes)) {
> /*
>  * Value of defer_all_probes can be set only by
>  * device_defer_all_probes_enable() which, in turn, will call
> @@ -508,7 +508,7 @@ static int really_probe(struct device *dev, struct 
> device_driver *drv)
> goto probe_failed;
> }
>
> -   if (test_remove) {
> +   if (unlikely(test_remove)) {
> test_remove = false;
>
> if (dev->bus->remove)
> --
> 2.17.1
>


Re: [PATCH v2] kobject: to use pr_warn replace KERN_WARNING

2018-11-06 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 8:58 AM Joe Perches  wrote:
>
> On Tue, 2018-11-06 at 08:49 +0100, Rafael J. Wysocki wrote:
> > On Tue, Nov 6, 2018 at 3:42 AM Bo YU  wrote:
> > > Fix warning form checkpatch, use pr_warn replace KERN_WARNING
> > >
> > > Signed-off-by: Bo YU 
> >
> > First off, IMO, you should not change the existing code just in order
> > to make checkpatch happy about it.  That alone is not a good enough
> > reason for modifying it.
> >
> > If the checkpatch warning indicates an issue like broken white space
> > (and I mean really broken and not lines longer than 80 chars etc),
> > then that may be a reason to modify the existing code, depending.
>
> Existing code is slightly broken.
> There is currently a missing terminating newline in the
> non-switch case match, when msg == NULL;

OK, so this should be explained in the patch changelog.

Saying "I do this to make checkpatch happy" in the changelog is just
not enough IMO (if not outright misleading).


Re: [PATCH] driver core: fix comments for device_block_probing()

2018-11-05 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 8:41 AM Randy Dunlap  wrote:
>
> From: Randy Dunlap 
>
> Correct function name and spelling/typo for device_block_probing()
> in drivers/base/dd.c.
>
> Signed-off-by: Randy Dunlap 
> Cc: Greg Kroah-Hartman 
> Cc: "Rafael J. Wysocki" 

Reviewed-by: Rafael J. Wysocki 

> ---
>  drivers/base/dd.c |4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> --- lnx-420-rc1.orig/drivers/base/dd.c
> +++ lnx-420-rc1/drivers/base/dd.c
> @@ -179,7 +179,7 @@ static void driver_deferred_probe_trigge
>  }
>
>  /**
> - * device_block_probing() - Block/defere device's probes
> + * device_block_probing() - Block/defer device's probes
>   *
>   * It will disable probing of devices and defer their probes instead.
>   */
> @@ -453,7 +453,7 @@ static int really_probe(struct device *d
> if (defer_all_probes) {
> /*
>  * Value of defer_all_probes can be set only by
> -* device_defer_all_probes_enable() which, in turn, will call
> +* device_block_probing() which, in turn, will call
>  * wait_for_device_probe() right after that to avoid any 
> races.
>  */
> dev_dbg(dev, "Driver %s force probe deferral\n", drv->name);
>
>


Re: [PATCH v2] kobject: to use pr_warn replace KERN_WARNING

2018-11-05 Thread Rafael J. Wysocki
On Tue, Nov 6, 2018 at 3:42 AM Bo YU  wrote:
>
> Fix warning form checkpatch, use pr_warn replace KERN_WARNING
>
> Signed-off-by: Bo YU 

First off, IMO, you should not change the existing code just in order
to make checkpatch happy about it.  That alone is not a good enough
reason for modifying it.

If the checkpatch warning indicates an issue like broken white space
(and I mean really broken and not lines longer than 80 chars etc),
then that may be a reason to modify the existing code, depending.

> ---
> changes in v2:
> According to Joe's suggestion,drop newline from msg, otherwise
> it can be unterminated with newline.
> ---
>  lib/kobject_uevent.c | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/lib/kobject_uevent.c b/lib/kobject_uevent.c
> index 63d0816ab23b..1837765ebf01 100644
> --- a/lib/kobject_uevent.c
> +++ b/lib/kobject_uevent.c
> @@ -195,12 +195,12 @@ int kobject_synth_uevent(struct kobject *kobj, const 
> char *buf, size_t count)
> enum kobject_action action;
> const char *action_args;
> struct kobj_uevent_env *env;
> -   const char *msg = NULL, *devpath;
> +   const char *msg = NULL;
> int r;
>
> r = kobject_action_type(buf, count, , _args);
> if (r) {
> -   msg = "unknown uevent action string\n";
> +   msg = "unknown uevent action string";
> goto out;
> }
>
> @@ -212,7 +212,7 @@ int kobject_synth_uevent(struct kobject *kobj, const char 
> *buf, size_t count)
> r = kobject_action_args(action_args,
> count - (action_args - buf), );
> if (r == -EINVAL) {
> -   msg = "incorrect uevent action arguments\n";
> +   msg = "incorrect uevent action arguments";

Second, this change is not what the changelog of your patch is talking about.

> goto out;
> }
>
> @@ -223,8 +223,9 @@ int kobject_synth_uevent(struct kobject *kobj, const char 
> *buf, size_t count)
> kfree(env);
>  out:
> if (r) {
> -   devpath = kobject_get_path(kobj, GFP_KERNEL);
> -   printk(KERN_WARNING "synth uevent: %s: %s",
> +   char *devpath = kobject_get_path(kobj, GFP_KERNEL);
> +
> +   pr_warn("synth uevent: %s: %s\n",
>devpath ?: "unknown device",
>msg ?: "failed to send uevent");
> kfree(devpath);
> --
> 2.11.0
>


[RFC/RFT][PATCH v3] cpuidle: New timer events oriented governor for tickless systems

2018-11-04 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some thigns that are quite
questionable in my view.  First, it includes timer wakeups in
the pattern detection data and mixes them up with wakeups from
other sources which in some cases causes it to expect what
essentially would be a timer wakeup in a time frame in which
no timer wakeups are possible (becuase it knows the time until
the next timer event and that is later than the expected wakeup
time).  Second, it uses the extra exit latency limit based on
the predicted idle duration and depending on the number of tasks
waiting on I/O, even though those tasks may run on a different
CPU when they are woken up.  Moreover, the time ranges used by it
for the sleep length correction factors depend on whether or not
there are tasks waiting on I/O, which again doesn't imply anything
in particular, and they are not correlated to the list of available
idle states in any way whatever.  Also,  the pattern detection code
in menu may end up considering values that are too large to matter
at all, in which cases running it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.  First, it
doesn't use "correction factors" for the time till the closest timer,
but instead it tries to correlate the measured idle duration values
with the available idle states and use that information to pick up
the idle state that is most likely to "match" the upcoming CPU idle
interval.  Second, it doesn't take the number of "I/O waiters" into
account at all and the pattern detection code in it tries to avoid
taking timer wakeups into account.  It also only uses idle duration
values less than the current time till the closest timer (with the
tick excluded) for that purpose.

Signed-off-by: Rafael J. Wysocki 
---

v2 -> v3:
 * Simplify the pattern detection code and make it return a value
lower than the time to the closest timer if the majority of recent
idle intervals are below it regardless of their variance (that should
cause it to be slightly more aggressive).
 * Do not count wakeups from state 0 due to the time limit in poll_idle()
   as non-timer.

Note: I will be mostly offline tomorrow, so this goes slightly early.
I have tested it only very lightly, but it is not so much different from
the previous one.

It requires the same additional patches to apply as the previous one too.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  491 +
 3 files changed, 503 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,491 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup sources are more important in some use cases and
+ * they can be covered by detecting patterns among recent idle time intervals
+ * of the CPU.  However, even in that case it is not necessary to take idle
+ * duration values greater than the time till the closest timer into account, 
as
+ * the patterns that they may belong to produce average values close enough to
+ * the time till the closest timer (sleep length) anyway.
+ *
+ * Thus this governor estimates whether or not the upcoming idle time of the 
CPU
+ * is likely to be significantly shorter than the sleep length and selects an
+ * idle state for it in accordance with that, as follows:
+ *
+ * - If there is a pattern of 5 or more recent non-timer wakeups earlier than
+ *   the closest timer event, expect 

Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

2018-11-04 Thread Rafael J. Wysocki
On Friday, November 2, 2018 4:39:42 PM CET Doug Smythies wrote:
> On 2018.10.26 02:12 Rafael J. Wysocki wrote:
> 
> ...[snip]...

Again, thanks a lot for the feedback, it is appreciated very much!

> > The v2 is a re-write of major parts of the original patch.
> >
> > The approach the same in general, but the details have changed significantly
> > with respect to the previous version.  In particular:
> > * The decay of the idle state metrics is implemented differently.
> > * There is a more "clever" pattern detection (sort of along the lines
> >   of what the menu does, but simplified quite a bit and trying to avoid
> >   including timer wakeups).
> > * The "promotion" from the "polling" state is gone.
> > * The "safety net" wakeups are treated as the CPU might have been idle
> >   until the closest timer.
> 
> ...[snip]...
> 
> I have been testing this V2 against a baseline that includes all
> of the pending menu patches. My baseline kernel is somewhere
> after 4.19, at 345671e.
> 
> A side note:
> Recall that with the menu patch set tests, I found that the baseline
> reference performance for the pipe test on one core had changed
> significantly (worse - Kernel 4.19-rc1). Well, now it has changed
> significantly again (better, and even significantly better than it
> was for 4.18). 4.18 ~4.8 uSec/loop; 4.19 ~5.2 uSec/loop; 4.19+
> (345671e) 4.2 uSec/loop.
> 
> This V2 is pretty good.

That's awesome!

> All of the tests that I run gave similar
> performance and power use between the baseline reference and V2.
> I couldn't find any issues with the decay stuff, and I tried.
> (sorry, I didn't do pretty graphs.)
> 
> After reading Giovanni's reply the other day, I tried the
> Phoronix dbench test: 12 clients resulted in similar performance,
> But TEOv2 used a little less processor package power; 256 clients
> had about -7% performance using TEOv2, but (my numbers are not
> exact) also used less processor package power.

Good to know, thank you!

> On 2018.10.31 11:36 Giovanni Gherdovich wrote:
> 
> > Something I'd like to do now is verify that "teo"'s predictions
> > are better than "menu"'s; I'll probably use systemtap to make
> > some histograms of idle times versus what idle state was chosen
> > -- that'd be enough to compare the two.
> 
> I don't know what a "systemtap" is, but I have (crude) tools to
> post process trace data into histograms data. I did 5 minute
> traces during the 12 client Phoronix dbench test and plotted
> the results, [1]. Sometimes, to the right of the autoscaled
> graph is another with fixed scaling. Better grouping of idle
> durations with TEOv2 are clearly visible.
> 
> ... Doug
> 
> [1] http://fast.smythies.com/linux-pm/k419p/histo_compare.htm

Thanks for the graphs.  At least they show the consistent underestimation of
the idle duration in menu if I'm not mistaken.

Cheers,
Rafael



Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

2018-11-04 Thread Rafael J. Wysocki
On Wednesday, October 31, 2018 7:36:21 PM CET Giovanni Gherdovich wrote:
> On Fri, 2018-10-26 at 11:12 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 

[cut]

> 
> Hello Rafael,

Hi Giovanni,

First off, many thanks for doing this work, it is very very much appreciated!

> your new governor has a neutral impact on performance, as you expected. This 
> is
> a positive result, since the purpose of "teo" is to give improved
> predictions on idle times without regressing on the performance side.

Right.

> There are swings here and there but nothing looks extremely bad. v2 is largely
> equivalent to v1 in my tests, except for sockperf and netperf on the
> Haswell machine (v2 slightly worse) and tbench on the Skylake machine
> (again v2 slightly worse).

Thanks for the data.

I have some ideas on what may be the difference between the v1 and the v2 on
these machines, more about that below.

> I've tested your patches applying them on v4.18 (plus the backport
> necessary for v2 as Doug helpfully noted), just because it was the latest
> release when I started preparing this.
> 
> I've tested it on three machines, with different generations of Intel CPUs:
> 
> * single socket E3-1240 v5 (Skylake 8 cores, which I'll call 8x-SKYLAKE-UMA)
> * two sockets E5-2698 v4 (Broadwell 80 cores, 80x-BROADWELL-NUMA from here 
> onwards)
> * two sockets E5-2670 v3 (Haswell 48 cores, 48x-HASWELL-NUMA from here 
> onwards)
> 
> 
> BENCHMARKS WITH NEUTRAL RESULTS
> ===
> 
> These are the workloads where no noticeable difference is measured (on both
> v1 and v2, all machines), together with the corresponding MMTests[1]
> configuration file name:
> 
> * pgbench read-only on xfs, pgbench read/write on xfs
>   * global-dhp__db-pgbench-timed-ro-small-xfs
>   * global-dhp__db-pgbench-timed-rw-small-xfs
> * siege
>   * global-dhp__http-siege
> * hackbench, pipetest
>   * global-dhp__scheduler-unbound
> * Linux kernel compilation
>   * global-dhp__workload_kerndevel-xfs
> * NASA Parallel Benchmarks, C-Class (linear algebra; run both with OpenMP
>   and OpenMPI, over xfs)
>   * global-dhp__nas-c-class-mpi-full-xfs
>   * global-dhp__nas-c-class-omp-full
> * FIO (Flexible IO) in several configurations
>   * global-dhp__io-fio-randread-async-randwrite-xfs
>   * global-dhp__io-fio-randread-async-seqwrite-xfs
>   * global-dhp__io-fio-seqread-doublemem-32k-4t-xfs
>   * global-dhp__io-fio-seqread-doublemem-4k-4t-xfs
> * netperf on loopback over TCP
>   * global-dhp__network-netperf-unbound

The above is great to know.

> BENCHMARKS WITH NON-NEUTRAL RESULTS: OVERVIEW
> =
> 
> These are benchmarks which exhibit a variation in their performance;
> you'll see the magnitude of the changes is moderate and it's highly variable
> from machine to machine. All percentages refer to the v4.18 baseline. In
> more than one case the Haswell machine seems to prefer v1 to v2.
> 
> * xfsrepair
>   * global-dhp__io-xfsrepair-xfs
> 
>   teo-v1  teo-v2
>   -
>   8x-SKYLAKE-UMA  2% worse2% worse
>   80x-BROADWELL-NUMA  1% worse1% worse
>   48x-HASWELL-NUMA1% worse1% worse
> 
> * sqlite (insert operations on xfs)
>   * global-dhp__db-sqlite-insert-medium-xfs
> 
>   teo-v1  teo-v2
>   -
>   8x-SKYLAKE-UMA  no change   no change
>   80x-BROADWELL-NUMA  2% worse3% worse
>   48x-HASWELL-NUMAno change   no change
> 
> * netperf on loopback over UDP
>   * global-dhp__network-netperf-unbound
> 
>   teo-v1  teo-v2
>   -
>   8x-SKYLAKE-UMA  no change   6% worse
>   80x-BROADWELL-NUMA  1% worse4% worse
>   48x-HASWELL-NUMA3% better   5% worse
> 
> * sockperf on loopback over TCP, mode "under load"
>   * global-dhp__network-sockperf-unbound
> 
>   teo-v1  teo-v2
>   -
>   8x-SKYLAKE-UMA  6% worseno change
>   80x-BROADWELL-NUMA  7% better   no change
>   48x-HASWELL-NUMA3% better   2% worse
> 
> *

[GIT PULL] More ACPI updates for v4.20-rc1

2018-10-30 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.20-rc1-2

with top-most commit 6a9b593d4b6f5994209456de7a3c2db0974b5dda

 ACPI / PMIC: xpower: Add depends on IOSF_MBI to Kconfig entry

on top of commit bd6bf7c10484f026505814b690104cdef27ed460

 Merge tag 'pci-v4.20-changes' of
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

to receive additional ACPI updates for 4.20-rc1.

These rework the handling of the P-unit semaphore on Intel
Baytrail and Cherrytrail systems to avoid race conditions and
excessive overhead related to it (Hans de Goede).

There was a merge conflict between this material and the i2c
tree (merged now) in linux-next that should be straightforward
to resolve [1].

Thanks!

[1] https://lore.kernel.org/lkml/20181029130925.60ea7...@canb.auug.org.au/


---

Hans de Goede (4):
  x86: baytrail/cherrytrail: Rework and move P-Unit PMIC bus semaphore code
  ACPI / PMIC: xpower: Block P-Unit I2C access during read-modify-write
  i2c: designware: Cleanup bus lock handling
  ACPI / PMIC: xpower: Add depends on IOSF_MBI to Kconfig entry

---

 arch/x86/include/asm/iosf_mbi.h  |  39 +++--
 arch/x86/platform/intel/iosf_mbi.c   | 217 ---
 drivers/acpi/Kconfig |   2 +-
 drivers/acpi/pmic/intel_pmic_xpower.c|  21 ++-
 drivers/i2c/busses/i2c-designware-baytrail.c | 139 +
 drivers/i2c/busses/i2c-designware-common.c   |   4 +-
 drivers/i2c/busses/i2c-designware-core.h |   9 +-
 drivers/i2c/busses/i2c-designware-platdrv.c  |   2 -
 8 files changed, 250 insertions(+), 183 deletions(-)


[GIT PULL] More power management updates for v4.20-rc1

2018-10-30 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 pm-4.20-rc1-2

with top-most commit c4ac6889930d027ffa5cf77e0c202e7e97a4be06

 Merge branches 'pm-cpuidle' and 'pm-cpufreq'

on top of commit bd6bf7c10484f026505814b690104cdef27ed460

 Merge tag 'pci-v4.20-changes' of
git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci

to receive additional power management updates for 4.20-rc1.

These remove a questionable heuristic from the menu cpuidle governor,
fix a recent build regression in the intel_pstate driver, clean up
ARM big-Little support in cpufreq and fix up hung task watchdog's
interaction with system-wide power management transitions.

Specifics:

 - Fix build regression in the intel_pstate driver that doesn't
   build without CONFIG_ACPI after recent changes (Dominik Brodowski).

 - One of the heuristics in the menu cpuidle governor is based on a
   function returning 0 most of the time, so drop it and clean up
   the scheduler code related to it (Daniel Lezcano).

 - Prevent the arm_big_little cpufreq driver from being used on ARM64
   which is not suitable for it and drop the arm_big_little_dt driver
   that is not used any more (Sudeep Holla).

 - Prevent the hung task watchdog from triggering during resume from
   system-wide sleep states by disabling it before freezing tasks and
   enabling it again after they have been thawed (Vitaly Kuznetsov).

There is a merge conflict between the menu governor changes included
here and the (already merged) material from Andrew, but it should be
straightforward to resolve [1].

Thanks!

[1] https://lore.kernel.org/lkml/20181029130124.29034...@canb.auug.org.au/

---

Daniel Lezcano (2):
  sched: Factor out nr_iowait and nr_iowait_cpu
  cpuidle: menu: Remove get_loadavg() from the performance multiplier

Dominik Brodowski (1):
  cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI

Sudeep Holla (2):
  cpufreq: drop ARM_BIG_LITTLE_CPUFREQ support for ARM64
  cpufreq: remove unused arm_big_little_dt driver

Vitaly Kuznetsov (1):
  kernel: hung_task.c: disable on suspend

---

 MAINTAINERS |   1 -
 drivers/cpufreq/Kconfig.arm |   9 +---
 drivers/cpufreq/Makefile|   3 --
 drivers/cpufreq/arm_big_little_dt.c | 100 
 drivers/cpufreq/intel_pstate.c  |  20 
 drivers/cpuidle/governors/menu.c|  25 +++--
 include/linux/sched/stat.h  |   1 -
 kernel/hung_task.c  |  30 ++-
 kernel/sched/core.c |  34 +---
 9 files changed, 60 insertions(+), 163 deletions(-)


Re: linux-next: manual merge of the pm tree with Linus' tree

2018-10-30 Thread Rafael J. Wysocki
On Monday, October 29, 2018 3:01:24 AM CET Stephen Rothwell wrote:
> 
> --Sig_/X_/swfdki+BZTJ8afk_lQqW
> Content-Type: text/plain; charset=US-ASCII
> Content-Transfer-Encoding: quoted-printable
> 
> Hi Rafael,
> 
> Today's linux-next merge of the pm tree got a conflict in:
> 
>   drivers/cpuidle/governors/menu.c
> 
> between commit:
> 
>   8508cf3ffad4 ("sched: loadavg: consolidate LOAD_INT, LOAD_FRAC, CALC_LOAD=
> ")
> 
> from Linus' tree and commit:
> 
>   a7fe5190c03f ("cpuidle: menu: Remove get_loadavg() from the performance m=
> ultiplier")
> 
> from the pm tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> --=20
> Cheers,
> Stephen Rothwell
> 
> diff --cc drivers/cpuidle/governors/menu.c
> index 71979605246e,76df4f947f07..
> --- a/drivers/cpuidle/governors/menu.c
> +++ b/drivers/cpuidle/governors/menu.c
> @@@ -130,11 -130,10 +130,6 @@@ struct menu_device=20
>   int interval_ptr;
>   };
>  =20
> - static inline int get_loadavg(unsigned long load)
> - {
> - return LOAD_INT(load) * 10 + LOAD_FRAC(load) / 10;
> - }
>  -
>  -#define LOAD_INT(x) ((x) >> FSHIFT)
>  -#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
> --
>   static inline int which_bucket(unsigned int duration, unsigned long nr_io=
> waiters)
>   {
>   int bucket =3D 0;
> 
> --Sig_/X_/swfdki+BZTJ8afk_lQqW
> Content-Type: application/pgp-signature
> Content-Description: OpenPGP digital signature
> 
> 
> --Sig_/X_/swfdki+BZTJ8afk_lQqW--
> 

The fix looks good to me, thanks!

Cheers,
Rafael




Re: linux-next: manual merge of the pm tree with the i2c tree

2018-10-30 Thread Rafael J. Wysocki
On Monday, October 29, 2018 3:09:25 AM CET Stephen Rothwell wrote:
> 
> --Sig_/Yi5UDAd5LU=QdOHo+ui7Syk
> Content-Type: text/plain; charset=US-ASCII
> Content-Transfer-Encoding: quoted-printable
> 
> Hi Rafael,
> 
> Today's linux-next merge of the pm tree got conflicts in:
> 
>   drivers/i2c/busses/i2c-designware-baytrail.c
>   drivers/i2c/busses/i2c-designware-core.h
> 
> between commit:
> 
>   9cbeeca05049 ("i2c: designware: Remove Cherry Trail PMIC I2C bus pm_disab=
> led workaround")
> 
> from the i2c tree and commit:
> 
>   8afb46804dfa ("i2c: designware: Cleanup bus lock handling")
> 
> from the pm tree.
> 
> I fixed it up (see below) and can carry the fix as necessary. This
> is now fixed as far as linux-next is concerned, but any non trivial
> conflicts should be mentioned to your upstream maintainer when your tree
> is submitted for merging.  You may also want to consider cooperating
> with the maintainer of the conflicting tree to minimise any particularly
> complex conflicts.
> 
> --=20
> Cheers,
> Stephen Rothwell
> 
> diff --cc drivers/i2c/busses/i2c-designware-baytrail.c
> index 9ca1feaba98f,971b5cde7a93..
> --- a/drivers/i2c/busses/i2c-designware-baytrail.c
> +++ b/drivers/i2c/busses/i2c-designware-baytrail.c
> @@@ -162,18 -36,9 +36,9 @@@ int i2c_dw_probe_lock_support(struct dw
>   return -EPROBE_DEFER;
>  =20
>   dev_info(dev->dev, "I2C bus managed by PUNIT\n");
> - dev->acquire_lock =3D baytrail_i2c_acquire;
> - dev->release_lock =3D baytrail_i2c_release;
> + dev->acquire_lock =3D iosf_mbi_block_punit_i2c_access;
> + dev->release_lock =3D iosf_mbi_unblock_punit_i2c_access;
>  -dev->pm_disabled =3D true;
>  +dev->shared_with_punit =3D true;
>  =20
> - pm_qos_add_request(>pm_qos, PM_QOS_CPU_DMA_LATENCY,
> -PM_QOS_DEFAULT_VALUE);
> -=20
>   return 0;
>   }
> -=20
> - void i2c_dw_remove_lock_support(struct dw_i2c_dev *dev)
> - {
> - if (dev->acquire_lock)
> - pm_qos_remove_request(>pm_qos);
> - }
> diff --cc drivers/i2c/busses/i2c-designware-core.h
> index 9ec8394f4787,152bf56d8404..
> --- a/drivers/i2c/busses/i2c-designware-core.h
> +++ b/drivers/i2c/busses/i2c-designware-core.h
> @@@ -209,10 -208,9 +208,9 @@@
>* @fp_lcnt: fast plus LCNT value
>* @hs_hcnt: high speed HCNT value
>* @hs_lcnt: high speed LCNT value
> -  * @pm_qos: pm_qos_request used while holding a hardware lock on the bus
>* @acquire_lock: function to acquire a hardware lock on the bus
>* @release_lock: function to release a hardware lock on the bus
>  - * @pm_disabled: true if power-management should be disabled for this i2c=
> -bus
>  + * @shared_with_punit: true if this bus is shared with the SoCs PUNIT
>* @disable: function to disable the controller
>* @disable_int: function to disable all interrupts
>* @init: function to initialize the I2C hardware
> @@@ -263,10 -260,9 +261,9 @@@ struct dw_i2c_dev=20
>   u16 fp_lcnt;
>   u16 hs_hcnt;
>   u16 hs_lcnt;
> - struct pm_qos_request   pm_qos;
> - int (*acquire_lock)(struct dw_i2c_dev *dev);
> - void(*release_lock)(struct dw_i2c_dev *dev);
> + int (*acquire_lock)(void);
> + void(*release_lock)(void);
>  -boolpm_disabled;
>  +boolshared_with_punit;
>   void(*disable)(struct dw_i2c_dev *dev);
>   void(*disable_int)(struct dw_i2c_dev *dev);
>   int (*init)(struct dw_i2c_dev *dev);
> 
> --Sig_/Yi5UDAd5LU=QdOHo+ui7Syk
> Content-Type: application/pgp-signature
> Content-Description: OpenPGP digital signature
> 
> 
> --Sig_/Yi5UDAd5LU=QdOHo+ui7Syk--
> 

The fix looks good to me, thanks!

Cheers,
Rafael




Re: [RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

2018-10-30 Thread Rafael J. Wysocki
On Saturday, October 27, 2018 8:37:24 AM CET Doug Smythies wrote:
> This is just for anybody else trying to compile:
> 
> On 2018.10.26 02:12 Rafael J. Wysocki wrote:
> 
> > The venerable menu governor does some thigns that are quite
> 
> Typo: thigns -> things
> 
> ...[snip]...
> 
> > The patch should apply on top of 4.19, although I'm running it on
> > top of my linux-next branch.
> 
> No, it uses "poll_time_limit" which was introduced in patch 1 of 6
> [1] in that group of menu changes from October 2nd.
> 
> "[PATCH 1/6] cpuidle: menu: Fix wakeup statistics updates for polling state"

Right, sorry for missing that.

Thanks,
Rafael



Re: [REGRESSION 4.19-rc2] sometimes hangs with black screen when resuming from suspend or hibernation (was: Re: Linux 4.19-rc2)

2018-10-30 Thread Rafael J. Wysocki
On Fri, Oct 26, 2018 at 5:49 PM Martin Steigerwald  wrote:
>
> This regression is gone with 4.19-rc8.

Thanks for the update!


> Martin Steigerwald - 11.09.18, 09:53:
> […]
> > Linus Torvalds - 02.09.18, 23:45:
> > > As usual, the rc2 release is pretty small. People are taking a
> >
> > With 4.19-rc2 this ThinkPad T520 with i5 Sandybrdige sometimes hangs
> > with black screen when resuming from suspend or hibernation.  With
> > 4.18.1 it did not. Of course there have been userspace related updates
> > that could be related.
> >
> > I currently have no time to dig into this and on this production
> > laptop I generally do not do bisects between major kernel releases.
> > So currently I only answer questions that do not require much time to
> > answer.
> >
> > For now I switched back to 4.18. If that is stable – and thus likely
> > no userspace component is related –, I go with 4.19-rc3 or whatever
> > is most recent version to see if the issue has been fixed already.
> >
> > % inxi -z -b -G
> > System:Host: […] Kernel: 4.18.1-tp520-btrfstrim x86_64 bits: 64
> > Desktop: KDE Plasma 5.13.5
> >Distro: Debian GNU/Linux buster/sid
> > Machine:   Type: Laptop System: LENOVO product: 42433WG v: ThinkPad
> > T520 serial: 
> >Mobo: LENOVO model: 42433WG serial:  UEFI [Legacy]:
> > LENOVO v: 8AET69WW (1.49 )
> >date: 06/14/2018
> > […]
> > CPU:   Dual Core: Intel Core i5-2520M type: MT MCP speed: 2990 MHz
> > min/max: 800/3200 MHz
> > Graphics:  Device-1: Intel 2nd Generation Core Processor Family
> > Integrated Graphics driver: i915 v: kernel
> >Display: x11 server: X.Org 1.20.1 driver: modesetting
> > resolution: 1920x1080~60Hz
> >OpenGL: renderer: Mesa DRI Intel Sandybridge Mobile v: 3.3
> > Mesa 18.1.7
> > […]
> > Info:  Processes: 322 Uptime: 16m Memory: 15.45 GiB used: 3.12 GiB
> > (20.2%) Shell: zsh inxi: 3.0.22
> >
> > Thanks,
> > Martin
> >
> > > breather after the merge window, and it takes a bit of time for bug
> > > reports to start coming in and get identified.  Plus people were
> > > probably still on vacation (particularly Europe), and some people
> > > were at Open Source Summit NA last week too. Having a calm week was
> > > good.
> > >
> > > Regardless of the reason, it's pretty quiet/ The bulk of it is
> > > drivers (network and gpu stand out), with the rest being a random
> > > collection all over (arch/x86 and generic networking stands out,
> > > but there's misc stuff all over).
> > >
> > > Go out and test.
> > >
> > >  Linus
> > >
> > > ---
> […]
> --
> Martin
>
>


Re: [PATCH v2] cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI

2018-10-26 Thread Rafael J. Wysocki
On Tuesday, October 23, 2018 11:52:12 PM CEST Srinivas Pandruvada wrote:
> On Tue, 2018-10-23 at 21:54 +0200, Dominik Brodowski wrote:
> > While at it, add a few comments which config options #ifdef
> > and #else statements refer to.
> > 
> > Fixes: 86d333a8cc7f ("cpufreq: intel_pstate: Add base_frequency
> > attribute")
> > Cc: Srinivas Pandruvada 
> > Cc: Rafael J. Wysocki 
> > Signed-off-by: Dominik Brodowski 
> Acked-by: Srinivas Pandruvada 
> 
> > 
> > diff --git a/drivers/cpufreq/intel_pstate.c
> > b/drivers/cpufreq/intel_pstate.c
> > index 49c0abf2d48f..9578312e43f2 100644
> > --- a/drivers/cpufreq/intel_pstate.c
> > +++ b/drivers/cpufreq/intel_pstate.c
> > @@ -386,16 +386,11 @@ static int intel_pstate_get_cppc_guranteed(int
> > cpu)
> > return cppc_perf.guaranteed_perf;
> >  }
> >  
> > -#else
> > +#else /* CONFIG_ACPI_CPPC_LIB */
> >  static void intel_pstate_set_itmt_prio(int cpu)
> >  {
> >  }
> > -
> > -static int intel_pstate_get_cppc_guranteed(int cpu)
> > -{
> > -   return -ENOTSUPP;
> > -}
> > -#endif
> > +#endif /* CONFIG_ACPI_CPPC_LIB */
> >  
> >  static void intel_pstate_init_acpi_perf_limits(struct cpufreq_policy
> > *policy)
> >  {
> > @@ -477,7 +472,7 @@ static void intel_pstate_exit_perf_limits(struct
> > cpufreq_policy *policy)
> >  
> > acpi_processor_unregister_performance(policy->cpu);
> >  }
> > -#else
> > +#else /* CONFIG_ACPI */
> >  static inline void intel_pstate_init_acpi_perf_limits(struct
> > cpufreq_policy *policy)
> >  {
> >  }
> > @@ -490,7 +485,14 @@ static inline bool
> > intel_pstate_acpi_pm_profile_server(void)
> >  {
> > return false;
> >  }
> > -#endif
> > +#endif /* CONFIG_ACPI */
> > +
> > +#ifndef CONFIG_ACPI_CPPC_LIB
> > +static int intel_pstate_get_cppc_guranteed(int cpu)
> > +{
> > +   return -ENOTSUPP;
> > +}
> > +#endif /* CONFIG_ACPI_CPPC_LIB */
> >  
> >  static inline void update_turbo_state(void)
> >  {
> 
> 

Patch applied, thanks!



Re: [PATCH v4] kernel/hung_task.c: disable on suspend

2018-10-26 Thread Rafael J. Wysocki
On Thursday, October 18, 2018 9:32:42 AM CEST Rafael J. Wysocki wrote:
> On Wed, Oct 17, 2018 at 1:24 PM Vitaly Kuznetsov  wrote:
> >
> > It is possible to observe hung_task complaints when system goes to
> > suspend-to-idle state:
> >
> >  # echo freeze > /sys/power/state
> >
> >  PM: Syncing filesystems ... done.
> >  Freezing user space processes ... (elapsed 0.001 seconds) done.
> >  OOM killer disabled.
> >  Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
> >  sd 0:0:0:0: [sda] Synchronizing SCSI cache
> >  INFO: task bash:1569 blocked for more than 120 seconds.
> >Not tainted 4.19.0-rc3_+ #687
> >  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> >  bashD0  1569604 0x
> >  Call Trace:
> >   ? __schedule+0x1fe/0x7e0
> >   schedule+0x28/0x80
> >   suspend_devices_and_enter+0x4ac/0x750
> >   pm_suspend+0x2c0/0x310
> >
> > Register a PM notifier to disable the detector on suspend and re-enable
> > back on wakeup.
> >
> > Signed-off-by: Vitaly Kuznetsov 
> 
> Thanks for your patience with this!
> 
> Are there any objections or concerns regarding this patch?

Seeing none, so applied.

Thanks,
Rafael



Re: [PATCH V2 1/2] sched: Factor out nr_iowait and nr_iowait_cpu

2018-10-26 Thread Rafael J. Wysocki
On Thursday, October 4, 2018 2:04:02 PM CEST Daniel Lezcano wrote:
> The function nr_iowait_cpu() can be used directly by nr_iowait() instead
> of duplicating code.
> 
> Call nr_iowait_cpu() from nr_iowait()
> 
> Reviewed-by: Rafael J. Wysocki 
> Signed-off-by: Daniel Lezcano 
> ---
>  kernel/sched/core.c | 41 -
>  1 file changed, 20 insertions(+), 21 deletions(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 625bc98..43efb74 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2875,6 +2875,25 @@ unsigned long long nr_context_switches(void)
>  }
>  
>  /*
> + * Consumers of these two interfaces, like for example the cpufreq menu
> + * governor are using nonsensical data. Boosting frequency for a CPU that has
> + * IO-wait which might not even end up running the task when it does become
> + * runnable.
> + */
> +
> +unsigned long nr_iowait_cpu(int cpu)
> +{
> + return atomic_read(_rq(cpu)->nr_iowait);
> +}
> +
> +void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
> +{
> + struct rq *rq = this_rq();
> + *nr_waiters = atomic_read(>nr_iowait);
> + *load = rq->load.weight;
> +}
> +
> +/*
>   * IO-wait accounting, and how its mostly bollocks (on SMP).
>   *
>   * The idea behind IO-wait account is to account the idle time that we could
> @@ -2909,31 +2928,11 @@ unsigned long nr_iowait(void)
>   unsigned long i, sum = 0;
>  
>   for_each_possible_cpu(i)
> - sum += atomic_read(_rq(i)->nr_iowait);
> + sum += nr_iowait_cpu(i);
>  
>   return sum;
>  }
>  
> -/*
> - * Consumers of these two interfaces, like for example the cpufreq menu
> - * governor are using nonsensical data. Boosting frequency for a CPU that has
> - * IO-wait which might not even end up running the task when it does become
> - * runnable.
> - */
> -
> -unsigned long nr_iowait_cpu(int cpu)
> -{
> - struct rq *this = cpu_rq(cpu);
> - return atomic_read(>nr_iowait);
> -}
> -
> -void get_iowait_load(unsigned long *nr_waiters, unsigned long *load)
> -{
> - struct rq *rq = this_rq();
> - *nr_waiters = atomic_read(>nr_iowait);
> - *load = rq->load.weight;
> -}
> -
>  #ifdef CONFIG_SMP
>  
>  /*
> 

Both [1-2/2] applied with Peter's ACKs, thanks!



Re: 4.18: early boot crash in thermal_cooling_device_destroy_sysfs

2018-10-26 Thread Rafael J. Wysocki
On Monday, October 22, 2018 8:37:25 PM CEST Randy Dunlap wrote:
> 
> On 8/16/18 2:33 PM, Randy Dunlap wrote:
> > Hi,
> > 
> > Sorry for the photo.  That's all I have available so far.
> > 
> > https://www.infradead.org/~rdunlap/doc/IMG_20180816_133254743_HDR.jpg
> > 
> > 
> > Does anyone recognize this?
> > 
> > This is an (older) Toshiba laptop.  The kernel .config is mostly an
> > allmodconfig with some DEBUG options disabled and other options enabled
> > so that it can boot without using an initramfs.  (and with COMPILE_TEST
> > disabled :)
> > 
> > 
> > The full kernel .config file is attached.
> > 
> > Thanks,
> > 
> 
> This is a result of CONFIG_DEBUG_TEST_DRIVER_REMOVE=y.
> [switch from 64-bit to 32-bit machine]
> 
> 
> When using CONFIG_DEBUG_VM=y, it BUGs at:
> [5.553603] [ cut here ]
> [5.553733] kernel BUG at arch/x86/mm/physaddr.c:75!
> [5.557788] invalid opcode:  [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [5.558738] CPU: 1 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7 #4
> [5.558738] Hardware name: Dell Inc. Inspiron 1318   
> /0C236D, BIOS A04 01/15/2009
> [5.558738] EIP: __phys_addr+0x40/0x90
> [5.558738] Code: 00 40 75 2e 8b 15 00 57 23 d5 85 d2 74 12 89 d9 c1 e9 0c 
> 39 ca 72 5b e8 2e ca ff ff 39 d8 75 4a 89 d8 5b 5d c3 8d 74 26 00 90 <0f> 0b 
> 8d b6 00 00 00 00 8b 0d 80 56 23 d5 8d 91 00 00 80 00 39 d0
> [5.558738] EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 00140011 EDX: 
> [5.558738] ESI: f489 EDI: d4a58d60 EBP: f40c1e0c ESP: f40c1e08
> [5.558738] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210a97
> [5.558738] CR0: 80050033 CR2:  CR3: 14cad000 CR4: 000406d0
> [5.558738] Call Trace:
> [5.558738]  kfree+0x1f/0x160
> [5.558738]  thermal_cooling_device_destroy_sysfs+0x11/0x20
> [5.558738]  thermal_cooling_device_unregister+0x168/0x180
> [5.558738]  acpi_pss_perf_exit.isra.4+0x32/0x50
> [5.558738]  acpi_processor_stop+0x4d/0x60
> [5.558738]  really_probe+0xa3/0x3e0
> [5.558738]  driver_probe_device+0x5b/0x120
> [5.558738]  __driver_attach+0xd9/0x100
> [5.558738]  ? driver_probe_device+0x120/0x120
> [5.558738]  bus_for_each_dev+0x56/0x90
> [5.558738]  driver_attach+0x14/0x20
> [5.558738]  ? driver_probe_device+0x120/0x120
> [5.558738]  bus_add_driver+0x117/0x210
> [5.558738]  driver_register+0x61/0xb0
> [5.558738]  acpi_processor_driver_init+0x19/0x88
> [5.558738]  ? acpi_pci_slot_init+0xf/0xf
> [5.558738]  do_one_initcall+0x3e/0x15a
> [5.558738]  ? do_early_param+0x75/0x75
> [5.558738]  kernel_init_freeable+0x170/0x1f3
> [5.558738]  ? rest_init+0xcd/0xcd
> [5.558738]  kernel_init+0x8/0xdb
> [5.558738]  ret_from_fork+0x2e/0x38
> [5.558738] Modules linked in:
> [5.625269] _warn_unseeded_randomness: 1 callbacks suppressed
> [5.625272] random: get_random_bytes called from init_oops_id+0x3a/0x40 
> with crng_init=0
> [5.629758] ---[ end trace 65b17bf4d18e7692 ]---
> [5.631573] EIP: __phys_addr+0x40/0x90
> [5.633242] Code: 00 40 75 2e 8b 15 00 57 23 d5 85 d2 74 12 89 d9 c1 e9 0c 
> 39 ca 72 5b e8 2e ca ff ff 39 d8 75 4a 89 d8 5b 5d c3 8d 74 26 00 90 <0f> 0b 
> 8d b6 00 00 00 00 8b 0d 80 56 23 d5 8d 91 00 00 80 00 39 d0
> [5.638618] EAX: 6b6b6b6b EBX: 6b6b6b6b ECX: 00140011 EDX: 
> [5.640703] ESI: f489 EDI: d4a58d60 EBP: f40c1e0c ESP: d4cb13dc
> [5.642801] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210a97
> [5.645053] CR0: 80050033 CR2:  CR3: 14cad000 CR4: 000406d0
> [5.647179] Kernel panic - not syncing: Fatal exception
> [5.648172] Kernel Offset: 0x1300 from 0xc100 (relocation range: 
> 0xc000-0xf77fdfff)
> [5.648172] ---[ end Kernel panic - not syncing: Fatal exception ]---
> 
> 
> When not using CONFIG_DEBUG_VM, it BUGs in kfree:
> [5.497864] [ cut here ]
> [5.498215] kernel BUG at mm/slub.c:3901!
> [5.501739] invalid opcode:  [#1] PREEMPT SMP
> [5.502720] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.19.0-rc7 #3
> [5.502720] Hardware name: Dell Inc. Inspiron 1318   
> /0C236D, BIOS A04 01/15/2009
> [5.502720] EIP: kfree+0x117/0x150
> [5.502720] Code: 74 21 8b 06 31 d2 f6 c4 80 74 04 0f b6 56 31 89 f0 e8 7d 
> e0 fa ff e9 7b ff ff ff 8d b4 26 00 00 00 00 90 8b 46 04 a8 01 75 d8 <0f> 0b 
> 8d b4 26 00 00 00 00 8b 75 f0 ff 75 ec 89 d9 89 f8 6a 01 53
> [5.502720] EAX: 0100 EBX: 6b6b6b6b ECX: 00140011 EDX: 
> [5.502720] ESI: f67dac70 EDI: ccc4aca0 EBP: f4083e28 ESP: f4083e10
> [5.502720] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210246
> [5.502720] CR0: 80050033 CR2: ffd14000 CR3: 0ce94000 CR4: 000406d0
> [5.502720] Call Trace:
> [5.502720]  thermal_cooling_device_destroy_sysfs+0x11/0x20
> [5.502720]  thermal_cooling_device_unregister+0x168/0x180
> [5.502720]  

[RFC/RFT][PATCH v2] cpuidle: New timer events oriented governor for tickless systems

2018-10-26 Thread Rafael J. Wysocki
From: Rafael J. Wysocki 

The venerable menu governor does some thigns that are quite
questionable in my view.  First, it includes timer wakeups in
the pattern detection data and mixes them up with wakeups from
other sources which in some cases causes it to expect what
essentially would be a timer wakeup in a time frame in which
no timer wakeups are possible (becuase it knows the time until
the next timer event and that is later than the expected wakeup
time).  Second, it uses the extra exit latency limit based on
the predicted idle duration and depending on the number of tasks
waiting on I/O, even though those tasks may run on a different
CPU when they are woken up.  Moreover, the time ranges used by it
for the sleep length correction factors depend on whether or not
there are tasks waiting on I/O, which again doesn't imply anything
in particular, and they are not correlated to the list of available
idle states in any way whatever.  Also,  the pattern detection code
in menu may end up considering values that are too large to matter
at all, in which cases running it is a waste of time.

A major rework of the menu governor would be required to address
these issues and the performance of at least some workloads (tuned
specifically to the current behavior of the menu governor) is likely
to suffer from that.  It is thus better to introduce an entirely new
governor without them and let everybody use the governor that works
better with their actual workloads.

The new governor introduced here, the timer events oriented (TEO)
governor, uses the same basic strategy as menu: it always tries to
find the deepest idle state that can be used in the given conditions.
However, it applies a different approach to that problem.  First, it
doesn't use "correction factors" for the time till the closest timer,
but instead it tries to correlate the measured idle duration values
with the available idle states and use that information to pick up
the idle state that is most likely to "match" the upcoming CPU idle
interval.  Second, it doesn't take the number of "I/O waiters" into
account at all and the pattern detection code in it tries to avoid
taking timer wakeups into account.  It also only uses idle duration
values less than the current time till the closest timer (with the
tick excluded) for that purpose.

Signed-off-by: Rafael J. Wysocki 
---

The v2 is a re-write of major parts of the original patch.

The approach the same in general, but the details have changed significantly
with respect to the previous version.  In particular:
* The decay of the idle state metrics is implemented differently.
* There is a more "clever" pattern detection (sort of along the lines
  of what the menu does, but simplified quite a bit and trying to avoid
  including timer wakeups).
* The "promotion" from the "polling" state is gone.
* The "safety net" wakeups are treated as the CPU might have been idle
  until the closest timer.

I'm running this governor on all of my systems now without any
visible adverse effects.

Overall, it selects deeper idle states more often than menu on average, but
that doesn't seem to make a significant difference in the majority of cases.

In this preliminary revision it overtakes menu as the default governor
for tickless systems (due to the higher rating), but that is likely
to change going forward.  At this point I'm mostly asking for feedback
and possibly testing with whatever workloads you can throw at it.

The patch should apply on top of 4.19, although I'm running it on
top of my linux-next branch.  This version hasn't been run through
benchmarks yet and that likely will take some time as I will be
traveling quite a bit during the next few weeks.

---
 drivers/cpuidle/Kconfig|   11 
 drivers/cpuidle/governors/Makefile |1 
 drivers/cpuidle/governors/teo.c|  491 +
 3 files changed, 503 insertions(+)

Index: linux-pm/drivers/cpuidle/governors/teo.c
===
--- /dev/null
+++ linux-pm/drivers/cpuidle/governors/teo.c
@@ -0,0 +1,491 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Timer events oriented CPU idle governor
+ *
+ * Copyright (C) 2018 Intel Corporation
+ * Author: Rafael J. Wysocki 
+ *
+ * The idea of this governor is based on the observation that on many systems
+ * timer events are two or more orders of magnitude more frequent than any
+ * other interrupts, so they are likely to be the most significant source of 
CPU
+ * wakeups from idle states.  Moreover, information about what happened in the
+ * (relatively recent) past can be used to estimate whether or not the deepest
+ * idle state with target residency within the time to the closest timer is
+ * likely to be suitable for the upcoming idle time of the CPU and, if not, 
then
+ * which of the shallower idle states to choose.
+ *
+ * Of course, non-timer wakeup so

Re: [RFC/RFT/[PATCH] cpuidle: New timer events oriented governor for tickless systems

2018-10-26 Thread Rafael J. Wysocki
Hi Giovanni,

On Mon, Oct 22, 2018 at 10:51 AM Giovanni Gherdovich
 wrote:
>
> Hello Rafael,
>
> I ran some benchmarks and will send you a detailed report later;

Thanks a lot, much appreciated!

Even though I'm about to send a v2 of the $subject patch which is a
complete rewrite of some parts of the code, results from the previous
one will be good to see anyway.

> for now I have some questions to make sure I understand the code.
>
> First off, I like your algorithm and you make an excellent job at
> documenting it with comments. Since it's a heuristic, it's not "obvious"
> code and comments help a lot.
>
> On Thu, 2018-10-11 at 23:01 +0200, Rafael J. Wysocki wrote:
> > From: Rafael J. Wysocki 
> >

[cut]

> > +static void teo_update(struct cpuidle_driver *drv,
> > +struct cpuidle_device *dev)
> > +{
> > + struct teo_cpu *cpu_data = per_cpu_ptr(_cpus, dev->cpu);
> > + unsigned int measured_us = dev->last_residency;
>
> question: how reliable is the value "dev->last_residency"? does it come
> "from the hardware" somehow? I'm asking because I'm more familiar with
> CPUfreq than CPUidle, and there you have a bit of a disconnect between what
> the OS think and what the hardware actually does. So I wonder if there is a
> similar situation with the last_residency value, and ask if it comes from
> some sort of feedback from the processor itself (if that makes any
> sense). I see it's written to in the cpuidle_enter_state() function in
> cpuidle.c, but I couldn't clearly understand that function either.

It is measured by the kernel, so it is not very precise, but it should
be precise enough for idle duration estimation on average.

> > + int i = cpu_data->last_state;
> > +
> > + if (measured_us >= 2 * drv->states[i].exit_latency)
> > + measured_us -= drv->states[i].exit_latency;
> > + else
> > + measured_us /= 2;
> > +
> > + /* Find the state matching the measured idle duration. */
> > + for (i = drv->state_count - 1; i > 0; i--) {
> > + if (drv->states[i].target_residency <= measured_us)
> > + break;
> > + }
>
> Wait, I'm lost here. You just initialized "int i = cpu_data->last_state;"
> ten lines above, and here you're computing "given how much we idled, what
> would have been the best state to enter?". Was the initialization of i a
> mistake? what's the intention?

It's intentional, but just because of the measured_us computation
earlier that would need to refer to
drv->states[cpu_data->last_state].exit_latency twice.  "i" serves as
auxiliary index storage for that.

> > +
> > + cpu_data->recent_wakeups++;
> > +
> > + /*
> > +  * If the state matches the time till the closest timer event used
> > +  * during the most recent state selection too, it would be sufficient
> > +  * to use that time alone for the idle state selection, so increment
> > +  * the "hits" number for the matching state and clear the "early 
> > series"
> > +  * data.
> > +  *
> > +  * Note that sleep_length_us cannot be less than measured_us (or it
> > +  * would mean a late timer event), so it is sufficient to check the
> > +  * lower boundary here.
> > +  */
> > + if (i == drv->state_count - 1 ||
> > + drv->states[i+1].target_residency > cpu_data->sleep_length_us)
> > {
>
> Correct me if I'm wrong: we know that states[i+1].target_residency > 
> measured_us,
> because that's just how we've chosen i. We know that sleep_length_us >= 
> measured_us,
> because that's how timers work.
>
> Now, the naive way if we're in a "hit" situation would be to check is
> measured_us is equal (or, well, "close") to sleep_length_us. If, on the
> other side, measured_us is a lot smaller than sleep_length_us, it's an
> "early". But you don't check it that way, what you do is to see if
> target_residency[i+1] is in between measured_us and sleep_length_us or
> not. Like this (sorry for the sloppy notation:
>
> target_residency[i] < measured_us <= sleep_length_us < target_residency[i+1]
>
> and that would be a hit. Otherwise:
>
> target_residency[i] < measured_us < target_residency[i+1] <= sleep_length_us
>
> and that's an "early". Right?

The idea here is for "early" to mean "so much earlier than it makes a
real difference", where "a real difference" is when different idle
states need to be used in both cases.

[GIT PULL] ACPI updates for v4.20-rc1

2018-10-22 Thread Rafael J. Wysocki
Hi Linus,

Please pull from the tag

 git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm.git \
 acpi-4.20-rc1

with top-most commit 0a1875ad29ef6436da3c184494b509697c8e0cf7

 Merge branches 'acpi-property' and 'acpi-sbs'

on top of commit 35a7f35ad1b150ddf59a41dcac7b2fa32982be0e

 Linux 4.19-rc8

to receive ACPI updates for 4.20-rc1.

These fix ACPICA issues related to the handling of module-level AML,
fix an ordering issue during ACPI initialization, update ACPICA to
upstream revision 20181003 (including fixes mostly), fix issues with
system-wide suspend/resume related to the ACPI driver for Intel
SoCs (LPSS), fix device enumeration issues on boards with Dollar
Cove or Whiskey Cove Intel PMICs, prevent ACPICA from calling
ktime_get() in unsuitable conditions, update a few drivers and
clean up some code in several places.

Specifics:

 - Fix ACPICA issues related to the handling of module-level AML
   and make the ACPI initialization code parse ECDT before loading
   the definition block tables (Erik Schmauss).

 - Update ACPICA to upstream revision 20181003 including fixes
   related to the ill-defined "generic serial bus" and the handling
   of the _REG object (Bob Moore).

 - Fix some issues with system-wide suspend/resume on Intel BYT/CHT
   related to the handling of I2C controllers in the ACPI LPSS driver
   for Intel SoCs (Hans de Goede).

 - Modify the ACPI namespace scanning code to enumerate INT33FE HID
   devices as platform devices with I2C resources to avoid device
   enumeration problems on boards with Dollar Cove or Whiskey Cove
   Intel PMICs (Hans de Goede).

 - Prevent ACPICA from using ktime_get() during early resume from
   system-wide suspend before resuming the timekeeping which generally
   is unsafe and triggers a warning from the timekeeping code (Bart
   Van Assche).

 - Add low-level real time clock support to the ACPI Time and Aalarm
   Device (TAD) driver (Rafael Wysocki).

 - Fix the ACPI SBS driver to avoid GPE storms on MacBook Pro and
   Oopses when removing modules (Ronald Tschalär).

 - Fix the ACPI PPTT parsing code to handle architecturally unknown
   cache types properly (Jeffrey Hugo).

 - Fix initialization issue in the ACPI processor driver (Dou Liyang).

 - Clean up the code in several places (Andy Shevchenko, Bartlomiej
   Zolnierkiewicz, David Arcari, zhong jiang).

Thanks!


---

Andy Shevchenko (4):
  ACPI / glue: Split dev_is_platform() out of module for wide use
  ACPI / property: Switch to bitmap_zalloc()
  ACPI / PMIC: Sort headers alphabetically
  ACPI / PMIC: Convert drivers to use SPDX identifier

Bart Van Assche (1):
  ACPI / OSL: Use 'jiffies' as the time bassis for acpi_os_get_timer()

Bartlomiej Zolnierkiewicz (1):
  ACPI: remove redundant 'default n' from Kconfig

Bob Moore (6):
  ACPICA: Update for generic_serial_bus and
attrib_raw_process_bytes protocol
  ACPICA: Rename some of the Field Attribute defines
  ACPICA: Update for field unit access
  ACPICA: Split large interpreter file
  ACPICA: Never run _REG on system_memory and system_IO
  ACPICA: Update version to 20181003

David Arcari (1):
  mailbox: PCC: handle parse error

Dou Liyang (1):
  ACPI / processor: Fix the return value of acpi_processor_ids_walk()

Erik Schmauss (4):
  ACPICA: AML interpreter: add region addresses in global list
during initialization
  ACPICA: AML Parser: fix parse loop to correctly skip erroneous
extended opcodes
  ACPICA: Remove acpi_gbl_group_module_level_code and only use
acpi_gbl_execute_tables_as_methods instead
  ACPI: probe ECDT before loading AML tables regardless of
module-level code flag

Hans de Goede (9):
  ACPI / LPSS: Add alternative ACPI HIDs for Cherry Trail DMA controllers
  ACPI / LPSS: Exclude I2C busses shared with PUNIT from pmc_atom_d3_mask
  ACPI / LPSS: Make hid_uid_match helper take an acpi_device as
first argument
  ACPI / LPSS: Make hid_uid_match helper accept a NULL uid argument
  ACPI / LPSS: Make acpi_lpss_find_device() also find PCI devices
  ACPI / LPSS: Add a device link from the GPU to the CHT I2C7 controller
  ACPI / LPSS: Add a device link from the GPU to the BYT I2C5 controller
  ACPI / LPSS: Resume BYT/CHT I2C controllers from resume_noirq
  ACPI / scan: Create platform device for INT33FE ACPI nodes

Jeffrey Hugo (2):
  drivers: base: cacheinfo: Do not populate sysfs for unknown cache types
  ACPI/PPTT: Handle architecturally unknown cache types

Rafael J. Wysocki (1):
  ACPI: TAD: Add low-level support for real time capability

Ronald Tschalär (2):
  ACPI / SBS: Fix GPE storm on recent MacBookPro's
  ACPI / SBS: Fix rare oops when removing modules

zhong jiang (1):
  ACPI: custom_method: remove meaningless null check before debugfs_remove()

---

 drivers/acpi/Kconfig |   6 -
 drivers/acpi/acpi_lpss.c | 118 -

[GIT PULL] Power management updates for v4.20-rc1

2018-10-22 Thread Rafael J. Wysocki
amily 0x17 msr_pstate size
  cpupower: Fix coredump on VMWare

Rafael J. Wysocki (12):
  cpuidle: menu: Replace data->predicted_us with local variable
  cpuidle: menu: Fix wakeup statistics updates for polling state
  cpuidle: menu: Compute first_idx when latency_req is known
  cpuidle: menu: Get rid of first_idx from menu_select()
  cpuidle: menu: Do not update last_state_idx in menu_select()
  cpuidle: menu: Avoid computations for very close timers
  cpuidle: menu: Move the latency_req == 0 special case check
  cpuidle: poll_state: Revise loop termination condition
  cpuidle: menu: Simplify checks related to the polling state
  cpufreq: conservative: Take limits changes into account properly
  cpuidle: menu: Drop redundant comparison
  cpuidle: menu: Avoid computations when result will be discarded

Rajneesh Bhardwaj (1):
  ACPI / PM: LPIT: Register sysfs attributes based on FADT

Rob Herring (2):
  cpufreq: Convert to using %pOFn instead of device_node.name
  PM / devfreq: Convert to using %pOFn instead of device_node.name

Srinivas Pandruvada (3):
  ACPI / CPPC: Add support for guaranteed performance
  cpufreq: intel_pstate: Add base_frequency attribute
  Documentation: intel_pstate: Add base_frequency information

Todd Brandt (3):
  PM / sleep: Show freezing tasks that caused a suspend abort
  PM / tools: sleepgraph: first batch of v5.2 changes
  PM / tools: sleepgraph and bootgraph: upgrade to v5.2

Ulf Hansson (3):
  PM / Domains: Don't treat zero found compatible idle states as an error
  PM / Domains: Deal with multiple states but no governor in genpd
  PM / Domains: Document flags for genpd

Vincent Donnefort (1):
  PM / devfreq: stopping the governor before device_unregister()

Viresh Kumar (15):
  OPP: Free OPP table properly on performance state irregularities
  OPP: Don't try to remove all OPP tables on failure
  OPP: Protect dev_list with opp_table lock
  OPP: Pass index to _of_init_opp_table()
  OPP: Parse OPP table's DT properties from _of_init_opp_table()
  OPP: Don't take OPP table's kref for static OPPs
  OPP: Create separate kref for static OPPs list
  cpufreq: mvebu: Remove OPPs using dev_pm_opp_remove()
  OPP: Don't remove dynamic OPPs from _dev_pm_opp_remove_table()
  OPP: Use a single mechanism to free the OPP table
  OPP: Prevent creating multiple OPP tables for devices sharing OPP nodes
  OPP: Pass OPP table to _of_add_opp_table_v{1|2}()
  OPP: Improve error handling in dev_pm_opp_of_cpumask_add_table()
  OPP: Return error on error from dev_pm_opp_get_opp_count()
  cpufreq: dt: Try freeing static OPPs only if we have added them

Vladimir D. Seleznev (1):
  PM / hibernate: Documentation: fix image_size default value

Zhimin Gu (11):
  x86, hibernate: Fix nosave_regions setup for hibernation
  x86-32/asm/power: Create stack frames in hibernate_asm_32.S
  x86, hibernate: Extract the common code of 64/32 bit system
  x86-32, hibernate: Enable CONFIG_ARCH_HIBERNATION_HEADER on 32bit system
  x86, hibernate: Rename temp_level4_pgt to temp_pgt
  x86-32, hibernate: Use temp_pgt as the temporary page table
  x86-32, hibernate: Use the page size macro instead of constant value
  x86-32, hibernate: Switch to original page table after resumed
  x86-32, hibernate: Switch to relocated restore code during
resume on 32bit system
  x86-32, hibernate: Set up temporary text mapping for 32bit system
  x86-32, hibernate: Adjust in_suspend after resumed on 32bit system

zhong jiang (1):
  PM / devfreq: remove redundant null pointer check before kfree

---

 Documentation/ABI/testing/sysfs-power  |2 +-
 Documentation/admin-guide/pm/intel_pstate.rst  |7 +
 Documentation/power/swsusp.txt |2 +-
 arch/x86/Kconfig   |2 +-
 arch/x86/include/asm/suspend.h |8 +
 arch/x86/include/asm/suspend_32.h  |4 +
 arch/x86/kernel/setup.c|2 +-
 arch/x86/power/Makefile|2 +-
 arch/x86/power/hibernate.c |  248 +++
 arch/x86/power/hibernate_32.c  |   52 +-
 arch/x86/power/hibernate_64.c  |  224 +--
 arch/x86/power/hibernate_asm_32.S  |   37 +-
 arch/x86/power/hibernate_asm_64.S  |2 +-
 drivers/acpi/acpi_lpit.c   |6 +
 drivers/acpi/cppc_acpi.c   |8 +-
 drivers/base/power/domain.c|   20 +-
 drivers/cpufreq/cppc_cpufreq.c |2 +-
 drivers/cpufreq/cpufreq-dt-platdev.c   |6 +-
 drivers/cpufreq/cpufreq-dt.c   |   34 +-
 drivers/cpufreq/cpufreq.c  |2 +-
 dr

Re: [PATCH 0/1] ACPI / scan: Create platform device for INT33FE ACPI nodes

2018-10-19 Thread Rafael J. Wysocki
On Wednesday, October 17, 2018 1:39:54 PM CEST Andy Shevchenko wrote:
> On Wed, Oct 17, 2018 at 11:59 AM Hans de Goede  wrote:
> >
> > Hi Rafael, Andy,
> >
> > For the why and what of this patch see the (somewhat long) commit message.
> >
> > The single patch in this set both touches drivers/acpi/scan.c and
> > drivers/platform/x86/intel_cht_int33fe.c, this is done this way to avoid
> > regressions when bisecting.
> >
> > The main change here really is to ACPI change and intel_cht_int33fe.c is
> > modified to follow suit. Also I do not expect intel_cht_int33fe.c to see
> > much changes this cycle. As such I believe it would be best to merge this
> > patch through Rafael's tree (after review).
> >
> > Andy is that ok with you and we have your ack for this?
> 
> I love this patch!
> 
> Definitely,
> Acked-by: Andy Shevchenko 

Patch applied, thanks!



Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

2018-10-18 Thread Rafael J. Wysocki
On Thu, Oct 18, 2018 at 11:54 AM Daniel Lezcano
 wrote:
>
> On 18/10/2018 11:42, Rafael J. Wysocki wrote:
> > On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
> >  wrote:
> >>
> >> It was suggested to set the scene for the PM components in the
> >> scheduler code organization in the recent discussion about making the
> >> scheduler aware of the capacity capping from the thermal framework.
> >>
> >> Move the cpufreq files into its own directory as suggested at:
> >>
> >> https://lkml.org/lkml/2018/10/18/353
> >> https://lkml.org/lkml/2018/10/18/408
> >
> > Fair enough, but do we need to do that right now?
>
> I'm not planning to do more code reorg than this patch right now. Up to
> you to decide if you are willing to take them.

The SPDX one certainly is applicable, but I'm not sure about the other one TBH.

Why don't you add the SPDX IDs to those files as they are for now?


Re: [PATCH 1/2] sched/cpufreq: Reorganize the cpufreq files

2018-10-18 Thread Rafael J. Wysocki
On Thu, Oct 18, 2018 at 11:36 AM Daniel Lezcano
 wrote:
>
> It was suggested to set the scene for the PM components in the
> scheduler code organization in the recent discussion about making the
> scheduler aware of the capacity capping from the thermal framework.
>
> Move the cpufreq files into its own directory as suggested at:
>
> https://lkml.org/lkml/2018/10/18/353
> https://lkml.org/lkml/2018/10/18/408

Fair enough, but do we need to do that right now?


Re: [PATCH 1/3] x86: baytrail/cherrytrail: Rework and move P-Unit PMIC bus semaphore code

2018-10-18 Thread Rafael J. Wysocki
On Thursday, October 18, 2018 10:34:57 AM CEST Hans de Goede wrote:
> Hi,
> 
> On 18-10-18 09:29, Rafael J. Wysocki wrote:
> > On Sun, Sep 23, 2018 at 4:45 PM Hans de Goede  wrote:
> >>
> >> On some BYT/CHT systems the SoC's P-Unit shares the I2C bus with the
> >> kernel. The P-Unit has a semaphore for the PMIC bus which we can take to
> >> block it from accessing the shared bus while the kernel wants to access it.
> >>
> >> Currently we have the I2C-controller driver acquiring and releasing the
> >> semaphore around each I2C transfer. There are 2 problems with this:
> >>
> >> 1) PMIC accesses often come in the form of a read-modify-write on one of
> >> the PMIC registers, we currently release the P-Unit's PMIC bus semaphore
> >> between the read and the write. If the P-Unit modifies the register during
> >> this window?, then we end up overwriting the P-Unit's changes.
> >> I believe that this is mostly an academic problem, but I'm not sure.
> >>
> >> 2) To safely access the shared I2C bus, we need to do 3 things:
> >> a) Notify the GPU driver that we are starting a window in which it may not
> >> access the P-Unit, since the P-Unit seems to ignore the semaphore for
> >> explicit power-level requests made by the GPU driver
> >> b) Make a pm_qos request to force all CPU cores out of C6/C7 since entering
> >> C6/C7 while we hold the semaphore hangs the SoC
> >> c) Finally take the P-Unit's PMIC bus semaphore
> >> All 3 these steps together are somewhat expensive, so ideally if we have
> >> a bunch of i2c transfers grouped together we only do this once for the
> >> entire group.
> >>
> >> Taking the read-modify-write on a PMIC register as example then ideally we
> >> would only do all 3 steps once at the beginning and undo all 3 steps once
> >> at the end.
> >>
> >> For this we need to be able to take the semaphore from within e.g. the PMIC
> >> opregion driver, yet we do not want to remove the taking of the semaphore
> >> from the I2C-controller driver, as that is still necessary to protect many
> >> other code-paths leading to accessing the shared I2C bus.
> >>
> >> This means that we first have the PMIC driver acquire the semaphore and
> >> then have the I2C controller driver trying to acquire it again.
> >>
> >> To make this possible this commit does the following:
> >>
> >> 1) Move the semaphore code from being private to the I2C controller driver
> >> into the generic iosf_mbi code, which already has other code to deal with
> >> the shared bus so that it can be accessed outside of the I2C bus driver.
> >>
> >> 2) Rework the code so that it can be called multiple times nested, while
> >> still blocking I2C accesses while e.g. the GPU driver has indicated the
> >> P-Unit needs the bus through a iosf_mbi_punit_acquire() call.
> >>
> >> Signed-off-by: Hans de Goede 
> > 
> > If there are no objections or concerns regarding this patch, I'm
> > inclined to take the entire series including it.
> 
> In that case let me send out a v4, with the following chunk added to the
> 2nd patch:
> 
> --- a/drivers/acpi/Kconfig
> +++ b/drivers/acpi/Kconfig
> @@ -515,7 +515,7 @@ config CRC_PMIC_OPREGION
> 
>   config XPOWER_PMIC_OPREGION
>  bool "ACPI operation region support for XPower AXP288 PMIC"
> -   depends on MFD_AXP20X_I2C
> +   depends on MFD_AXP20X_I2C && IOSF_MBI
>  help
>This config adds ACPI operation region support for XPower AXP288 
> PMIC.
> 
> This is necessary to avoid compilation issues on non x86 systems (where the
> asm/iosf_mbi.h header is not available) and on x86 systems in case
> IOSF_MBI support is not enabled there.  Note that the AXP288 PMIC is
> connected through the LPSS i2c controller, so either we have IOSF_MBI support
> selected through the X86_INTEL_LPSS option, or we have a kernel where the
> opregion will never work anyways.

I'd prefer to get an incremental patch for that at this point.

Thanks,
Rafael



Re: [RFC PATCH 0/7] Introduce thermal pressure

2018-10-18 Thread Rafael J. Wysocki
On Thu, Oct 18, 2018 at 9:50 AM Ingo Molnar  wrote:
>
>
> * Rafael J. Wysocki  wrote:
>
> > > The only long term maintainable solution is to move all high level
> > > cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> > > which has been done to a fair degree already in the past ~2 years - but
> > > it's unclear to me to what extent this is true for thermal throttling
> > > policy currently: there might be more governor surgery and code
> > > reshuffling required?
> >
> > It doesn't cover thermal management directly ATM.
> >
> > The EAS work kind of hopes to make a connection in there by adding a
> > common energy model to underlie both the performance scaling and
> > thermal management, but it doesn't change the thermal decision making
> > part AFAICS.
> >
> > So it is fair to say that additional governor surgery and code
> > reshuffling will be required IMO.
>
> BTW., when factoring out high level thermal management code it might make
> sense to increase the prominence of the cpufreq code within the scheduler
> and organize it a bit better, by introducing its own
> kernel/sched/cpufreq/ directory and renaming things the following way:
>
>   kernel/sched/cpufreq.c=> kernel/sched/cpufreq/core.c
>   kernel/sched/cpufreq_schedutil.c  => kernel/sched/cpufreq/metrics.c
>   kernel/sched/thermal.c=> kernel/sched/cpufreq/thermal.c
>
> ... or so?
>
> With no change to functionality, this is just a re-organization and
> expansion/preparation for the bright future. =B-)

No disagreement here. :-)

Cheers,
Rafael


Re: [driver-core PATCH v4 3/6] device core: Consolidate locking and unlocking of parent and device

2018-10-18 Thread Rafael J. Wysocki
On Mon, Oct 15, 2018 at 5:09 PM Alexander Duyck
 wrote:
>
> This patch is meant to try and consolidate all of the locking and unlocking
> of both the parent and device when attaching or removing a driver from a
> given device.
>
> To do that I first consolidated the lock pattern into two functions
> __device_driver_lock and __device_driver_unlock. After doing that I then
> created functions specific to attaching and detaching the driver while
> acquiring this locks. By doing this I was able to reduce the number of
> spots where we touch need_parent_lock from 12 down to 4.
>
> Signed-off-by: Alexander Duyck 

Reviewed-by: Rafael J. Wysocki 

> ---
>  drivers/base/base.h |2 +
>  drivers/base/bus.c  |   23 ++
>  drivers/base/dd.c   |   83 
> ++-
>  3 files changed, 75 insertions(+), 33 deletions(-)
>
> diff --git a/drivers/base/base.h b/drivers/base/base.h
> index 7a419a7a6235..3f22ebd6117a 100644
> --- a/drivers/base/base.h
> +++ b/drivers/base/base.h
> @@ -124,6 +124,8 @@ extern int driver_add_groups(struct device_driver *drv,
>  const struct attribute_group **groups);
>  extern void driver_remove_groups(struct device_driver *drv,
>  const struct attribute_group **groups);
> +int device_driver_attach(struct device_driver *drv, struct device *dev);
> +void device_driver_detach(struct device *dev);
>
>  extern char *make_class_name(const char *name, struct kobject *kobj);
>
> diff --git a/drivers/base/bus.c b/drivers/base/bus.c
> index 8bfd27ec73d6..8a630f9bd880 100644
> --- a/drivers/base/bus.c
> +++ b/drivers/base/bus.c
> @@ -184,11 +184,7 @@ static ssize_t unbind_store(struct device_driver *drv, 
> const char *buf,
>
> dev = bus_find_device_by_name(bus, NULL, buf);
> if (dev && dev->driver == drv) {
> -   if (dev->parent && dev->bus->need_parent_lock)
> -   device_lock(dev->parent);
> -   device_release_driver(dev);
> -   if (dev->parent && dev->bus->need_parent_lock)
> -   device_unlock(dev->parent);
> +   device_driver_detach(dev);
> err = count;
> }
> put_device(dev);
> @@ -211,13 +207,7 @@ static ssize_t bind_store(struct device_driver *drv, 
> const char *buf,
>
> dev = bus_find_device_by_name(bus, NULL, buf);
> if (dev && dev->driver == NULL && driver_match_device(drv, dev)) {
> -   if (dev->parent && bus->need_parent_lock)
> -   device_lock(dev->parent);
> -   device_lock(dev);
> -   err = driver_probe_device(drv, dev);
> -   device_unlock(dev);
> -   if (dev->parent && bus->need_parent_lock)
> -   device_unlock(dev->parent);
> +   err = device_driver_attach(drv, dev);
>
> if (err > 0) {
> /* success */
> @@ -769,13 +759,8 @@ int bus_rescan_devices(struct bus_type *bus)
>   */
>  int device_reprobe(struct device *dev)
>  {
> -   if (dev->driver) {
> -   if (dev->parent && dev->bus->need_parent_lock)
> -   device_lock(dev->parent);
> -   device_release_driver(dev);
> -   if (dev->parent && dev->bus->need_parent_lock)
> -   device_unlock(dev->parent);
> -   }
> +   if (dev->driver)
> +   device_driver_detach(dev);
> return bus_rescan_devices_helper(dev, NULL);
>  }
>  EXPORT_SYMBOL_GPL(device_reprobe);
> diff --git a/drivers/base/dd.c b/drivers/base/dd.c
> index 169412ee4ae8..e845cd2a87af 100644
> --- a/drivers/base/dd.c
> +++ b/drivers/base/dd.c
> @@ -864,6 +864,60 @@ void device_initial_probe(struct device *dev)
> __device_attach(dev, true);
>  }
>
> +/*
> + * __device_driver_lock - acquire locks needed to manipulate dev->drv
> + * @dev: Device we will update driver info for
> + * @parent: Parent device needed if the bus requires parent lock
> + *
> + * This function will take the required locks for manipulating dev->drv.
> + * Normally this will just be the @dev lock, but when called for a USB
> + * interface, @parent lock will be held as well.
> + */
> +static void __device_driver_lock(struct device *dev, struct device *parent)
> +{
> +   if (parent && dev->bus->need_parent_lock)
> +   device_lock(parent);
> +   device_lock(dev);
> +}
> +
> +/*
&g

Re: [PATCH v4] kernel/hung_task.c: disable on suspend

2018-10-18 Thread Rafael J. Wysocki
On Wed, Oct 17, 2018 at 1:24 PM Vitaly Kuznetsov  wrote:
>
> It is possible to observe hung_task complaints when system goes to
> suspend-to-idle state:
>
>  # echo freeze > /sys/power/state
>
>  PM: Syncing filesystems ... done.
>  Freezing user space processes ... (elapsed 0.001 seconds) done.
>  OOM killer disabled.
>  Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
>  sd 0:0:0:0: [sda] Synchronizing SCSI cache
>  INFO: task bash:1569 blocked for more than 120 seconds.
>Not tainted 4.19.0-rc3_+ #687
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  bashD0  1569604 0x
>  Call Trace:
>   ? __schedule+0x1fe/0x7e0
>   schedule+0x28/0x80
>   suspend_devices_and_enter+0x4ac/0x750
>   pm_suspend+0x2c0/0x310
>
> Register a PM notifier to disable the detector on suspend and re-enable
> back on wakeup.
>
> Signed-off-by: Vitaly Kuznetsov 

Thanks for your patience with this!

Are there any objections or concerns regarding this patch?

> ---
> Changes since v3:
> - Handle PM_RESTORE_PREPARE/PM_POST_RESTORE for completeness
>   [Rafael J. Wysocki]
> ---
>  kernel/hung_task.c | 30 +-
>  1 file changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b9132d1269ef..cb8e3e8ac7b9 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -242,6 +243,28 @@ void reset_hung_task_detector(void)
>  }
>  EXPORT_SYMBOL_GPL(reset_hung_task_detector);
>
> +static bool hung_detector_suspended;
> +
> +static int hungtask_pm_notify(struct notifier_block *self,
> + unsigned long action, void *hcpu)
> +{
> +   switch (action) {
> +   case PM_SUSPEND_PREPARE:
> +   case PM_HIBERNATION_PREPARE:
> +   case PM_RESTORE_PREPARE:
> +   hung_detector_suspended = true;
> +   break;
> +   case PM_POST_SUSPEND:
> +   case PM_POST_HIBERNATION:
> +   case PM_POST_RESTORE:
> +   hung_detector_suspended = false;
> +   break;
> +   default:
> +   break;
> +   }
> +   return NOTIFY_OK;
> +}
> +
>  /*
>   * kthread which checks for tasks stuck in D state
>   */
> @@ -261,7 +284,8 @@ static int watchdog(void *dummy)
> interval = min_t(unsigned long, interval, timeout);
> t = hung_timeout_jiffies(hung_last_checked, interval);
> if (t <= 0) {
> -   if (!atomic_xchg(_hung_task, 0))
> +   if (!atomic_xchg(_hung_task, 0) &&
> +   !hung_detector_suspended)
> check_hung_uninterruptible_tasks(timeout);
> hung_last_checked = jiffies;
> continue;
> @@ -275,6 +299,10 @@ static int watchdog(void *dummy)
>  static int __init hung_task_init(void)
>  {
> atomic_notifier_chain_register(_notifier_list, _block);
> +
> +   /* Disable hung task detector on suspend */
> +   pm_notifier(hungtask_pm_notify, 0);
> +
> watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");
>
> return 0;
> --
> 2.17.1
>


Re: [PATCH 1/3] x86: baytrail/cherrytrail: Rework and move P-Unit PMIC bus semaphore code

2018-10-18 Thread Rafael J. Wysocki
On Sun, Sep 23, 2018 at 4:45 PM Hans de Goede  wrote:
>
> On some BYT/CHT systems the SoC's P-Unit shares the I2C bus with the
> kernel. The P-Unit has a semaphore for the PMIC bus which we can take to
> block it from accessing the shared bus while the kernel wants to access it.
>
> Currently we have the I2C-controller driver acquiring and releasing the
> semaphore around each I2C transfer. There are 2 problems with this:
>
> 1) PMIC accesses often come in the form of a read-modify-write on one of
> the PMIC registers, we currently release the P-Unit's PMIC bus semaphore
> between the read and the write. If the P-Unit modifies the register during
> this window?, then we end up overwriting the P-Unit's changes.
> I believe that this is mostly an academic problem, but I'm not sure.
>
> 2) To safely access the shared I2C bus, we need to do 3 things:
> a) Notify the GPU driver that we are starting a window in which it may not
> access the P-Unit, since the P-Unit seems to ignore the semaphore for
> explicit power-level requests made by the GPU driver
> b) Make a pm_qos request to force all CPU cores out of C6/C7 since entering
> C6/C7 while we hold the semaphore hangs the SoC
> c) Finally take the P-Unit's PMIC bus semaphore
> All 3 these steps together are somewhat expensive, so ideally if we have
> a bunch of i2c transfers grouped together we only do this once for the
> entire group.
>
> Taking the read-modify-write on a PMIC register as example then ideally we
> would only do all 3 steps once at the beginning and undo all 3 steps once
> at the end.
>
> For this we need to be able to take the semaphore from within e.g. the PMIC
> opregion driver, yet we do not want to remove the taking of the semaphore
> from the I2C-controller driver, as that is still necessary to protect many
> other code-paths leading to accessing the shared I2C bus.
>
> This means that we first have the PMIC driver acquire the semaphore and
> then have the I2C controller driver trying to acquire it again.
>
> To make this possible this commit does the following:
>
> 1) Move the semaphore code from being private to the I2C controller driver
> into the generic iosf_mbi code, which already has other code to deal with
> the shared bus so that it can be accessed outside of the I2C bus driver.
>
> 2) Rework the code so that it can be called multiple times nested, while
> still blocking I2C accesses while e.g. the GPU driver has indicated the
> P-Unit needs the bus through a iosf_mbi_punit_acquire() call.
>
> Signed-off-by: Hans de Goede 

If there are no objections or concerns regarding this patch, I'm
inclined to take the entire series including it.

> ---
> Note this commit deliberately limits the i2c-designware changes to
> only touch i2c-designware-baytrail.c, deliberately not doing some cleanups
> which become possible after removing the semaphore code from the
> i2c-designmware code. This is done so that this commit can be merged
> through the x86 tree without causing conflicts in the i2c tree.
>
> The cleanups to the i2c-designware tree will be done in a follow up
> patch which can be merged once this commit is in place.
> ---
>  arch/x86/include/asm/iosf_mbi.h  |  39 +++--
>  arch/x86/platform/intel/iosf_mbi.c   | 161 +--
>  drivers/i2c/busses/i2c-designware-baytrail.c | 125 +-
>  3 files changed, 180 insertions(+), 145 deletions(-)
>
> diff --git a/arch/x86/include/asm/iosf_mbi.h b/arch/x86/include/asm/iosf_mbi.h
> index 3de0489deade..5270ff39b9af 100644
> --- a/arch/x86/include/asm/iosf_mbi.h
> +++ b/arch/x86/include/asm/iosf_mbi.h
> @@ -105,8 +105,10 @@ int iosf_mbi_modify(u8 port, u8 opcode, u32 offset, u32 
> mdr, u32 mask);
>   * the PMIC bus while another driver is also accessing the PMIC bus various 
> bad
>   * things happen.
>   *
> - * To avoid these problems this function must be called before accessing the
> - * P-Unit or the PMIC, be it through iosf_mbi* functions or through other 
> means.
> + * Call this function before sending requests to the P-Unit which may make it
> + * access the PMIC, be it through iosf_mbi* functions or through other means.
> + * This function will block all kernel access to the PMIC I2C bus, so that 
> the
> + * P-Unit can safely access the PMIC over the shared I2C bus.
>   *
>   * Note on these systems the i2c-bus driver will request a sempahore from the
>   * P-Unit for exclusive access to the PMIC bus when i2c drivers are accessing
> @@ -122,6 +124,31 @@ void iosf_mbi_punit_acquire(void);
>   */
>  void iosf_mbi_punit_release(void);
>
> +/**
> + * iosf_mbi_block_punit_i2c_access() - Block P-Unit accesses to the PMIC bus
> + *
> + * Call this function to block P-Unit access to the PMIC I2C bus, so that the
> + * kernel can safely access the PMIC over the shared I2C bus.
> + *
> + * This function acquires the P-Unit bus semaphore and notifies
> + * pmic_bus_access_notifier listeners that they may no longer access the
> + * P-Unit in a way which 

Re: [RFC PATCH 0/7] Introduce thermal pressure

2018-10-18 Thread Rafael J. Wysocki
On Thu, Oct 18, 2018 at 8:48 AM Ingo Molnar  wrote:
>
>
> * Thara Gopinath  wrote:
>
> > On 10/16/2018 03:33 AM, Ingo Molnar wrote:
> > >
> > > * Thara Gopinath  wrote:
> > >
> >  Regarding testing, basic build, boot and sanity testing have been
> >  performed on hikey960 mainline kernel with debian file system.
> >  Further aobench (An occlusion renderer for benchmarking realworld
> >  floating point performance) showed the following results on hikey960
> >  with debain.
> > 
> >  Result  Standard   
> >   Standard
> >  (Time secs) Error  
> >   Deviation
> >  Hikey 960 - no thermal pressure applied 138.67  6.52   
> >   11.52%
> >  Hikey 960 -  thermal pressure applied   122.37  5.78   
> >   11.57%
> > >>>
> > >>> Wow, +13% speedup, impressive! We definitely want this outcome.
> > >>>
> > >>> I'm wondering what happens if we do not track and decay the thermal
> > >>> load at all at the PELT level, but instantaneously decrease/increase
> > >>> effective CPU capacity in reaction to thermal events we receive from
> > >>> the CPU.
> > >>
> > >> The problem with instantaneous update is that sometimes thermal events
> > >> happen at a much faster pace than cpu_capacity is updated in the
> > >> scheduler. This means that at the moment when scheduler uses the
> > >> value, it might not be correct anymore.
> > >
> > > Let me offer a different interpretation: if we average throttling events
> > > then we create a 'smooth' average of 'true CPU capacity' that doesn't
> > > fluctuate much. This allows more stable yet asymmetric task placement if
> > > the thermal characteristics of the different cores is different
> > > (asymmetric). This, compared to instantaneous updates, would reduce
> > > unnecessary task migrations between cores.
> > >
> > > Is that accurate?
> >
> > Yes. I think it is accurate. I will also add that if we don't average
> > throttling events, we will miss the events that occur in between load
> > balancing(LB) period.
>
> Yeah, so I'd definitely suggest to not integrate this averaging into
> pelt.c in the fashion presented, because:
>
>  - This couples your thermal throttling averaging to the PELT decay
>half-time AFAICS, which would break the other user every time the
>decay is changed/tuned.
>
>  - The boolean flag that changes behavior in pelt.c is not particularly
>clean either and complicates the code.
>
>  - Instead maybe factor out a decaying average library into
>kernel/sched/avg.h perhaps (if this truly improves the code), and use
>those methods both in pelt.c and any future thermal.c - and maybe
>other places where we do decaying averages.
>
>  - But simple decaying averages are not that complex either, so I think
>your original solution of open coding it is probably fine as well.
>
> Furthermore, any logic introduced by thermal.c and the resulting change
> to load-balancing behavior would have to be in perfect sync with cpufreq
> governor actions - one mechanism should not work against the other.

Right, that really is required.

> The only long term maintainable solution is to move all high level
> cpufreq logic and policy handling code into kernel/sched/cpufreq*.c,
> which has been done to a fair degree already in the past ~2 years - but
> it's unclear to me to what extent this is true for thermal throttling
> policy currently: there might be more governor surgery and code
> reshuffling required?

It doesn't cover thermal management directly ATM.

The EAS work kind of hopes to make a connection in there by adding a
common energy model to underlie both the performance scaling and
thermal management, but it doesn't change the thermal decision making
part AFAICS.

So it is fair to say that additional governor surgery and code
reshuffling will be required IMO.

Thanks,
Rafael


Re: [PATCH 0/2] cpuidle: menu: Two more refinements

2018-10-17 Thread Rafael J. Wysocki
On Mon, Oct 15, 2018 at 1:57 PM Rafael J. Wysocki  wrote:
>
> Hi All,
>
> The following patches are two more refinements of the menu governor (on top
> of the recent changes in linux-next now).  The are not expected to change the
> overall behavior of it.
>
> Please refer to the patch changelogs for details.

Any objections or concerns regarding the two patches?

If not, I will queue them up.

Thanks,
Rafael


Re: [PATCH] ACPI: TAD: Add low-level support for real time capability

2018-10-17 Thread Rafael J. Wysocki
On Mon, Oct 15, 2018 at 1:59 PM Rafael J. Wysocki  wrote:
>
> From: Rafael J. Wysocki 
>
> Add low-level support for the (optional) real time capability of the
> ACPI Time and Alarm Device (TAD) to the ACPI TAD driver.
>
> This allows the real time to be acquired or set via sysfs with the
> help of the _GRT and _SRT methods of the TAD, respectively.
>
> Signed-off-by: Rafael J. Wysocki 
> Tested-by: Mika Westerberg 

Any objections or concerns here?

If not, I will queue this up.

> ---
>  drivers/acpi/acpi_tad.c |  201 
> 
>  1 file changed, 201 insertions(+)
>
> Index: linux-pm/drivers/acpi/acpi_tad.c
> ===
> --- linux-pm.orig/drivers/acpi/acpi_tad.c
> +++ linux-pm/drivers/acpi/acpi_tad.c
> @@ -52,6 +52,201 @@ struct acpi_tad_driver_data {
> u32 capabilities;
>  };
>
> +struct acpi_tad_rt {
> +   u16 year;  /* 1900 -  */
> +   u8 month;  /* 1 - 12 */
> +   u8 day;/* 1 - 31 */
> +   u8 hour;   /* 0 - 23 */
> +   u8 minute; /* 0 - 59 */
> +   u8 second; /* 0 - 59 */
> +   u8 valid;  /* 0 (failed) or 1 (success) for reads, 0 for writes */
> +   u16 msec;  /* 1 - 1000 */
> +   s16 tz;/* -1440 to 1440 or 2047 (unspecified) */
> +   u8 daylight;
> +   u8 padding[3]; /* must be 0 */
> +} __packed;
> +
> +static int acpi_tad_set_real_time(struct device *dev, struct acpi_tad_rt *rt)
> +{
> +   acpi_handle handle = ACPI_HANDLE(dev);
> +   union acpi_object args[] = {
> +   { .type = ACPI_TYPE_BUFFER, },
> +   };
> +   struct acpi_object_list arg_list = {
> +   .pointer = args,
> +   .count = ARRAY_SIZE(args),
> +   };
> +   unsigned long long retval;
> +   acpi_status status;
> +
> +   if (rt->year < 1900 || rt->year >  ||
> +   rt->month < 1 || rt->month > 12 ||
> +   rt->hour > 23 || rt->minute > 59 || rt->second > 59 ||
> +   rt->tz < -1440 || (rt->tz > 1440 && rt->tz != 2047) ||
> +   rt->daylight > 3)
> +   return -ERANGE;
> +
> +   args[0].buffer.pointer = (u8 *)rt;
> +   args[0].buffer.length = sizeof(*rt);
> +
> +   pm_runtime_get_sync(dev);
> +
> +   status = acpi_evaluate_integer(handle, "_SRT", _list, );
> +
> +   pm_runtime_put_sync(dev);
> +
> +   if (ACPI_FAILURE(status) || retval)
> +   return -EIO;
> +
> +   return 0;
> +}
> +
> +static int acpi_tad_get_real_time(struct device *dev, struct acpi_tad_rt *rt)
> +{
> +   acpi_handle handle = ACPI_HANDLE(dev);
> +   struct acpi_buffer output = { ACPI_ALLOCATE_BUFFER };
> +   union acpi_object *out_obj;
> +   struct acpi_tad_rt *data;
> +   acpi_status status;
> +   int ret = -EIO;
> +
> +   pm_runtime_get_sync(dev);
> +
> +   status = acpi_evaluate_object(handle, "_GRT", NULL, );
> +
> +   pm_runtime_put_sync(dev);
> +
> +   if (ACPI_FAILURE(status))
> +   goto out_free;
> +
> +   out_obj = output.pointer;
> +   if (out_obj->type != ACPI_TYPE_BUFFER)
> +   goto out_free;
> +
> +   if (out_obj->buffer.length != sizeof(*rt))
> +   goto out_free;
> +
> +   data = (struct acpi_tad_rt *)(out_obj->buffer.pointer);
> +   if (!data->valid)
> +   goto out_free;
> +
> +   memcpy(rt, data, sizeof(*rt));
> +   ret = 0;
> +
> +out_free:
> +   ACPI_FREE(output.pointer);
> +   return ret;
> +}
> +
> +static char *acpi_tad_rt_next_field(char *s, int *val)
> +{
> +   char *p;
> +
> +   p = strchr(s, ':');
> +   if (!p)
> +   return NULL;
> +
> +   *p = '\0';
> +   if (kstrtoint(s, 10, val))
> +   return NULL;
> +
> +   return p + 1;
> +}
> +
> +static ssize_t time_store(struct device *dev, struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> +   struct acpi_tad_rt rt;
> +   char *str, *s;
> +   int val, ret = -ENODATA;
> +
> +   str = kmemdup_nul(buf, count, GFP_KERNEL);
> +   if (!str)
> +   return -ENOMEM;
> +
> +   s = acpi_tad_rt_next_field(str, );
> +   if (!s)
> +   goto out_free;
> +
> +   rt.year = val;
> +
> +   s = acpi_tad_rt_next_field(s, );
> +   if (!s)
> +   goto out_free;
> +
> + 

Re: [PATCH v3] kernel/hung_task.c: disable on suspend

2018-10-17 Thread Rafael J. Wysocki
On Tue, Oct 16, 2018 at 6:55 PM Vitaly Kuznetsov  wrote:
>
> It is possible to observe hung_task complaints when system goes to
> suspend-to-idle state:
>
>  # echo freeze > /sys/power/state
>
>  PM: Syncing filesystems ... done.
>  Freezing user space processes ... (elapsed 0.001 seconds) done.
>  OOM killer disabled.
>  Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
>  sd 0:0:0:0: [sda] Synchronizing SCSI cache
>  INFO: task bash:1569 blocked for more than 120 seconds.
>Not tainted 4.19.0-rc3_+ #687
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  bashD0  1569604 0x
>  Call Trace:
>   ? __schedule+0x1fe/0x7e0
>   schedule+0x28/0x80
>   suspend_devices_and_enter+0x4ac/0x750
>   pm_suspend+0x2c0/0x310
>
> Register a PM notifier to disable the detector on suspend and re-enable
> back on wakeup.
>
> Signed-off-by: Vitaly Kuznetsov 
> ---
> Changes since v2:
> - Resurrect 'v1' as zeroing timeouts can be racy [Rafael J. Wysocki]
> ---
>  kernel/hung_task.c | 28 +++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b9132d1269ef..41955c5d8427 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -242,6 +243,26 @@ void reset_hung_task_detector(void)
>  }
>  EXPORT_SYMBOL_GPL(reset_hung_task_detector);
>
> +static bool hung_detector_suspended;
> +
> +static int hungtask_pm_notify(struct notifier_block *self,
> + unsigned long action, void *hcpu)
> +{
> +   switch (action) {
> +   case PM_SUSPEND_PREPARE:
> +   case PM_HIBERNATION_PREPARE:

Please add PM_RESTORE_PREPARE here ->

> +   hung_detector_suspended = true;
> +   break;
> +   case PM_POST_SUSPEND:
> +   case PM_POST_HIBERNATION:

-> and PM_POST_RESTORE here for completeness.

> +   hung_detector_suspended = false;
> +   break;
> +   default:
> +   break;
> +   }
> +   return NOTIFY_OK;
> +}
> +
>  /*
>   * kthread which checks for tasks stuck in D state
>   */
> @@ -261,7 +282,8 @@ static int watchdog(void *dummy)
> interval = min_t(unsigned long, interval, timeout);
> t = hung_timeout_jiffies(hung_last_checked, interval);
> if (t <= 0) {
> -   if (!atomic_xchg(_hung_task, 0))
> +   if (!atomic_xchg(_hung_task, 0) &&
> +   !hung_detector_suspended)
> check_hung_uninterruptible_tasks(timeout);
> hung_last_checked = jiffies;
> continue;
> @@ -275,6 +297,10 @@ static int watchdog(void *dummy)
>  static int __init hung_task_init(void)
>  {
> atomic_notifier_chain_register(_notifier_list, _block);
> +
> +   /* Disable hung task detector on suspend */
> +   pm_notifier(hungtask_pm_notify, 0);
> +
> watchdog_task = kthread_run(watchdog, NULL, "khungtaskd");
>
> return 0;
> --

Apart from the above it is fine by me.

This is the minimum fix for the issue at hand AFAICS.

Thanks,
Rafael


Re: [PATCH v2] kernel/hung_task.c: disable on suspend

2018-10-16 Thread Rafael J. Wysocki
Hi,

Sorry for the delay here.

On Tuesday, September 25, 2018 2:16:36 PM CEST Vitaly Kuznetsov wrote:
> It is possible to observe hung_task complaints when system goes to
> suspend-to-idle state:
> 
>  PM: Syncing filesystems ... done.
>  Freezing user space processes ... (elapsed 0.001 seconds) done.
>  OOM killer disabled.
>  Freezing remaining freezable tasks ... (elapsed 0.002 seconds) done.
>  sd 0:0:0:0: [sda] Synchronizing SCSI cache
>  INFO: task bash:1569 blocked for more than 120 seconds.
>Not tainted 4.19.0-rc3_+ #687
>  "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
>  bashD0  1569604 0x
>  Call Trace:
>   ? __schedule+0x1fe/0x7e0
>   schedule+0x28/0x80
>   suspend_devices_and_enter+0x4ac/0x750
>   pm_suspend+0x2c0/0x310
> 
> The root cause of the issue is that under certain circumstances jiffies
> counter keeps advancing, some work to prevent that is currently ongoing.
> However, it seems that it would make sense to disable hung task detector
> on suspend and re-enable it on wakeup regardless.
> 
> Signed-off-by: Vitaly Kuznetsov 
> ---
> Changes since v1:
> - Implement detector disabling by zeroing timeout [Rafael J. Wysocki]
> ---
>  kernel/hung_task.c | 36 +++-
>  1 file changed, 35 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/hung_task.c b/kernel/hung_task.c
> index b9132d1269ef..ac6e8c9306bd 100644
> --- a/kernel/hung_task.c
> +++ b/kernel/hung_task.c
> @@ -15,6 +15,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -242,12 +243,14 @@ void reset_hung_task_detector(void)
>  }
>  EXPORT_SYMBOL_GPL(reset_hung_task_detector);
>  
> +static unsigned long hung_last_checked;
> +
>  /*
>   * kthread which checks for tasks stuck in D state
>   */
>  static int watchdog(void *dummy)
>  {
> - unsigned long hung_last_checked = jiffies;
> + hung_last_checked = jiffies;
>  
>   set_user_nice(current, 0);
>  
> @@ -272,9 +275,40 @@ static int watchdog(void *dummy)
>   return 0;
>  }
>  
> +static int hungtask_pm_notify(struct notifier_block *self,
> +   unsigned long action, void *hcpu)
> +{
> + static unsigned long saved_timeout, saved_interval;
> +
> + switch (action) {
> + case PM_SUSPEND_PREPARE:
> + case PM_HIBERNATION_PREPARE:
> + saved_timeout = sysctl_hung_task_timeout_secs;
> + saved_interval = sysctl_hung_task_check_interval_secs;
> + sysctl_hung_task_timeout_secs = 0;
> + sysctl_hung_task_check_interval_secs = 0;
> + wake_up_process(watchdog_task);
> + break;

AFAICS, this is racy (for example, it still is possible for user space
to update the sysctl_* values after you've set them to 0).  That can be
fixed, but I'm not sure it is worth the effort.

Since watchdog() is not expected to run very often in general, I think
I prefer your v1 after all, so please resend it.

Thanks,
Rafael



Re: [PATCH V4] cpufreq: imx6q: read OCOTP through nvmem for imx6ul/imx6ull

2018-10-16 Thread Rafael J. Wysocki
On Monday, October 8, 2018 8:15:08 AM CEST Viresh Kumar wrote:
> On 08-10-18, 14:07, Anson Huang wrote:
> > On i.MX6UL/i.MX6ULL, accessing OCOTP directly is wrong because
> > the ocotp clock needs to be enabled first. Add support for reading
> > OCOTP through the nvmem API, and keep the old method there to
> > support old dtb.
> > 
> > Signed-off-by: Anson Huang 
> > ---
> > changes since V3:
> > put node earlier to save one line code.
> >  drivers/cpufreq/imx6q-cpufreq.c | 52 
> > +++--
> >  1 file changed, 35 insertions(+), 17 deletions(-)
> > 
> > diff --git a/drivers/cpufreq/imx6q-cpufreq.c 
> > b/drivers/cpufreq/imx6q-cpufreq.c
> > index b2ff423..8cfee0a 100644
> > --- a/drivers/cpufreq/imx6q-cpufreq.c
> > +++ b/drivers/cpufreq/imx6q-cpufreq.c
> > @@ -12,6 +12,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -290,20 +291,32 @@ static void imx6q_opp_check_speed_grading(struct 
> > device *dev)
> >  #define OCOTP_CFG3_6ULL_SPEED_792MHZ   0x2
> >  #define OCOTP_CFG3_6ULL_SPEED_900MHZ   0x3
> >  
> > -static void imx6ul_opp_check_speed_grading(struct device *dev)
> > +static int imx6ul_opp_check_speed_grading(struct device *dev)
> >  {
> > -   struct device_node *np;
> > -   void __iomem *base;
> > u32 val;
> > +   int ret = 0;
> >  
> > -   np = of_find_compatible_node(NULL, NULL, "fsl,imx6ul-ocotp");
> > -   if (!np)
> > -   return;
> > +   if (of_find_property(dev->of_node, "nvmem-cells", NULL)) {
> > +   ret = nvmem_cell_read_u32(dev, "speed_grade", );
> > +   if (ret)
> > +   return ret;
> > +   } else {
> > +   struct device_node *np;
> > +   void __iomem *base;
> > +
> > +   np = of_find_compatible_node(NULL, NULL, "fsl,imx6ul-ocotp");
> > +   if (!np)
> > +   return -ENOENT;
> > +
> > +   base = of_iomap(np, 0);
> > +   of_node_put(np);
> > +   if (!base) {
> > +   dev_err(dev, "failed to map ocotp\n");
> > +   return -EFAULT;
> > +   }
> >  
> > -   base = of_iomap(np, 0);
> > -   if (!base) {
> > -   dev_err(dev, "failed to map ocotp\n");
> > -   goto put_node;
> > +   val = readl_relaxed(base + OCOTP_CFG3);
> > +   iounmap(base);
> > }
> >  
> > /*
> > @@ -314,7 +327,6 @@ static void imx6ul_opp_check_speed_grading(struct 
> > device *dev)
> >  * 2b'11: 9Hz on i.MX6ULL only;
> >  * We need to set the max speed of ARM according to fuse map.
> >  */
> > -   val = readl_relaxed(base + OCOTP_CFG3);
> > val >>= OCOTP_CFG3_SPEED_SHIFT;
> > val &= 0x3;
> >  
> > @@ -334,9 +346,7 @@ static void imx6ul_opp_check_speed_grading(struct 
> > device *dev)
> > dev_warn(dev, "failed to disable 900MHz OPP\n");
> > }
> >  
> > -   iounmap(base);
> > -put_node:
> > -   of_node_put(np);
> > +   return ret;
> >  }
> >  
> >  static int imx6q_cpufreq_probe(struct platform_device *pdev)
> > @@ -394,10 +404,18 @@ static int imx6q_cpufreq_probe(struct platform_device 
> > *pdev)
> > }
> >  
> > if (of_machine_is_compatible("fsl,imx6ul") ||
> > -   of_machine_is_compatible("fsl,imx6ull"))
> > -   imx6ul_opp_check_speed_grading(cpu_dev);
> > -   else
> > +   of_machine_is_compatible("fsl,imx6ull")) {
> > +   ret = imx6ul_opp_check_speed_grading(cpu_dev);
> > +   if (ret == -EPROBE_DEFER)
> > +   return ret;
> > +   if (ret) {
> > +   dev_err(cpu_dev, "failed to read ocotp: %d\n",
> > +   ret);
> > +   return ret;
> > +   }
> > +   } else {
> > imx6q_opp_check_speed_grading(cpu_dev);
> > +   }
> >  
> > /* Because we have added the OPPs here, we must free them */
> > free_opp = true;
> 
> Acked-by: Viresh Kumar 

Patch applied, thanks!



Re: [PATCH v2] cpufreq: dt-platdev: allow RK3399 to have separate tunables per cluster

2018-10-16 Thread Rafael J. Wysocki
On Monday, October 8, 2018 7:55:47 AM CEST Viresh Kumar wrote:
> On 05-10-18, 12:00, Dmitry Torokhov wrote:
> > RK3899 has one cluster with 4 small cores, and another one with 2 big
> > cores, with cores in different clusters having different OPPs and thus
> > needing separate set of tunables. Let's enable this via
> > "have_governor_per_policy" platform data.
> > 
> > Signed-off-by: Dmitry Torokhov 
> > ---
> > 
> > v2 changes: commit message updated.
> > 
> > Not tested, but we had a patch unconditionally enabling
> > CPUFREQ_HAVE_GOVERNOR_PER_POLICY flag in tree we used to ship devices
> > based on RK3399 platform.
> > 
> >  drivers/cpufreq/cpufreq-dt-platdev.c | 5 -
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cpufreq/cpufreq-dt-platdev.c 
> > b/drivers/cpufreq/cpufreq-dt-platdev.c
> > index fe14c57de6ca..040ec0f711f9 100644
> > --- a/drivers/cpufreq/cpufreq-dt-platdev.c
> > +++ b/drivers/cpufreq/cpufreq-dt-platdev.c
> > @@ -78,7 +78,10 @@ static const struct of_device_id whitelist[] __initconst 
> > = {
> > { .compatible = "rockchip,rk3328", },
> > { .compatible = "rockchip,rk3366", },
> > { .compatible = "rockchip,rk3368", },
> > -   { .compatible = "rockchip,rk3399", },
> > +   { .compatible = "rockchip,rk3399",
> > + .data = &(struct cpufreq_dt_platform_data)
> > +   { .have_governor_per_policy = true, },
> > +   },
> >  
> > { .compatible = "st-ericsson,u8500", },
> > { .compatible = "st-ericsson,u8540", },
> 
> Acked-by: Viresh Kumar 

Patch applied, thanks!



Re: [PATCH] ACPI: remove redundant 'default n' from Kconfig

2018-10-16 Thread Rafael J. Wysocki
On Wednesday, October 10, 2018 4:41:30 PM CEST Bartlomiej Zolnierkiewicz wrote:
> 'default n' is the default value for any bool or tristate Kconfig
> setting so there is no need to write it explicitly.
> 
> Also since commit f467c5640c29 ("kconfig: only write '# CONFIG_FOO
> is not set' for visible symbols") the Kconfig behavior is the same
> regardless of 'default n' being present or not:
> 
> ...
> One side effect of (and the main motivation for) this change is making
> the following two definitions behave exactly the same:
> 
> config FOO
> bool
> 
> config FOO
> bool
> default n
> 
> With this change, neither of these will generate a
> '# CONFIG_FOO is not set' line (assuming FOO isn't selected/implied).
> That might make it clearer to people that a bare 'default n' is
> redundant.
> ...
> 
> Signed-off-by: Bartlomiej Zolnierkiewicz 
> ---
>  drivers/acpi/Kconfig |6 --
>  1 file changed, 6 deletions(-)
> 
> Index: b/drivers/acpi/Kconfig
> ===
> --- a/drivers/acpi/Kconfig2018-09-26 15:54:31.942819766 +0200
> +++ b/drivers/acpi/Kconfig2018-10-10 16:38:09.691897907 +0200
> @@ -138,7 +138,6 @@ config ACPI_REV_OVERRIDE_POSSIBLE
>  
>  config ACPI_EC_DEBUGFS
>   tristate "EC read/write access through /sys/kernel/debug/ec"
> - default n
>   help
> Say N to disable Embedded Controller /sys/kernel/debug interface
>  
> @@ -283,7 +282,6 @@ config ACPI_PROCESSOR
>  config ACPI_IPMI
>   tristate "IPMI"
>   depends on IPMI_HANDLER
> - default n
>   help
> This driver enables the ACPI to access the BMC controller. And it
> uses the IPMI request/response message to communicate with BMC
> @@ -361,7 +359,6 @@ config ACPI_TABLE_UPGRADE
>  
>  config ACPI_DEBUG
>   bool "Debug Statements"
> - default n
>   help
> The ACPI subsystem can produce debug output.  Saying Y enables this
> output and increases the kernel size by around 50K.
> @@ -374,7 +371,6 @@ config ACPI_DEBUG
>  config ACPI_PCI_SLOT
>   bool "PCI slot detection driver"
>   depends on SYSFS
> - default n
>   help
> This driver creates entries in /sys/bus/pci/slots/ for all PCI
> slots in the system.  This can help correlate PCI bus addresses,
> @@ -436,7 +432,6 @@ config ACPI_HED
>  config ACPI_CUSTOM_METHOD
>   tristate "Allow ACPI methods to be inserted/replaced at run time"
>   depends on DEBUG_FS
> - default n
>   help
> This debug facility allows ACPI AML methods to be inserted and/or
> replaced without rebooting the system. For details refer to:
> @@ -481,7 +476,6 @@ config ACPI_EXTLOG
>   tristate "Extended Error Log support"
>   depends on X86_MCE && X86_LOCAL_APIC && EDAC
>   select UEFI_CPER
> - default n
>   help
> Certain usages such as Predictive Failure Analysis (PFA) require
> more information about the error than what can be described in
> 

Applied, thanks!




Re: [RFC PATCH 0/5] device property: Introducing software nodes

2018-10-16 Thread Rafael J. Wysocki
On Tue, Oct 16, 2018 at 10:40 AM Heikki Krogerus
 wrote:
>
> On Tue, Oct 16, 2018 at 09:36:33AM +0200, Rafael J. Wysocki wrote:
> > On Tue, Oct 16, 2018 at 9:35 AM Linus Walleij  
> > wrote:
> > >
> > > On Fri, Oct 12, 2018 at 1:39 PM Heikki Krogerus
> > >  wrote:
> > >
> > > > To continue the discussion started by Dmitry [1], this is my proposal
> > > > that I mentioned in my last mail. In short, the idea is that instead
> > > > of trying to extend the support for the currently used struct
> > > > property_set, I'm proposing that we introduce a completely new,
> > > > independent type of fwnode, and replace the struct property_set with
> > > > it. I'm calling the type "software node" here.
> > >
> > > I'm a big fan of this approach.
> > > Acked-by: Linus Walleij 
> > > for all patches.
> > >
> > > I don't know who can finally review and merge this though,
> > > I guess Rafael?
> >
> > Yes, that would be me. :-)
> >
> > I no one speaks up against them, I'll pick them up.
>
> Let me send a final version of these.
>
> I need to add one more patch to the series where I remove an extra
> device_remove_properties() call from platform_device_del().
>
> It's unnecessary in any case as device_del() calls
> device_remove_properties() for every device, but as the properties are
> removed there before the device is removed, we're unable to deduct
> the final ref count in the "remove" platform notification since our
> node is no longer bind to the device.

OK, I'll wait for an update, then.

Thanks,
Rafael


Re: [GIT PULL] cpupower update for Linux 4.20-rc1

2018-10-16 Thread Rafael J. Wysocki
Hi Shuah,

On Mon, Oct 15, 2018 at 9:21 PM Shuah Khan  wrote:
>
> Hi Rafael,
>
> Please pull the following cpupower update for Linux 4.20-rc1
>
> This cpupower update for Linux 4.20-rc1 consists of fixes for bugs and
> compile warnings from Prarit Bhargava and Anders Roxell.
>
> diff is attached.

Pulled, thanks!


  1   2   3   4   5   6   7   8   9   10   >