Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Tue, Jul 19, 2016 at 02:22:36PM -0400, Vince Weaver wrote: > On Fri, 17 Jun 2016, Huang Rui wrote: > > > On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote: > > > On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote: > > > > I was told this feature would be supported on fam15h 60h, 70h and > > > > later processors before. Just checked the fam16h model 30h BKDG, yes, > > > > it should be also supported. But I didn't test that platform, if you > > > > confirm it works in your side. We can enable it. > > > > > > You might want to ask around first whether F16M30's acc power machinery > > > is even usable? I.e., no errata and whatnot... > > > > > > > Yep, I already asked the designers, and was waiting for their > > feedbacks. > > Was there ever any feedback about any of the problems encountered with AMD > APM support? > Vince, apology to late response. We are drafting the erratum for this feature. Will let you know if it is public. Thanks, Rui
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Fri, 17 Jun 2016, Huang Rui wrote: > On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote: > > On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote: > > > I was told this feature would be supported on fam15h 60h, 70h and > > > later processors before. Just checked the fam16h model 30h BKDG, yes, > > > it should be also supported. But I didn't test that platform, if you > > > confirm it works in your side. We can enable it. > > > > You might want to ask around first whether F16M30's acc power machinery > > is even usable? I.e., no errata and whatnot... > > > > Yep, I already asked the designers, and was waiting for their > feedbacks. Was there ever any feedback about any of the problems encountered with AMD APM support? Thanks, Vince
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Fri, 17 Jun 2016, Huang Rui wrote: > Can you try to read the MSR value two times with the same core > (rdmsrl_on_cpu)? I'm reading from userspace using the /dev/cpu/0/msr device so it should always be reading from cpu0. I guess I could code up a custom kernel module to debug this if necessary. It does look that for some reason the 0xc0010280 MSR is only returning the lower 24 bits of the PTSC, rather than the 40 bits that cpuid 8008:ecx seems to think it should have. Vince
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, Jun 16, 2016 at 04:44:20PM -0400, Vince Weaver wrote: > On Thu, 16 Jun 2016, Huang Rui wrote: > > > > 1. In theory this should also work on an amd fam16h model 30h > > > processor too, correct? The current code limits things to fam15h > > > even though the fam16mod30h has all the proper cpuid flags. > > > > > > > I was told this feature would be supported on fam15h 60h, 70h and > > later processors before. Just checked the fam16h model 30h BKDG, yes, > > it should be also supported. But I didn't test that platform, if you > > confirm it works in your side. We can enable it. > > I can confirm I get power readings on my fam16hmod30h board once I apply a > trivial patch to the driver. I'll send the patch in a separate e-mail. > OK, thanks. > > PTSC's frequency is about 100Mhz, it shouldn't be overflow. > > That's what I thought. I'm trying to read the value using the /dev/msr > interface from userspace and I get weird results. > > i.e.: > Jx: read 62d299b84 > PTSC MSR: read 72fe92 > > sleep 5ms > > Jy: read 631b453b9 > PTSC MSR: read 46b25 > > this happens about half the time (PTSC going backwards). Though > admittedly the problem could somehow be in the MSR code I'm using. > Can you try to read the MSR value two times with the same core (rdmsrl_on_cpu)? Thanks, Rui
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, Jun 16, 2016 at 06:47:00PM +0200, Borislav Petkov wrote: > On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote: > > I was told this feature would be supported on fam15h 60h, 70h and > > later processors before. Just checked the fam16h model 30h BKDG, yes, > > it should be also supported. But I didn't test that platform, if you > > confirm it works in your side. We can enable it. > > You might want to ask around first whether F16M30's acc power machinery > is even usable? I.e., no errata and whatnot... > Yep, I already asked the designers, and was waiting for their feedbacks. Thanks, Rui
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, Jun 16, 2016 at 05:16:04PM -0400, Vince Weaver wrote: > I'd believe the 6W report as a value for how much the CPU is using. > The others seem spurious. I guess I should go check the Errata for > this chip. Maybe this is the reason why it got enabled on F15 only :-) -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, 16 Jun 2016, Vince Weaver wrote: One more followup, if I run the benchmark a bunch of times I get this: 4,472,401.06 mWatts power/power-pkg/ 50,886,303.28 mWatts power/power-pkg/ 81,737,001.44 mWatts power/power-pkg/ 6,525.89 mWatts power/power-pkg/ 6,522.04 mWatts power/power-pkg/ 6,505.68 mWatts power/power-pkg/ 4,938,855.83 mWatts power/power-pkg/ 4,614,620.11 mWatts power/power-pkg/ 79,679,069.41 mWatts power/power-pkg/ 152,794,060.83 mWatts power/power-pkg/ 3,942,429.02 mWatts power/power-pkg/ 6,506.73 mWatts power/power-pkg/ 60,198,884.39 mWatts power/power-pkg/ I'd believe the 6W report as a value for how much the CPU is using. The others seem spurious. I guess I should go check the Errata for this chip. Vince
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, 16 Jun 2016, Huang Rui wrote: > On Thu, Jun 16, 2016 at 01:38:13PM +0800, Huang Rui wrote: > > After considering carefully, the original method should be OK. > > AMD nomenclature for CMT systems: > > [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 > -> [Compute Unit Core 1] -> Linux CPU 1 > -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 > -> [Compute Unit Core 1] -> Linux CPU 3 > > The deltaN is power per compute unit. Current one package has two CUs. > In the *same* interval, CU0's power is 10W, CU1's power is 15W. The > package (CU0 + CU1) power should be 15W, right? Because the interval > is the same. > > Q = Q1 + Q2. P = Q/t = (Q1 + Q2)/t = Q1/t + Q2/t = P1 + P2. > > Is that clear? OK, I was misunderstanding. I somehow thought there was a periodic timer that was adding accumulating power over time. But no, the driver just assumes the PTSC does not overflow? And that addition is just there to handle adding all the cores together? If so, then I agree that the addition makes sense, sorry for confusing things. Although I think it would be better if we reported Joules (like RAPL does) rather than average power, but too late to change that now. Also, on my machine I get results that make no physical sense, such as: sudo perf stat -a -e power/power-pkg/ /usr/games/primes 1 5 > /dev/null Performance counter stats for 'system wide': 4,472,401.06 mWatts power/power-pkg/ 6.956135769 seconds time elapsed I somehow don't think the CPU is really burning 4kW of Power. Vince
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, 16 Jun 2016, Huang Rui wrote: > > 1. In theory this should also work on an amd fam16h model 30h > > processor too, correct? The current code limits things to fam15h > > even though the fam16mod30h has all the proper cpuid flags. > > > > I was told this feature would be supported on fam15h 60h, 70h and > later processors before. Just checked the fam16h model 30h BKDG, yes, > it should be also supported. But I didn't test that platform, if you > confirm it works in your side. We can enable it. I can confirm I get power readings on my fam16hmod30h board once I apply a trivial patch to the driver. I'll send the patch in a separate e-mail. > PTSC's frequency is about 100Mhz, it shouldn't be overflow. That's what I thought. I'm trying to read the value using the /dev/msr interface from userspace and I get weird results. i.e.: Jx: read 62d299b84 PTSC MSR: read 72fe92 sleep 5ms Jy: read 631b453b9 PTSC MSR: read 46b25 this happens about half the time (PTSC going backwards). Though admittedly the problem could somehow be in the MSR code I'm using. > mWatts are for processor power not system power. Below data is > calculated on fam15h model 60h which is low power platform. Even > though the method has a minor mistake, the processor power should be > in mWatts field. I have an actual wall-mounted power meter hooked up to my system and the difference from idle to all-cores-busy is 20W, so I would think that that the results we find with perf should be >1W at least. Vince
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, Jun 16, 2016 at 01:38:14PM +0800, Huang Rui wrote: > I was told this feature would be supported on fam15h 60h, 70h and > later processors before. Just checked the fam16h model 30h BKDG, yes, > it should be also supported. But I didn't test that platform, if you > confirm it works in your side. We can enable it. You might want to ask around first whether F16M30's acc power machinery is even usable? I.e., no errata and whatnot... -- Regards/Gruss, Boris. ECO tip #101: Trim your mails when you reply.
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
Hi Vince, Thanks for asking. On Wed, Jun 15, 2016 at 09:13:59PM -0400, Vince Weaver wrote: > > three questions about this functionality: > > 1. In theory this should also work on an amd fam16h model 30h > processor too, correct? The current code limits things to fam15h > even though the fam16mod30h has all the proper cpuid flags. > I was told this feature would be supported on fam15h 60h, 70h and later processors before. Just checked the fam16h model 30h BKDG, yes, it should be also supported. But I didn't test that platform, if you confirm it works in your side. We can enable it. > I've tested the functionality a bit and it seems to work but for > some reason the ptsc seems to occasionally count backwards > on my machine. Any reason that would be? (It doesn't seem to be > an overflow, just reading the ptsc 5ms apart and the values are > slightly lower after than before). > PTSC's frequency is about 100Mhz, it shouldn't be overflow. > 2. Unless I'm misunderstanding things, the code seems to be accumulating > Power. (see chunk below) Power is an instantaneous measurement, it > makes no sense to add values. If you use 5W for 1ms and 10W for > 1ms, the average power across the 2ms interval is not 15W. > > You can add energy, but not power. > > > + delta *= cpu_pwr_sample_ratio * 1000; > > + tdelta = new_ptsc - prev_ptsc; > > + > > + do_div(delta, tdelta); > > + local64_add(delta, &event->count); > You're right. Nice catch! The average power is per compute unit. We cannot add the power simplely for each processor/package. So here, the average power per package should be (delta1 + delta2 + ... + deltaN)/(tdelta_avg). I will work out a fix. Thanks to point out. > 3. The actual results gathered seem rediculously low. 341 seconds of > calculation and only using 183 mWatts of power? > mWatts are for processor power not system power. Below data is calculated on fam15h model 60h which is low power platform. Even though the method has a minor mistake, the processor power should be in mWatts field. > >Performance counter stats for 'system wide': > > > > 183.44 mWatts power/power-pkg/ > > > >341.837270111 seconds time elapsed > > > > root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' > > sleep 10 > Thanks, Rui
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
On Thu, Jun 16, 2016 at 01:38:13PM +0800, Huang Rui wrote: > On Wed, Jun 15, 2016 at 09:13:59PM -0400, Vince Weaver wrote: > > > > 2. Unless I'm misunderstanding things, the code seems to be accumulating > > Power. (see chunk below) Power is an instantaneous measurement, it > > makes no sense to add values. If you use 5W for 1ms and 10W for > > 1ms, the average power across the 2ms interval is not 15W. > > > > You can add energy, but not power. > > > > > + delta *= cpu_pwr_sample_ratio * 1000; > > > + tdelta = new_ptsc - prev_ptsc; > > > + > > > + do_div(delta, tdelta); > > > + local64_add(delta, &event->count); > > > > You're right. Nice catch! The average power is per compute unit. We > cannot add the power simplely for each processor/package. > > So here, the average power per package should be (delta1 + delta2 + ... + > deltaN)/(tdelta_avg). > I will work out a fix. Thanks to point out. > After considering carefully, the original method should be OK. AMD nomenclature for CMT systems: [node 0] -> [Compute Unit 0] -> [Compute Unit Core 0] -> Linux CPU 0 -> [Compute Unit Core 1] -> Linux CPU 1 -> [Compute Unit 1] -> [Compute Unit Core 0] -> Linux CPU 2 -> [Compute Unit Core 1] -> Linux CPU 3 The deltaN is power per compute unit. Current one package has two CUs. In the *same* interval, CU0's power is 10W, CU1's power is 15W. The package (CU0 + CU1) power should be 15W, right? Because the interval is the same. Q = Q1 + Q2. P = Q/t = (Q1 + Q2)/t = Q1/t + Q2/t = P1 + P2. Is that clear? Thanks, Rui
Re: [REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
three questions about this functionality: 1. In theory this should also work on an amd fam16h model 30h processor too, correct? The current code limits things to fam15h even though the fam16mod30h has all the proper cpuid flags. I've tested the functionality a bit and it seems to work but for some reason the ptsc seems to occasionally count backwards on my machine. Any reason that would be? (It doesn't seem to be an overflow, just reading the ptsc 5ms apart and the values are slightly lower after than before). 2. Unless I'm misunderstanding things, the code seems to be accumulating Power. (see chunk below) Power is an instantaneous measurement, it makes no sense to add values. If you use 5W for 1ms and 10W for 1ms, the average power across the 2ms interval is not 15W. You can add energy, but not power. > + delta *= cpu_pwr_sample_ratio * 1000; > + tdelta = new_ptsc - prev_ptsc; > + > + do_div(delta, tdelta); > + local64_add(delta, &event->count); 3. The actual results gathered seem rediculously low. 341 seconds of calculation and only using 183 mWatts of power? >Performance counter stats for 'system wide': > > 183.44 mWatts power/power-pkg/ > >341.837270111 seconds time elapsed > > root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' > sleep 10 Vince
[REDO PATCH v7] perf/x86/amd/power: Add AMD accumulated power reporting mechanism
Introduce an AMD accumlated power reporting mechanism for the Family 15h, Model 60h processor that can be used to calculate the average power consumed by a processor during a measurement interval. The feature support is indicated by CPUID Fn8000_0007_EDX[12]. This feature will be implemented both in hwmon and perf. The current design provides one event to report per package/processor power consumption by counting each compute unit power value. Here the gory details of how the computation is done: - * Tsample: compute unit power accumulator sample period * Tref: the PTSC counter period (PTSC: performance timestamp counter) * N: the ratio of compute unit power accumulator sample period to the PTSC period * Jmax: max compute unit accumulated power which is indicated by MSR_C001007b[MaxCpuSwPwrAcc] * Jx/Jy: compute unit accumulated power which is indicated by MSR_C001007a[CpuSwPwrAcc] * Tx/Ty: the value of performance timestamp counter which is indicated by CU_PTSC MSR_C0010280[PTSC] * PwrCPUave: CPU average power i. Determine the ratio of Tsample to Tref by executing CPUID Fn8000_0007. N = value of CPUID Fn8000_0007_ECX[CpuPwrSampleTimeRatio[15:0]]. ii. Read the full range of the cumulative energy value from the new MSR MaxCpuSwPwrAcc. Jmax = value returned. iii. At time x, software reads CpuSwPwrAcc and samples the PTSC. Jx = value read from CpuSwPwrAcc and Tx = value read from PTSC. iv. At time y, software reads CpuSwPwrAcc and samples the PTSC. Jy = value read from CpuSwPwrAcc and Ty = value read from PTSC. v. Calculate the average power consumption for a compute unit over time period (y-x). Unit of result is uWatt: if (Jy < Jx) // Rollover has occurred Jdelta = (Jy + Jmax) - Jx else Jdelta = Jy - Jx PwrCPUave = N * Jdelta * 1000 / (Ty - Tx) -- Simple example: root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' make -j4 CHK include/config/kernel.release CHK include/generated/uapi/linux/version.h CHK include/generated/utsrelease.h CHK include/generated/timeconst.h CHK include/generated/bounds.h CHK include/generated/asm-offsets.h CALLscripts/checksyscalls.sh CHK include/generated/compile.h SKIPPED include/generated/compile.h Building modules, stage 2. Kernel: arch/x86/boot/bzImage is ready (#40) MODPOST 4225 modules Performance counter stats for 'system wide': 183.44 mWatts power/power-pkg/ 341.837270111 seconds time elapsed root@hr-zp:/home/ray/tip# ./tools/perf/perf stat -a -e 'power/power-pkg/' sleep 10 Performance counter stats for 'system wide': 0.18 mWatts power/power-pkg/ 10.012551815 seconds time elapsed Suggested-by: Peter Zijlstra Suggested-by: Ingo Molnar Suggested-by: Borislav Petkov Reviewed-by: Thomas Gleixner Signed-off-by: Huang Rui Cc: Guenter Roeck --- Hi Boris, I already redo this patch based on tip/master, it depends on some previous patches you applied before. If you need me to send them again, please let me know. Thanks, Rui --- arch/x86/Kconfig| 9 ++ arch/x86/events/Makefile| 1 + arch/x86/events/amd/power.c | 353 include/linux/perf_event.h | 4 + 4 files changed, 367 insertions(+) create mode 100644 arch/x86/events/amd/power.c diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 3e61672..52ef30d 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -1210,6 +1210,15 @@ config MICROCODE_OLD_INTERFACE def_bool y depends on MICROCODE +config PERF_EVENTS_AMD_POWER + depends on PERF_EVENTS && CPU_SUP_AMD + tristate "AMD Processor Power Reporting Mechanism" + ---help--- + Provide power reporting mechanism support for AMD processors. + Currently, it leverages X86_FEATURE_ACC_POWER + (CPUID Fn8000_0007_EDX[12]) interface to calculate the + average power consumption on Family 15h processors. + config X86_MSR tristate "/dev/cpu/*/msr - Model-specific register support" ---help--- diff --git a/arch/x86/events/Makefile b/arch/x86/events/Makefile index fdfea15..f59618a 100644 --- a/arch/x86/events/Makefile +++ b/arch/x86/events/Makefile @@ -1,6 +1,7 @@ obj-y += core.o obj-$(CONFIG_CPU_SUP_AMD) += amd/core.o amd/uncore.o +obj-$(CONFIG_PERF_EVENTS_AMD_POWER)+= amd/power.o obj-$(CONFIG_X86_LOCAL_APIC)+= amd/ibs.o msr.o ifdef CONFIG_AMD_IOMMU obj-$(CONFIG_CPU_SUP_AMD) += amd/iommu.o diff --git a/arch/x86/events/amd/power.c b/arch/x86/events/amd/power.c new file mode 100644 index 000..55a3529 --- /dev/null +++ b/arch/x86/events/amd/power.c @@ -0,0 +1,353 @