from:"Stewart Smith"

Re: [PATCH] powerpc/xive: Update comment referencing magic loads from an ESB

2019-08-01 Thread Stewart Smith

Jordan Niethe  writes:
> The comment above xive_esb_read() references magic loads from an ESB as
> described xive.h. This has been inaccurate since commit 12c1f339cd49
> ("powerpc/xive: Move definition of ESB bits") which moved the
> description. Update the comment to reference the new location of the
> description in xive-regs.h
>
> Signed-off-by: Jordan Niethe 

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] crypto: nx: nx-842-powernv: Add of_node_put() before return

2019-07-24 Thread Stewart Smith

Nishka Dasgupta  writes:
> Each iteration of for_each_child_of_node puts the previous node, but

This is for_each_compatible_node.

otherwise looks okay,

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v3 01/16] powerpc/fadump: move internal fadump code to a new file

2019-06-27 Thread Stewart Smith

Hari Bathini  writes:
> diff --git a/arch/powerpc/kernel/fadump-common.c 
> b/arch/powerpc/kernel/fadump-common.c
> new file mode 100644
> index 000..0182886
> --- /dev/null
> +++ b/arch/powerpc/kernel/fadump-common.c
> @@ -0,0 +1,184 @@
> +/*
> + * Firmware-Assisted Dump internal code.
> + *
> + * Copyright 2018-2019, IBM Corp.
> + * Author: Hari Bathini 
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */

This file takes a bunch of code from fadump.c, which has the (C) header
showing (C) 2011, and author of Mahesh. We should probably preserve that


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc: Document xive=off option

2019-06-24 Thread Stewart Smith

Michael Neuling  writes:
> commit 243e25112d06 ("powerpc/xive: Native exploitation of the XIVE
> interrupt controller") added an option to turn off Linux native XIVE
> usage via the xive=off kernel command line option.
>
> This documents this option.
>
> Signed-off-by: Michael Neuling 

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: crash after NX error

2019-06-05 Thread Stewart Smith

Michael Ellerman  writes:
> Stewart Smith  writes:
>> On my two socket POWER9 system (powernv) with 842 zwap set up, I
>> recently got a crash with the Ubuntu kernel (I haven't tried with
>> upstream, and this is the first time the system has died like this, so
>> I'm not sure how repeatable it is).
>>
>> [2.891463] zswap: loaded using pool 842-nx/zbud
>> ...
>> [15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
>> us, giving up : 00 00 00 00 
>> [16868.932913] Unable to handle kernel paging request for data at address 
>> 0x6655f67da816cdb8
>> [16868.933726] Faulting instruction address: 0xc0391600
>>
>>
>> cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
>> pc: c0391600: kmem_cache_alloc+0x2e0/0x340
>> lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
>> sp: c01c9d98bc20
>>msr: 9280b033
>>dar: 6655f67da816cdb8
>>   current = 0xc01ad43cb400
>>   paca= 0xcfac7800   softe: 0irq_happened: 0x01
>> pid   = 8319, comm = make
>> Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 
>> 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 
>> (Ubuntu 4.15.0-50.54-generic 4.15.18)
>>
>> 68:mon> t
>> [c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
>> [c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
>> [c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
>> [c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
>> [c01c9d98be30] c000b584 ppc_clone+0x8/0xc
>> --- Exception: c00 (System Call) at 7afe9daf87f4
>> SP (7fffca606880) is in userspace
>>
>> So, it looks like there could be a problem in the error path, plausibly
>> fixed by this patch:
>>
>> commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> Author: Haren Myneni 
>> Date:   Wed Jun 13 00:32:40 2018 -0700
>>
>> crypto/nx: Initialize 842 high and normal RxFIFO control registers
>> 
>> NX increments readOffset by FIFO size in receive FIFO control register
>> when CRB is read. But the index in RxFIFO has to match with the
>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>> may be processing incorrect CRBs and can cause CRB timeout.
>> 
>> VAS FIFO offset is 0 when the receive window is opened during
>> initialization. When the module is reloaded or in kexec boot, readOffset
>> in FIFO control register may not match with VAS entry. This patch adds
>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>> control register for both high and normal FIFOs.
>> 
>> Signed-off-by: Haren Myneni 
>> [mpe: Fixup uninitialized variable warning]
>> Signed-off-by: Michael Ellerman 
>>
>> $ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
>> v4.19-rc1~24^2~50
>>
>>
>> Which was never backported to any stable release, so probably needs to
>> be for v4.14 through v4.18.
>
> Yeah the P9 NX support went in in:
>   b0d6c9bab5e4 ("crypto/nx: Add P9 NX support for 842 compression engine")
>
> Which was: v4.14-rc1~119^2~21, so first released in v4.14.
>
>
> I'm actually less interested in that and more interested in the
> subsequent crash. The time stamps are miles apart though, did we just
> leave some corrupted memory after the NX failed and then hit it later?
> Or did we not correctly signal to the upper level APIs that the request
> failed.
>
> I think we need to do some testing with errors injected into the
> wait_for_csb() path, to ensure that failures there are not causing
> corrupting in zswap. Haren have you done any testing of error
> injection?

So, things died pretty heavily overnight (requiring e2fsck) with a *lot*
of those wait_for_csb() errors in the log.

It certainly *looks* like there's corruption around, as one of the CI
jobs that failed around that time got "internal compiler error" which is
usually a good sign that things have gone poorly somewhere.

-- 
Stewart Smith
OPAL Architect, IBM.

crash after NX error

2019-06-03 Thread Stewart Smith

On my two socket POWER9 system (powernv) with 842 zwap set up, I
recently got a crash with the Ubuntu kernel (I haven't tried with
upstream, and this is the first time the system has died like this, so
I'm not sure how repeatable it is).

[2.891463] zswap: loaded using pool 842-nx/zbud
...
[15626.124646] nx_compress_powernv: ERROR: CSB still not valid after 500 
us, giving up : 00 00 00 00 
[16868.932913] Unable to handle kernel paging request for data at address 
0x6655f67da816cdb8
[16868.933726] Faulting instruction address: 0xc0391600


cpu 0x68: Vector: 380 (Data Access Out of Range) at [c01c9d98b9a0]
pc: c0391600: kmem_cache_alloc+0x2e0/0x340
lr: c03915ec: kmem_cache_alloc+0x2cc/0x340
sp: c01c9d98bc20
   msr: 9280b033
   dar: 6655f67da816cdb8
  current = 0xc01ad43cb400
  paca= 0xcfac7800   softe: 0irq_happened: 0x01
pid   = 8319, comm = make
Linux version 4.15.0-50-generic (buildd@bos02-ppc64el-006) (gcc version 7.3.0 
(Ubuntu 7.3.0-16ubuntu3)) #54-Ubuntu SMP Mon May 6 18:55:18 UTC 2019 (Ubuntu 
4.15.0-50.54-generic 4.15.18)

68:mon> t
[c01c9d98bc20] c03914d4 kmem_cache_alloc+0x1b4/0x340 (unreliable)
[c01c9d98bc80] c03b1e14 __khugepaged_enter+0x54/0x220
[c01c9d98bcc0] c010f0ec copy_process.isra.5.part.6+0xebc/0x1a10
[c01c9d98bda0] c010fe4c _do_fork+0xec/0x510
[c01c9d98be30] c000b584 ppc_clone+0x8/0xc
--- Exception: c00 (System Call) at 7afe9daf87f4
SP (7fffca606880) is in userspace

So, it looks like there could be a problem in the error path, plausibly
fixed by this patch:

commit 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
Author: Haren Myneni 
Date:   Wed Jun 13 00:32:40 2018 -0700

crypto/nx: Initialize 842 high and normal RxFIFO control registers

NX increments readOffset by FIFO size in receive FIFO control register
when CRB is read. But the index in RxFIFO has to match with the
corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
may be processing incorrect CRBs and can cause CRB timeout.

VAS FIFO offset is 0 when the receive window is opened during
initialization. When the module is reloaded or in kexec boot, readOffset
in FIFO control register may not match with VAS entry. This patch adds
nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
control register for both high and normal FIFOs.

Signed-off-by: Haren Myneni 
[mpe: Fixup uninitialized variable warning]
Signed-off-by: Michael Ellerman 

$ git describe --contains 656ecc16e8fc2ab44b3d70e3fcc197a7020d0ca5
v4.19-rc1~24^2~50


Which was never backported to any stable release, so probably needs to
be for v4.14 through v4.18. Notably, Ubuntu is on v4.15 and it doesn't
seem to have picked up the patch. I'm opening an Ubuntu bug for this.

Haren, is this something you can drive through the stable process
(assuming my above crash looks like this failure)?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [TRIVIAL] [PATCH] powerpc/powernv-eeh: Consisely desribe what this file does

2019-05-27 Thread Stewart Smith

Oliver  writes:

> On Tue, May 28, 2019 at 1:29 PM Stewart Smith  wrote:
>>
>> If the previous comment made sense, continue debugging or call your
>> doctor immediately.
>>
>> Signed-off-by: Stewart Smith 
>> ---
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 +---
>>  1 file changed, 1 insertion(+), 3 deletions(-)
>>
>> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
>> b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> index f38078976c5d..bea6708be065 100644
>> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
>> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
>> @@ -1,7 +1,5 @@
>>  /*
>> - * The file intends to implement the platform dependent EEH operations on
>> - * powernv platform. Actually, the powernv was created in order to fully
>> - * hypervisor support.
>> + * PowerNV Platform dependent EEH operations
>>   *
>>   * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2013.
>
> Stewart, Thanks for fixing it up. Since you're at it, Please replace
> the maintainer to yourself.

This message intends to implement the middle raising operations on the
finger platform. Actually, the EEH was created in order to fully
phalange extension.

:)

-- 
Stewart Smith
OPAL Architect, IBM.

[TRIVIAL] [PATCH] powerpc/powernv-eeh: Consisely desribe what this file does

2019-05-27 Thread Stewart Smith

If the previous comment made sense, continue debugging or call your
doctor immediately.

Signed-off-by: Stewart Smith 
---
 arch/powerpc/platforms/powernv/eeh-powernv.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index f38078976c5d..bea6708be065 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -1,7 +1,5 @@
 /*
- * The file intends to implement the platform dependent EEH operations on
- * powernv platform. Actually, the powernv was created in order to fully
- * hypervisor support.
+ * PowerNV Platform dependent EEH operations
  *
  * Copyright Benjamin Herrenschmidt & Gavin Shan, IBM Corporation 2013.
  *
-- 
2.21.0

[PATCH] powerpc/powernv: Update firmware archaeology around OPAL_HANDLE_HMI

2019-05-23 Thread Stewart Smith

The first machines to ship with OPAL firmware all got firmware updates
that have the new call, but just in case someone is foolish enough to
believe the first 4 months of firmware is the best, we keep this code
around.

Comment is updated to not refer to late 2014 as recent or the future.

Signed-off-by: Stewart Smith 
---
 arch/powerpc/platforms/powernv/opal.c | 23 +++
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal.c 
b/arch/powerpc/platforms/powernv/opal.c
index f2b063b027f0..89b6ddc3ed38 100644
--- a/arch/powerpc/platforms/powernv/opal.c
+++ b/arch/powerpc/platforms/powernv/opal.c
@@ -206,16 +206,18 @@ static int __init opal_register_exception_handlers(void)
glue = 0x7000;
 
/*
-* Check if we are running on newer firmware that exports
-* OPAL_HANDLE_HMI token. If yes, then don't ask OPAL to patch
-* the HMI interrupt and we catch it directly in Linux.
+* Only ancient OPAL firmware requires this.
+* Specifically, firmware from FW810.00 (released June 2014)
+* through FW810.20 (Released October 2014).
 *
-* For older firmware (i.e currently released POWER8 System Firmware
-* as of today <= SV810_087), we fallback to old behavior and let OPAL
-* patch the HMI vector and handle it inside OPAL firmware.
+* Check if we are running on newer (post Oct 2014) firmware that
+* exports the OPAL_HANDLE_HMI token. If yes, then don't ask OPAL to
+* patch the HMI interrupt and we catch it directly in Linux.
 *
-* For newer firmware (in development/yet to be released) we will
-* start catching/handling HMI directly in Linux.
+* For older firmware (i.e < FW810.20), we fallback to old behavior and
+* let OPAL patch the HMI vector and handle it inside OPAL firmware.
+*
+* For newer firmware we catch/handle the HMI directly in Linux.
 */
if (!opal_check_token(OPAL_HANDLE_HMI)) {
pr_info("Old firmware detected, OPAL handles HMIs.\n");
@@ -225,6 +227,11 @@ static int __init opal_register_exception_handlers(void)
glue += 128;
}
 
+   /*
+* Only applicable to ancient firmware, all modern
+* (post March 2015/skiboot 5.0) firmware will just return
+* OPAL_UNSUPPORTED.
+*/
opal_register_exception_handler(OPAL_SOFTPATCH_HANDLER, 0, glue);
 #endif
 
-- 
2.21.0

Re: [PATCH kernel] prom_init: Fetch flatten device tree from the system firmware

2019-05-02 Thread Stewart Smith

David Gibson  writes:
> On Wed, May 01, 2019 at 01:42:21PM +1000, Alexey Kardashevskiy wrote:
>> At the moment, on 256CPU + 256 PCI devices guest, it takes the guest
>> about 8.5sec to fetch the entire device tree via the client interface
>> as the DT is traversed twice - for strings blob and for struct blob.
>> Also, "getprop" is quite slow too as SLOF stores properties in a linked
>> list.
>> 
>> However, since [1] SLOF builds flattened device tree (FDT) for another
>> purpose. [2] adds a new "fdt-fetch" client interface for the OS to fetch
>> the FDT.
>> 
>> This tries the new method; if not supported, this falls back to
>> the old method.
>> 
>> There is a change in the FDT layout - the old method produced
>> (reserved map, strings, structs), the new one receives only strings and
>> structs from the firmware and adds the final reserved map to the end,
>> so it is (fw reserved map, strings, structs, reserved map).
>> This still produces the same unflattened device tree.
>> 
>> This merges the reserved map from the firmware into the kernel's reserved
>> map. At the moment SLOF generates an empty reserved map so this does not
>> change the existing behaviour in regard of reservations.
>> 
>> This supports only v17 onward as only that version provides dt_struct_size
>> which works as "fdt-fetch" only produces v17 blobs.
>> 
>> If "fdt-fetch" is not available, the old method of fetching the DT is used.
>> 
>> [1] https://git.qemu.org/?p=SLOF.git;a=commitdiff;h=e6fc84652c9c00
>> [2] https://git.qemu.org/?p=SLOF.git;a=commit;h=ecda95906930b80
>> 
>> Signed-off-by: Alexey Kardashevskiy 
>
> Hrm.  I've gotta say I'm not terribly convinced that it's worth adding
> a new interface we'll need to maintain to save 8s on a somewhat
> contrived testcase.

256CPUs aren't that many anymore though. Although I guess that many PCI
devices is still a little uncommon.

A 4 socket POWER8 or POWER9 can easily be that large, and a small test
kernel/userspace will boot in ~2.5-4 seconds. So it's possible that
the device tree fetch could be surprisingly non-trivial percentage of boot
time at least on some machines.


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc: config: skiroot: Add (back) MLX5 ethernet support

2019-04-03 Thread Stewart Smith

Mathieu Malaterre  writes:
> On Wed, Apr 3, 2019 at 2:51 AM Joel Stanley  wrote:
>>
>> It turns out that some defconfig changes and kernel config option
>> changes meant we accidentally dropped Ethernet support for Mellanox CLX5
>> cards.
>>
>> Reported-by: Carol L Soto 
>> Suggested-by: Carol L Soto 
>> Signed-off-by: Stewart Smith 
>> Signed-off-by: Joel Stanley 
>
> Fixes: cbc39809a398 ("powerpc/configs: Update skiroot defconfig")
>
> ?

Yes.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Make opal log only readable by root

2019-02-26 Thread Stewart Smith

Jordan Niethe  writes:
> Currently the opal log is globally readable. It is kernel policy to limit
> the visibility of physical addresses / kernel pointers to root.
> Given this and the fact the opal log may contain this information it would
> be better to limit the readability to root.
>
> Signed-off-by: Jordan Niethe 

Yeah, this is a really good idea.

Reviewed-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: move OPAL call wrapper tracing and interrupt handling to C

2019-02-26 Thread Stewart Smith

Nicholas Piggin  writes:
> The OPAL call wrapper gets interrupt disabling wrong. It disables
> interrupts just by clearing MSR[EE], which has two problems:
>
> - It doesn't call into the IRQ tracing subsystem, which means tracing
>   across OPAL calls does not always notice IRQs have been disabled.
>
> - It doesn't go through the IRQ soft-mask code, which causes a minor
>   bug. MSR[EE] can not be restored by saving the MSR then clearing
>   MSR[EE], because a racing interrupt while soft-masked could clear
>   MSR[EE] between the two steps. This can cause MSR[EE] to be
>   incorrectly enabled when the OPAL call returns. Fortunately that
>   should only result in another masked interrupt being taken to
>   disable MSR[EE] again, but it's a bit sloppy.
>
> The existing code also saves MSR to PACA, which is not re-entrant if
> there is a nested OPAL call from different MSR contexts, which can
> happen these days with SRESET interrupts on bare metal.
>
> To fix these issues, move the tracing and IRQ handling code to C, and
> call into asm just for the low level call when everything is ready to
> go. Save the MSR on stack rather than PACA.
>
> Performance cost is kept to a minimum with a few optimisations:
>
> - The endian switch upon return is combined with the MSR restore,
>   which avoids an expensive context synchronizing operation for LE
>   kernels. This makes up for the additional mtmsrd to enable
>   interrupts with local_irq_enable().
>
> - blr is now used to return from the opal_* functions that are called
>   as C functions, to avoid link stack corruption. This requires a
>   skiboot fix as well to keep the call stack balanced.
>
> A NULL call is more costly after this, (410ns->430ns on POWER9), but
> OPAL calls are generally not performance critical at this scale.

At least with this patch we can start to better measure things, and
that's a big plus. Also, When I get to start worrying about <1ms OPAL
calls, I think we can revisit optimising this path :)


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] mtd: powernv: SPDX and comment fixups

2019-02-05 Thread Stewart Smith

Joel Stanley  writes:
> converts the powernv flash driver to use SPDX, and adds some
> clarifying comments that came out of a discussion on how the mtd driver
> works.
>
> Signed-off-by: Joel Stanley 

We probably don't need to mention the dim dark corners of the FFS format
and I kind of wish it'd die rather than spread further.

Reviewed-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2] powerpc/perf: Quiet IMC PMU registration message

2018-10-10 Thread Stewart Smith

Joel Stanley  writes:
> On a Power9 box we get a few screens full of these on boot. Drop
> them to pr_debug.
>
> [5.993645] nest_centaur6_imc performance monitor hardware support 
> registered
> [5.993728] nest_centaur7_imc performance monitor hardware support 
> registered
> [5.996510] core_imc performance monitor hardware support registered
> [5.996569] nest_mba0_imc performance monitor hardware support registered
> [5.996631] nest_mba1_imc performance monitor hardware support registered
> [5.996685] nest_mba2_imc performance monitor hardware support registered
>
> Signed-off-by: Joel Stanley 

Oh goodness yes.

Reviewed-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Make possible for user to force a full ipl cec reboot

2018-09-09 Thread Stewart Smith

Vaibhav Jain  writes:
> Ever since fast reboot is enabled by default in opal,
> opal_cec_reboot() will use fast-reset instead of full IPL to perform
> system reboot. This leaves the user with no direct way to force a full
> IPL reboot except changing an nvram setting that persistently disables
> fast-reset for all subsequent reboots.
>
> This patch provides a more direct way for the user to force a one-shot
> full IPL reboot by passing the command line argument 'full' to the
> reboot command. So the user will be able to tweak the reboot behavior
> via:
>
>   $ sudo reboot full  # Force a full ipl reboot skipping fast-reset
>
>   or
>   $ sudo reboot   # default reboot path (usually fast-reset)
>
> The reboot command passes the un-parsed command argument to the kernel
> via the 'Reboot' syscall which is then passed on to the arch function
> pnv_restart(). The patch updates pnv_restart() to handle this cmd-arg
> and issues opal_cec_reboot2 with OPAL_REBOOT_FULL_IPL to force a full
> IPL reset.

We're about to introduce an MPIPL reboot type (to take a firmware
assisted kdump style thing), and we maybe should have a reboot type to
force attempting a fast-reboot, and this makes me think if we should add
those in now?

>
> Signed-off-by: Vaibhav Jain 
> ---
>  arch/powerpc/include/asm/opal-api.h| 1 +
>  arch/powerpc/platforms/powernv/setup.c | 8 +++-
>  2 files changed, 8 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 8365353330b4..870fb7b239ea 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -1050,6 +1050,7 @@ enum OpalSysCooling {
>  enum {
>   OPAL_REBOOT_NORMAL  = 0,
>   OPAL_REBOOT_PLATFORM_ERROR  = 1,
> + OPAL_REBOOT_FULL_IPL= 2,
>  };
>  
>  /* Argument to OPAL_PCI_TCE_KILL */
> diff --git a/arch/powerpc/platforms/powernv/setup.c 
> b/arch/powerpc/platforms/powernv/setup.c
> index ae023622..33d2faeacff8 100644
> --- a/arch/powerpc/platforms/powernv/setup.c
> +++ b/arch/powerpc/platforms/powernv/setup.c
> @@ -224,7 +224,13 @@ static void  __noreturn pnv_restart(char *cmd)
>   pnv_prepare_going_down();
>  
>   while (rc == OPAL_BUSY || rc == OPAL_BUSY_EVENT) {
> - rc = opal_cec_reboot();
> +
> + /* See if we need to do a full IPL reboot */
> + if (cmd && strcmp(cmd, "full") == 0)
> + rc = opal_cec_reboot2(OPAL_REBOOT_FULL_IPL, NULL);
> +         else
> + rc = opal_cec_reboot();
> +

If the reboot type isn't supported, what should be the behvaiour? Reboot
the default way or don't reboot at all?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v6 2/2] hwmon: ibmpowernv: Add attributes to enable/disable sensor groups

2018-07-19 Thread Stewart Smith

Shilpasri G Bhat  writes:
> On-Chip-Controller(OCC) is an embedded micro-processor in POWER9 chip
> which measures various system and chip level sensors. These sensors
> comprises of environmental sensors (like power, temperature, current
> and voltage) and performance sensors (like utilization, frequency).
> All these sensors are copied to main memory at a regular interval of
> 100ms. OCC provides a way to select a group of sensors that is copied
> to the main memory to increase the update frequency of selected sensor
> groups. When a sensor-group is disabled, OCC will not copy it to main
> memory and those sensors read 0 values.

OCC is an implementation detail rather than a core part of this firmware
API.

Why not something like this:

OPAL firmware provides the facility for some groups of sensors to be
enabled/disabled at runtime to give the user the option of using the
system resources for collecting these sensors or not.

For example, on POWER9 systems, the On Chip Controller (OCC) gathers
various system and chip level sensors and maintains their values in main
memory.

> +static int init_sensor_group_data(struct platform_device *pdev,
> +   struct platform_data *pdata)
> +{
> + struct sensor_group_data *sgrp_data;
> + struct device_node *groups, *sgrp;
> + enum sensors type;
> + int count = 0, ret = 0;
> +
> + groups = of_find_node_by_path("/ibm,opal/sensor-groups");
> + if (!groups)
> +     return ret;

Why not look for the compatible property?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2 2/2] powerpc: Document issues with TM on POWER9

2018-06-25 Thread Stewart Smith

Michael Neuling  writes:
> Signed-off-by: Michael Neuling 

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 2/2] powerpc: Document issues with TM on POWER9

2018-06-22 Thread Stewart Smith

Michael Neuling  writes:
> +POWER9C DD1.2 and above are only avaliable with POWERNV and hence

PowerVM, not POWERNV


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 1/2] powerpc: Document issues with the DAWR on POWER9

2018-06-22 Thread Stewart Smith

Michael Neuling  writes:
> Signed-off-by: Michael Neuling 
> ---
>  Documentation/powerpc/DAWR-POWER9.txt | 58 +++
>  1 file changed, 58 insertions(+)
>  create mode 100644 Documentation/powerpc/DAWR-POWER9.txt

Acked-by: Stewart Smith 

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpuidle:powernv: Make the snooze timeout dynamic.

2018-06-04 Thread Stewart Smith

Michael Ellerman  writes:
> "Gautham R. Shenoy"  writes:
>
>> From: "Gautham R. Shenoy" 
>>
>> The commit 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of
>> snooze to deeper idle state") introduced a timeout for the snooze idle
>> state so that it could be eventually be promoted to a deeper idle
>> state. The snooze timeout value is static and set to the target
>> residency of the next idle state, which would train the cpuidle
>> governor to pick the next idle state eventually.
>>
>> The unfortunate side-effect of this is that if the next idle state(s)
>> is disabled, the CPU will forever remain in snooze, despite the fact
>> that the system is completely idle, and other deeper idle states are
>> available.
>
> That sounds like a bug, I'll add?
>
> Fixes: 78eaa10f027c ("cpuidle: powernv/pseries: Auto-promotion of snooze to 
> deeper idle state")
> Cc: sta...@vger.kernel.org # v4.2+

Yes, it's a bug - we had a customer bug because we lacked this that
meant we had to do firmware changes rather than just tweaking what stop
states were used.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers

2018-06-03 Thread Stewart Smith

Haren Myneni  writes:
> On 06/03/2018 05:41 PM, Stewart Smith wrote:
>> Haren Myneni  writes:
>>> On 06/01/2018 12:41 AM, Stewart Smith wrote:
>>>> Haren Myneni  writes:
>>>>> NX increments readOffset by FIFO size in receive FIFO control register
>>>>> when CRB is read. But the index in RxFIFO has to match with the
>>>>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>>>>> may be processing incorrect CRBs and can cause CRB timeout.
>>>>>
>>>>> VAS FIFO offset is 0 when the receive window is opened during
>>>>> initialization. When the module is reloaded or in kexec boot, readOffset
>>>>> in FIFO control register may not match with VAS entry. This patch adds
>>>>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>>>>> control register for both high and normal FIFOs.
>>>>>
>>>>> Signed-off-by: Haren Myneni 
>>>>>
>>>>> diff --git a/arch/powerpc/include/asm/opal-api.h 
>>>>> b/arch/powerpc/include/asm/opal-api.h
>>>>> index d886a5b..ff61e4b 100644
>>>>> --- a/arch/powerpc/include/asm/opal-api.h
>>>>> +++ b/arch/powerpc/include/asm/opal-api.h
>>>>> @@ -206,7 +206,8 @@
>>>>>  #define OPAL_NPU_TL_SET  161
>>>>>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164
>>>>>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165
>>>>> -#define OPAL_LAST165
>>>>> +#define  OPAL_NX_COPROC_INIT 167
>>>>> +#define OPAL_LAST167
>>>>>  
>>>>>  /* Device tree flags */
>>>>>  
>>>>> diff --git a/arch/powerpc/include/asm/opal.h 
>>>>> b/arch/powerpc/include/asm/opal.h
>>>>> index 7159e1a..d79eb82 100644
>>>>> --- a/arch/powerpc/include/asm/opal.h
>>>>> +++ b/arch/powerpc/include/asm/opal.h
>>>>> @@ -288,6 +288,7 @@ int64_t opal_imc_counters_init(uint32_t type, 
>>>>> uint64_t address,
>>>>>  int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
>>>>>  int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
>>>>>  int opal_sensor_group_clear(u32 group_hndl, int token);
>>>>> +int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
>>>>>  
>>>>>  s64 opal_signal_system_reset(s32 cpu);
>>>>>  
>>>>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
>>>>> b/arch/powerpc/platforms/powernv/opal-wrappers.S
>>>>> index 3da30c2..c7541a9 100644
>>>>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>>>>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>>>>> @@ -325,3 +325,4 @@ OPAL_CALL(opal_npu_spa_clear_cache,   
>>>>> OPAL_NPU_SPA_CLEAR_CACHE);
>>>>>  OPAL_CALL(opal_npu_tl_set,   OPAL_NPU_TL_SET);
>>>>>  OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,  
>>>>> OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
>>>>>  OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,  
>>>>> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>>>>> +OPAL_CALL(opal_nx_coproc_init,   OPAL_NX_COPROC_INIT);
>>>>> diff --git a/arch/powerpc/platforms/powernv/opal.c 
>>>>> b/arch/powerpc/platforms/powernv/opal.c
>>>>> index 48fbb41..5e13908 100644
>>>>> --- a/arch/powerpc/platforms/powernv/opal.c
>>>>> +++ b/arch/powerpc/platforms/powernv/opal.c
>>>>> @@ -1035,3 +1035,5 @@ void powernv_set_nmmu_ptcr(unsigned long ptcr)
>>>>>  EXPORT_SYMBOL_GPL(opal_int_set_mfrr);
>>>>>  EXPORT_SYMBOL_GPL(opal_int_eoi);
>>>>>  EXPORT_SYMBOL_GPL(opal_error_code);
>>>>> +/* Export the below symbol for NX compression */
>>>>> +EXPORT_SYMBOL(opal_nx_coproc_init);
>>>>> diff --git a/drivers/crypto/nx/nx-842-powernv.c 
>>>>> b/drivers/crypto/nx/nx-842-powernv.c
>>>>> index 1e87637..6c4784d 100644
>>>>> --- a/drivers/crypto/nx/nx-842-powernv.c
>>>>> +++ b/drivers/crypto/nx/nx-842-powernv.c
>>>>> @@ -24,6 +24,8 @@
>>>>>  #include 
>>>>>  #include 
>>>>>  #include 
>>>>> +#include 
>>>>> +#include 
>>>>>  
>>

Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers

2018-06-03 Thread Stewart Smith

Haren Myneni  writes:
> On 06/01/2018 12:41 AM, Stewart Smith wrote:
>> Haren Myneni  writes:
>>> NX increments readOffset by FIFO size in receive FIFO control register
>>> when CRB is read. But the index in RxFIFO has to match with the
>>> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
>>> may be processing incorrect CRBs and can cause CRB timeout.
>>>
>>> VAS FIFO offset is 0 when the receive window is opened during
>>> initialization. When the module is reloaded or in kexec boot, readOffset
>>> in FIFO control register may not match with VAS entry. This patch adds
>>> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
>>> control register for both high and normal FIFOs.
>>>
>>> Signed-off-by: Haren Myneni 
>>>
>>> diff --git a/arch/powerpc/include/asm/opal-api.h 
>>> b/arch/powerpc/include/asm/opal-api.h
>>> index d886a5b..ff61e4b 100644
>>> --- a/arch/powerpc/include/asm/opal-api.h
>>> +++ b/arch/powerpc/include/asm/opal-api.h
>>> @@ -206,7 +206,8 @@
>>>  #define OPAL_NPU_TL_SET161
>>>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR   164
>>>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR   165
>>> -#define OPAL_LAST  165
>>> +#defineOPAL_NX_COPROC_INIT 167
>>> +#define OPAL_LAST  167
>>>  
>>>  /* Device tree flags */
>>>  
>>> diff --git a/arch/powerpc/include/asm/opal.h 
>>> b/arch/powerpc/include/asm/opal.h
>>> index 7159e1a..d79eb82 100644
>>> --- a/arch/powerpc/include/asm/opal.h
>>> +++ b/arch/powerpc/include/asm/opal.h
>>> @@ -288,6 +288,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
>>> address,
>>>  int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
>>>  int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
>>>  int opal_sensor_group_clear(u32 group_hndl, int token);
>>> +int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
>>>  
>>>  s64 opal_signal_system_reset(s32 cpu);
>>>  
>>> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
>>> b/arch/powerpc/platforms/powernv/opal-wrappers.S
>>> index 3da30c2..c7541a9 100644
>>> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
>>> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
>>> @@ -325,3 +325,4 @@ OPAL_CALL(opal_npu_spa_clear_cache, 
>>> OPAL_NPU_SPA_CLEAR_CACHE);
>>>  OPAL_CALL(opal_npu_tl_set, OPAL_NPU_TL_SET);
>>>  OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,
>>> OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
>>>  OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,
>>> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
>>> +OPAL_CALL(opal_nx_coproc_init, OPAL_NX_COPROC_INIT);
>>> diff --git a/arch/powerpc/platforms/powernv/opal.c 
>>> b/arch/powerpc/platforms/powernv/opal.c
>>> index 48fbb41..5e13908 100644
>>> --- a/arch/powerpc/platforms/powernv/opal.c
>>> +++ b/arch/powerpc/platforms/powernv/opal.c
>>> @@ -1035,3 +1035,5 @@ void powernv_set_nmmu_ptcr(unsigned long ptcr)
>>>  EXPORT_SYMBOL_GPL(opal_int_set_mfrr);
>>>  EXPORT_SYMBOL_GPL(opal_int_eoi);
>>>  EXPORT_SYMBOL_GPL(opal_error_code);
>>> +/* Export the below symbol for NX compression */
>>> +EXPORT_SYMBOL(opal_nx_coproc_init);
>>> diff --git a/drivers/crypto/nx/nx-842-powernv.c 
>>> b/drivers/crypto/nx/nx-842-powernv.c
>>> index 1e87637..6c4784d 100644
>>> --- a/drivers/crypto/nx/nx-842-powernv.c
>>> +++ b/drivers/crypto/nx/nx-842-powernv.c
>>> @@ -24,6 +24,8 @@
>>>  #include 
>>>  #include 
>>>  #include 
>>> +#include 
>>> +#include 
>>>  
>>>  MODULE_LICENSE("GPL");
>>>  MODULE_AUTHOR("Dan Streetman ");
>>> @@ -803,9 +805,26 @@ static int __init vas_cfg_coproc_info(struct 
>>> device_node *dn, int chip_id,
>>> if (!coproc)
>>> return -ENOMEM;
>>>  
>>> -   if (!strcmp(priority, "High"))
>>> +   if (!strcmp(priority, "High")) {
>>> +   /*
>>> +* (lpid, pid, tid) combination has to be unique for each
>>> +* coprocessor instance in the system. So to make it
>>> +* unique, skiboot uses coprocessor type such as 842 or
>>> +* GZI

Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers

2018-06-01 Thread Stewart Smith

Haren Myneni  writes:
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
>
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
>
> Signed-off-by: Haren Myneni 
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index d886a5b..ff61e4b 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -206,7 +206,8 @@
>  #define OPAL_NPU_TL_SET  161
>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 164
>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 165
> -#define OPAL_LAST165
> +#define  OPAL_NX_COPROC_INIT 167
> +#define OPAL_LAST167
>  
>  /* Device tree flags */
>  
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 7159e1a..d79eb82 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -288,6 +288,7 @@ int64_t opal_imc_counters_init(uint32_t type, uint64_t 
> address,
>  int opal_get_power_shift_ratio(u32 handle, int token, u32 *psr);
>  int opal_set_power_shift_ratio(u32 handle, int token, u32 psr);
>  int opal_sensor_group_clear(u32 group_hndl, int token);
> +int opal_nx_coproc_init(uint32_t chip_id, uint32_t ct);
>  
>  s64 opal_signal_system_reset(s32 cpu);
>  
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S 
> b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index 3da30c2..c7541a9 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -325,3 +325,4 @@ OPAL_CALL(opal_npu_spa_clear_cache,   
> OPAL_NPU_SPA_CLEAR_CACHE);
>  OPAL_CALL(opal_npu_tl_set,   OPAL_NPU_TL_SET);
>  OPAL_CALL(opal_pci_get_pbcq_tunnel_bar,  
> OPAL_PCI_GET_PBCQ_TUNNEL_BAR);
>  OPAL_CALL(opal_pci_set_pbcq_tunnel_bar,  
> OPAL_PCI_SET_PBCQ_TUNNEL_BAR);
> +OPAL_CALL(opal_nx_coproc_init,   OPAL_NX_COPROC_INIT);
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 48fbb41..5e13908 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -1035,3 +1035,5 @@ void powernv_set_nmmu_ptcr(unsigned long ptcr)
>  EXPORT_SYMBOL_GPL(opal_int_set_mfrr);
>  EXPORT_SYMBOL_GPL(opal_int_eoi);
>  EXPORT_SYMBOL_GPL(opal_error_code);
> +/* Export the below symbol for NX compression */
> +EXPORT_SYMBOL(opal_nx_coproc_init);
> diff --git a/drivers/crypto/nx/nx-842-powernv.c 
> b/drivers/crypto/nx/nx-842-powernv.c
> index 1e87637..6c4784d 100644
> --- a/drivers/crypto/nx/nx-842-powernv.c
> +++ b/drivers/crypto/nx/nx-842-powernv.c
> @@ -24,6 +24,8 @@
>  #include 
>  #include 
>  #include 
> +#include 
> +#include 
>  
>  MODULE_LICENSE("GPL");
>  MODULE_AUTHOR("Dan Streetman ");
> @@ -803,9 +805,26 @@ static int __init vas_cfg_coproc_info(struct device_node 
> *dn, int chip_id,
>   if (!coproc)
>   return -ENOMEM;
>  
> - if (!strcmp(priority, "High"))
> + if (!strcmp(priority, "High")) {
> + /*
> +  * (lpid, pid, tid) combination has to be unique for each
> +  * coprocessor instance in the system. So to make it
> +  * unique, skiboot uses coprocessor type such as 842 or
> +  * GZIP for pid and provides this value to kernel in pid
> +  * device-tree property.
> +  *
> +  * Initialize each NX instance for both high and normal
> +  * priority FIFOs.
> +  */
> + ret = opal_nx_coproc_init(chip_id, pid);
> + if (ret) {
> + pr_err("Failed to initialize NX coproc: %d\n", ret);
> +     ret = opal_error_code(ret);
> + goto err_out;
> + }
> +
>   coproc->ct = VAS_COP_TYPE_842_HIPRI;

I think this should be called for all priority queues as it would be at
least theoretically possible to only have Normal priority queues, in
which case this patch wouldn't fix the problem.


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] crypto/nx: Initialize 842 high and normal RxFIFO control registers

2018-05-31 Thread Stewart Smith

Haren Myneni  writes:
> NX increments readOffset by FIFO size in receive FIFO control register
> when CRB is read. But the index in RxFIFO has to match with the
> corresponding entry in FIFO maintained by VAS in kernel. Otherwise NX
> may be processing incorrect CRBs and can cause CRB timeout.
>
> VAS FIFO offset is 0 when the receive window is opened during
> initialization. When the module is reloaded or in kexec boot, readOffset
> in FIFO control register may not match with VAS entry. This patch adds
> nx_coproc_init OPAL call to reset readOffset and queued entries in FIFO
> control register for both high and normal FIFOs.
>
> Signed-off-by: Haren Myneni 

I've yet to go and check out the skiboot patch properly, but should this
be both:
Fixes: b0d6c9bab crypto/nx: Add P9 NX support for 842 compression engine
CC: stable # v4.14+

as otherwise "rmmod ; insmod" will crash, and possibly even issues over kexec?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpuidle/powernv : Add Description for cpuidle state

2018-05-28 Thread Stewart Smith

Abhishek Goel  writes:
> @@ -215,7 +216,7 @@ static inline void add_powernv_state(int index, const 
> char *name,
>u64 psscr_val, u64 psscr_mask)
>  {
>   strlcpy(powernv_states[index].name, name, CPUIDLE_NAME_LEN);
> - strlcpy(powernv_states[index].desc, name, CPUIDLE_NAME_LEN);
> + strlcpy(powernv_states[index].desc, desc, CPUIDLE_DESC_LEN);

We should still fall back to using name in the event of desc being null,
as not all firmware will expose the descriptions.

> @@ -311,6 +313,11 @@ static int powernv_add_idle_states(void)
>   pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-names in 
> DT\n");
>   goto out;
>   }
> + if (of_property_read_string_array(power_mgt,
> + "ibm,cpu-idle-state-descs", descs, dt_idle_states) < 0) {
> + pr_warn("cpuidle-powernv: missing ibm,cpu-idle-state-descs in 
> DT\n");
> + goto out;
> + }

I don't think pr_warn is appropriate here, as for all current released
firmware we don't have that property. I think perhaps just silently
continuing on is okay, as we have to keep compatibility with that firmware.

> --- a/include/linux/cpuidle.h
> +++ b/include/linux/cpuidle.h
> @@ -17,7 +17,7 @@
>
>  #define CPUIDLE_STATE_MAX10
>  #define CPUIDLE_NAME_LEN 16
> -#define CPUIDLE_DESC_LEN 32
> +#define CPUIDLE_DESC_LEN 60

Do we really get that long?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc-opal: fix spelling mistake "Uniterrupted" -> "Uninterrupted"

2018-05-27 Thread Stewart Smith

Colin King <colin.k...@canonical.com> writes:
> From: Colin Ian King <colin.k...@canonical.com>
>
> Trivial fix to spelling mistake in hmi_error_types text
>
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>

Reviewed-by: Stewart Smith <stew...@linux.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpuidle/powernv : init all present cpus for deep states

2018-05-27 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> writes:
>
>> Init all present cpus for deep states instead of "all possible" cpus.
>> Init fails if the possible cpu is gaurded. Resulting in making only
>> non-deep states available for cpuidle/hotplug.
>
> This is basically the opposite of what we just did for IMC.
>
> There we switched from present to possible, to make it work when some
> CPUs are guarded.
>
> Which makes me think we need a better way of dealing with guarded CPUs,
> because working out which code should use present or possible seems to
> be basically trial-and-error.
>
> I'm not actually sure why Guarded CPUs are showing up as possible but
> not present, did we do that on purpose or is it just happening by
> accident?

My guess is that it flows through from firmware putting the guarded out
CPUs in the device tree with a not "okay" status (which, I just
realised, we're putting something in 'status' that isn't what the
current DeviceTree spec says we should... gah -
https://github.com/open-power/skiboot/issues/178 filed for that one).

The idea behind that is that you can answer "where did all my CPUs go?"
by looking at the device tree rather than having to know the platform
specific way of how guards are stored.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpuidle/powernv : init all present cpus for deep states

2018-05-16 Thread Stewart Smith

Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> writes:
> Init all present cpus for deep states instead of "all possible" cpus.
> Init fails if the possible cpu is gaurded. Resulting in making only
> non-deep states available for cpuidle/hotplug.

Should this also head to stable? It means that for single threaded
workloads, if you guard out a CPU core you'll not get WoF, which means
that performance goes down when you wouldn't expect it to. Right?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: administrivia: mails containing HTML attachments

2018-05-16 Thread Stewart Smith

Stephen Rothwell <s...@canb.auug.org.au> writes:
> I have decided that any email sent to the linuxppc-dev mailing list
> that contains an HTML attachment (or is just an HTML email) will be
> rejected.  The vast majority of such mail are spam (and I have to spend
> time dropping them manually at the moment) and, I presume, anyone on
> this list is capable of sending no HTML email.

certainly one way to get those pesky IBM Verse users off. To be fair,
Documentation/process/email-clients.rst also warns them appropriately.

I fully support ditching HTML mail.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Skiboot] [PATCH 1/2] SLW: Remove stop1_lite and stop0 stop states

2018-05-06 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Stewart Smith <stew...@linux.vnet.ibm.com> writes:
> ...
>>
>> Slightly stupid question: should we be disabling these here or should
>> Linux be better and deciding what states to use?
>>
>> I'm inclined to say this is a Linux problem as it should make the
>> decision of what hardware feature to used based on the ones OPAL says
>> *can* be used.
>
> Yeah I agree.
>
> Firmware shouldn't be implementing the policy around what states to use,
> it should tell the operating system (which might be Linux) what states
> are available and what their features are.

Yeah... I think I should work out somewhere to put this in the
documentation, a kind of design philosophy we can point back to.

> The exception to that would be that we have unfixable crash bugs in
> existing kernels, in that case firmware might have to filter out states
> that are known to cause those.

s/in/with/ and *cough* stop11 for example.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Skiboot] [PATCH 1/2] SLW: Remove stop1_lite and stop0 stop states

2018-05-03 Thread Stewart Smith

Nicholas Piggin <npig...@gmail.com> writes:
> On Thu, 3 May 2018 14:36:47 +0530
> Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> wrote:
>
>> On Tue, May 01, 2018 at 01:47:23PM +1000, Nicholas Piggin wrote:
>> > On Mon, 30 Apr 2018 14:42:08 +0530
>> > Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> wrote:
>> >   
>> > > Powersaving for stop0_lite and stop1_lite is observed to be quite similar
>> > > and both states resume without state loss. Using context_switch test [1]
>> > > we observe that stop0_lite has slightly lower latency, hence removing
>> > > stop1_lite.
>> > > 
>> > > [1] linux/tools/testing/selftests/powerpc/benchmarks/context_switch.c
>> > > 
>> > > Signed-off-by: Akshay Adiga <akshay.ad...@linux.vnet.ibm.com>  
>> > 
>> > I'm okay for removing stop1_lite and stop2_lite -- SMT switching
>> > is very latency critical. If we decide to actually start saving
>> > real power then SMT should already have been switched.
>> > 
>> > So I would put stop1_lite and stop2_lite removal in the same patch.  
>> 
>> I can do this.
>> 
>> > 
>> > Then what do we have? stop0_lite, stop0, stop1 for our fast idle
>> > states.  
>> 
>> Currently we were looking at  stop0_lite , stop1 as the fast idle states
>> because stop0 and stop1 have similar latency and powersaving.
>> Having so many low latency states does not make sense.
>> 
>> > 
>> > I would be against removing stop0 if that is our fastest way to
>> > release SMT resources, even if there is only a small advantage. Why
>> > not remove stop1 instead?
>> >  
>> SMT-folding comes into picture only when we have at least one thread
>> running in the core. stop0 and stop1 has exactly same power-saving and
>> both will release SMT resources if at least one thread in the core is
>> running.
>
> Right, but you don't know that other threads are running or will remain
> running when you enter stop. If not, then latency is higher for stop1,
> no? So we need to be using stop0.
>
>> 
>> As soon as all threads are idle core enters stop0/stop1, where stop1
>> does a bit more powersaving than stop0.
>> 
>> > We also need to better evaluate stop0_lite. How much advantage does
>> > that have over snooze?  
>> 
>> I evaluated snooze and stop0_lite, there is an additional ipi latency of
>> a few microseconds in case of stop0_lite. So snooze cannot still be
>> replaced by stop0_lite.
>
> I meant the other way around. Replace stop0_lite with snooze.
>
> So we would have snooze, stop0, stop2, and stop4 and/or 5.

Slightly stupid question: should we be disabling these here or should
Linux be better and deciding what states to use?

I'm inclined to say this is a Linux problem as it should make the
decision of what hardware feature to used based on the ones OPAL says
*can* be used.

I'm also open to be being convinced otherwise though...

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpufreq: powernv: Fix the hardlockup by synchronus smp_call in timer interrupt

2018-04-23 Thread Stewart Smith

Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> writes:
> gpstate_timer_handler() uses synchronous smp_call to set the pstate
> on the requested core. This causes the below hard lockup:
>
> [c03fe566b320] [c01d5340] smp_call_function_single+0x110/0x180 
> (unreliable)
> [c03fe566b390] [c01d55e0] smp_call_function_any+0x180/0x250
> [c03fe566b3f0] [c0acd3e8] gpstate_timer_handler+0x1e8/0x580
> [c03fe566b4a0] [c01b46b0] call_timer_fn+0x50/0x1c0
> [c03fe566b520] [c01b4958] expire_timers+0x138/0x1f0
> [c03fe566b590] [c01b4bf8] run_timer_softirq+0x1e8/0x270
> [c03fe566b630] [c0d0d6c8] __do_softirq+0x158/0x3e4
> [c03fe566b710] [c0114be8] irq_exit+0xe8/0x120
> [c03fe566b730] [c0024d0c] timer_interrupt+0x9c/0xe0
> [c03fe566b760] [c0009014] decrementer_common+0x114/0x120
> --- interrupt: 901 at doorbell_global_ipi+0x34/0x50
> LR = arch_send_call_function_ipi_mask+0x120/0x130
> [c03fe566ba50] [c004876c] 
> arch_send_call_function_ipi_mask+0x4c/0x130 (unreliable)
> [c03fe566ba90] [c01d59f0] smp_call_function_many+0x340/0x450
> [c03fe566bb00] [c0075f18] pmdp_invalidate+0x98/0xe0
> [c03fe566bb30] [c03a1120] change_huge_pmd+0xe0/0x270
> [c03fe566bba0] [c0349278] change_protection_range+0xb88/0xe40
> [c03fe566bcf0] [c03496c0] mprotect_fixup+0x140/0x340
> [c03fe566bdb0] [c0349a74] SyS_mprotect+0x1b4/0x350
> [c03fe566be30] [c000b184] system_call+0x58/0x6c
>
> Fix this by using the asynchronus smp_call in the timer interrupt handler.
> We don't have to wait in this handler until the pstates are changed on
> the core. This change will not have any impact on the global pstate
> ramp-down algorithm.
>
> Reported-by: Nicholas Piggin <npig...@gmail.com>
> Reported-by: Pridhiviraj Paidipeddi <ppaid...@linux.vnet.ibm.com>
> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
> ---
>  drivers/cpufreq/powernv-cpufreq.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/cpufreq/powernv-cpufreq.c 
> b/drivers/cpufreq/powernv-cpufreq.c
> index 0591874..7e0c752 100644
> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
> @@ -721,7 +721,7 @@ void gpstate_timer_handler(struct timer_list *t)
>   spin_unlock(>gpstate_lock);
>
>   /* Timer may get migrated to a different cpu on cpu hot unplug */
> - smp_call_function_any(policy->cpus, set_pstate, _data, 1);
> + smp_call_function_any(policy->cpus, set_pstate, _data, 0);
>  }

Should this have:
Fixes: eaa2c3aeef83f
and CC stable v4.7+ ?

-- 
Stewart Smith
OPAL Architect, IBM.

[PATCH] hvc_opal: don't set tb_ticks_per_usec in udbg_init_opal_common()

2018-03-29 Thread Stewart Smith

time_init() will set up tb_ticks_per_usec based on reality.
time_init() is called *after* udbg_init_opal_common() during boot.

from arch/powerpc/kernel/time.c:
  unsigned long tb_ticks_per_usec = 100; /* sane default */

Currently, all powernv systems have a timebase frequency of 512mhz
(51200/100 == 0x200) - although there's nothing written
down anywhere that I can find saying that we couldn't make that
different based on the requirements in the ISA.

So, we've been (accidentally) thwacking the (currently) correct
(for powernv at least) value for tb_ticks_per_usec earlier than
we otherwise would have.

The "sane default" seems to be adequate for our purposes between
udbg_init_opal_common() and time_init() being called, and if it isn't,
then we should probably be setting it somewhere that isn't hvc_opal.c!

Signed-off-by: Stewart Smith <stew...@linux.ibm.com>
---
 drivers/tty/hvc/hvc_opal.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/tty/hvc/hvc_opal.c b/drivers/tty/hvc/hvc_opal.c
index 2ed07ca6389e..9645c0062a90 100644
--- a/drivers/tty/hvc/hvc_opal.c
+++ b/drivers/tty/hvc/hvc_opal.c
@@ -318,7 +318,6 @@ static void udbg_init_opal_common(void)
udbg_putc = udbg_opal_putc;
udbg_getc = udbg_opal_getc;
udbg_getc_poll = udbg_opal_getc_poll;
-   tb_ticks_per_usec = 0x200; /* Make udelay not suck */
 }
 
 void __init hvc_opal_init_early(void)
-- 
2.14.3

Re: [PATCH] powerpc/powernv/nvram: opal_nvram_write handle unknown OPAL errors

2018-03-28 Thread Stewart Smith

Nicholas Piggin <npig...@gmail.com> writes:
> opal_nvram_write currently just assumes success if it encounters an
> error other than OPAL_BUSY or OPAL_BUSY_EVENT. Have it return -EIO
> on other errors instead.
>
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> ---
>  arch/powerpc/platforms/powernv/opal-nvram.c | 2 ++
>  1 file changed, 2 insertions(+)

Acked-by: Stewart Smith <stew...@linux.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv/mce: Don't silently restart the machine

2018-03-07 Thread Stewart Smith

Balbir Singh <bsinghar...@gmail.com> writes:
> On MCE the current code will restart the machine with
> ppc_md.restart(). This case was extremely unlikely since
> prior to that a skiboot call is made and that resulted in
> a checkstop for analysis.
>
> With newer skiboots, on P9 we don't checkstop the box by
> default, instead we return back to the kernel to extract
> useful information at the time of the MCE. While we still
> get this information, this patch converts the restart to
> a panic(), so that if configured a dump can be taken and
> we can track and probably debug the potential issue causing
> the MCE.

This will likely change again, but I can send a patch that changes the
comment (along with the logic of decoding it all and having enough
information to make sensible decisions). But... I kind of don't want to
bikeshed a comment to death :)

I reckon the panic() here is the right thing to do no matter
what.

Reviewed-by: Stewart Smith <stew...@linux.vnet.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] cpuidle/powernv : Restore different PSSCR for idle and hotplug

2018-02-25 Thread Stewart Smith

Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> writes:
> commit 1e1601b38e6e ("powerpc/powernv/idle: Restore SPRs for deep idle
> states via stop API.") uses stop-api provided by the firmware to restore
> PSSCR. PSSCR restore is required for handling special wakeup. When special
> wakeup is completed, the core enters stop state based on restored PSSCR.
>
> Currently PSSCR is restored to deepest available stop state, causing
> a idle cpu to enter deeper stop state on a special wakeup, which causes
> the cpu to hang on wakeup.
>
> A "sensors" command which reads temperature (through DTS sensors) on idle
> cpu can trigger special wakeup.
>
> Failed Scenario :
> Request restore of PSSCR with RL = 11
> cpu enters idle state (stop5)
>   user triggers "sensors" command
>Assert special wakeup on cpu
>  Restores PSSCR with RL = 11  < Done by firmware
>   Read DTS sensor
>Deassert special wakeup
>   cpu enters idle state (stop11) <-- Instead of stop5
>
> Cpu hang is caused because cpu ended up in a deeper state than it requested
>
> This patch fixes instability caused by special wakeup when stop11 is
> enabled. Requests restore of PSSCR to deepest stop state used by cpuidle.
> Only when offlining cpu, request restore of PSSCR to deepest stop state.
> On onlining cpu, request restore of PSSCR to deepest stop state used by
> cpuidle.
>
> Fixes : 1e1601b38e6e ("powerpc/powernv/idle: Restore SPRs for deep idle
> states via stop API.")

This should CC stable ?

We'll need this to enable stop11 in firmware and not break things, right?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Turn on SCSI_AACRAID in powernv_defconfig

2018-02-22 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Brian King <brk...@linux.vnet.ibm.com> writes:
>> On 09/03/2017 06:19 PM, Stewart Smith wrote:
>>> Michael Ellerman <m...@ellerman.id.au> writes:
>>>>> 2. On a bare metal machine, if you set ipr.fast_reboot=1 on the skiboot
>>>>>kernel, then we should also avoid resetting the ipr adapter, so ipr
>>>>>init on the kernel being kexec booted from skiboot should be extremely 
>>>>> fast. 
>>>>
>>>> OK, I didn't know that was an option, so that might help.
>>>>
>>>>> ...
>>>>> If you've got cases where ipr init is taking a long time, I'd be
>>>>> interested to know what scenarios are the most annoying to see if there
>>>>> is any opportunity to improve.
>>>>
>>>> Yeah booting bare metal is where I see it (not using ipr.fast_reboot).
>>> 
>>> Hrm... We should probably enable that by default for petitboot then.
>>> 
>>> It'd at least cut some time off booting straight through to OS.
>>
>> Agreed. I'd be interested to hear if that helps address the issue
>> Michael is seeing.
>>
>> You can easily test this by exiting to a petitboot shell:
>>
>> echo 1 > /sys/module/ipr/parameters/fast_reboot
>>
>> Then go back to petitboot and boot the OS.
>
> Just following up on this (!).
>
> This does work, and I've now been running it in my CI for about a month
> (~1000 boots) with no problems.
>
> You can also make it persistent by doing:
>
>   $ nvram -p ibm,skiboot --update-config bootargs="ipr.fast_reboot=1"

Okay, cool. https://github.com/open-power/op-build/pull/1900 will set it
in firmware - we may as well run with this and fix any bugs we find.

Any reason why it isn't the default behaviour?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v9 1/2] powerpc/powernv: Enable tunneled operations

2018-02-20 Thread Stewart Smith

Philippe Bergheaud <fe...@linux.vnet.ibm.com> writes:
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 94bd1bf2c873..07b5e2240ecc 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -204,7 +204,9 @@
>  #define OPAL_NPU_SPA_SETUP   159
>  #define OPAL_NPU_SPA_CLEAR_CACHE 160
>  #define OPAL_NPU_TL_SET  161
> -#define OPAL_LAST161
> +#define OPAL_PCI_GET_PBCQ_TUNNEL_BAR 162
> +#define OPAL_PCI_SET_PBCQ_TUNNEL_BAR 163
> +#define OPAL_LAST163


These numbers will change as other calls have gone in with those OPAL
API ids.

Feel free to send a tiny patch to skiboot reserving some though.

I should probably encourage people to do that when they start working on
new API calls.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [RFC] powerpc/powernv/mce: Don't silently restart the machine

2018-02-20 Thread Stewart Smith

Balbir Singh <bsinghar...@gmail.com> writes:
> On MCE the current code will restart the machine with
> ppc_md.restart(). This case was extremely unlikely since
> prior to that a skiboot call is made and that resulted in
> a checkstop for analysis.
>
> With newer skiboots, on P9 we don't checkstop the box by
> default, instead we return back to the kernel to extract
> useful information at the time of the MCE. While we still
> get this information, this patch converts the restart to
> a panic(), so that if configured a dump can be taken and
> we can track and probably debug the potential issue causing
> the MCE.

I agree with the patch, although I'd be nervous stating that skiboot is
going to keep this behaviour. In *theory* we should only ever get a
platform error when there's actually something that isn't the kernel's
fault.

Like any firmware promise though, it's slightly less reliable than one
from a politician.

I'd say that in this case deferring to policy on what to do in event of
panic() is the right thing.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2 (skiboot)] dt: add /cpus/ibm, powerpc-cpu-features device tree bindings

2018-02-20 Thread Stewart Smith

Nicholas Piggin <npig...@gmail.com> writes:
> This is a new CPU feature advertising interface that is fine-grained,
> extensible, aware of privilege levels, and gives control of features
> to all levels of the stack (firmware, hypervisor, and OS).
>
> The design and binding specification is described in detail in doc/.
>
> Signed-off-by: Nicholas Piggin <npig...@gmail.com>
> ---
> Since v1:
>
> - Fold branch-v3 into fixed-point-v3, as pointed out by Segher they
>   are all fixed point facilities.
>
> - Fixed typo (Segher)
>
>  core/Makefile.inc  |   2 +-
>  core/cpufeatures.c | 921 
> +
>  core/device.c  |   7 +
>  core/init.c|   1 +
>  .../ibm,powerpc-cpu-features/binding.txt   | 245 ++
>  .../ibm,powerpc-cpu-features/design.txt| 157 
>  include/device.h   |   1 +
>  include/skiboot.h  |   5 +
>  8 files changed, 1338 insertions(+), 1 deletion(-)
>  create mode 100644 core/cpufeatures.c
>  create mode 100644 doc/device-tree/ibm,powerpc-cpu-features/binding.txt
>  create mode 100644
>  doc/device-tree/ibm,powerpc-cpu-features/design.txt

Merged to master as of 7f4c8e8ce0b78ca046643d7f4f63d81f4fd11746 and has
made it into -rc4

-- 
Stewart Smith
OPAL Architect, IBM.

Re: rtc-opal: Fix handling of firmware error codes, prevent busy loops

2018-02-05 Thread Stewart Smith

Alexandre Belloni <alexandre.bell...@bootlin.com> writes:
> On 02/08/2016 at 11:50:16 +1000, Stewart Smith wrote:
>> According to the OPAL docs:
>> https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-read-3.txt
>> https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-write-4.txt
>> OPAL_HARDWARE may be returned from OPAL_RTC_READ or OPAL_RTC_WRITE and this
>> indicates either a transient or permanent error.
>> 
>> Prior to this patch, Linux was not dealing with OPAL_HARDWARE being a
>> permanent error particularly well, in that you could end up in a busy
>> loop.
>> 
>> This was not too hard to trigger on an AMI BMC based OpenPOWER machine
>> doing a continuous "ipmitool mc reset cold" to the BMC, the result of
>> that being that we'd get stuck in an infinite loop in opal_get_rtc_time.
>> 
>> We now retry a few times before returning the error higher up the stack.
>> 
>> Cc: sta...@vger.kernel.org
>> Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
>> ---
>>  drivers/rtc/rtc-opal.c | 12 ++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>> 
>
> Just a note to let you know that this patch should have gone through my
> tree but it was not sent to linux-rtc or me.
>
> I guess what happened is that Michael cleaned up the Linux PPC patchwork
> queue.

Apologies for not sending there. My (18 month ago self) bad.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 3/5] powerpc: Reduce log level of "OPAL detected !" message

2018-01-15 Thread Stewart Smith

Benjamin Herrenschmidt <b...@kernel.crashing.org> writes:
> This message isn't terribly useful.
>
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>

> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -127,7 +127,7 @@ int __init early_init_dt_scan_opal(unsigned long node,
>
>   if (of_flat_dt_is_compatible(node, "ibm,opal-v3")) {
>   powerpc_firmware_features |= FW_FEATURE_OPAL;
> - pr_info("OPAL detected !\n");
> + pr_debug("OPAL detected !\n");

I've been meaning to add in a thing to grab the version from the device
tree and print that too.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv : Add support to enable sensor groups

2017-12-03 Thread Stewart Smith

Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> writes:
> On 11/28/2017 05:07 PM, Michael Ellerman wrote:
>> Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> writes:
>> 
>>> Adds support to enable/disable a sensor group. This can be used to
>>> select the sensor groups that needs to be copied to main memory by
>>> OCC. Sensor groups like power, temperature, current, voltage,
>>> frequency, utilization can be enabled/disabled at runtime.
>>>
>>> Signed-off-by: Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com>
>>> ---
>>> The skiboot patch for the opal call is posted below:
>>> https://lists.ozlabs.org/pipermail/skiboot/2017-November/009713.html
>> 
>> Can you remind me why we're doing this with a completely bespoke sysfs
>> API, rather than using some generic sensors API?
>> 
>
> Disabling/Enabling sensor groups is not supported in the current generic 
> sensors
> API. And also we dont export all type of sensors in HWMON as not all of them 
> are
> environment sensors (like performance).

Are there barriers to adding such concepts to the generic sensors API?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Add kernel cmdline parameter to disable imc

2017-10-12 Thread Stewart Smith

Anju T Sudhakar <a...@linux.vnet.ibm.com> writes:
> On Wednesday 11 October 2017 01:55 AM, Stewart Smith wrote:
>> Michael Ellerman <m...@ellerman.id.au> writes:
>>> Anju T Sudhakar <a...@linux.vnet.ibm.com> writes:
>>>
>>>> Add a kernel command line parameter option to disable In-Memory Collection
>>>> (IMC) counters and add documentation. This helps in debug.
>>> I'd really rather we didn't. Do we *really* need this?
>>>
>>> We don't have command line parameters to disable any of the other ~20
>>> PMUs, why is this one special?
>
> This one is really helpful in debugging, incase if we want to proceed 
> without nest counters  OR
> core counters . But if we have the facility to do the same from 
> petitboot, its fine.

https://patchwork.ozlabs.org/patch/824497/ (skiboot patch) should do the
same thing, right? That means we come up with them off (as we should
have from the start) and if you don't use the PMUs then the ucode won't
be started.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Add kernel cmdline parameter to disable imc

2017-10-10 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Anju T Sudhakar <a...@linux.vnet.ibm.com> writes:
>
>> Add a kernel command line parameter option to disable In-Memory Collection
>> (IMC) counters and add documentation. This helps in debug.
>
> I'd really rather we didn't. Do we *really* need this?
>
> We don't have command line parameters to disable any of the other ~20
> PMUs, why is this one special?

You could also do the same thing by editing the device tree before
booting your kernel, we do have the facility to do that in petitboot.

A recent firmware patch: https://patchwork.ozlabs.org/patch/823249/
would fix the firmware implementation where the counters were already
running before the INIT/START calls, which are likely the cause of the
problems that this patch is trying to work around.

I propose we have the firmware do the right thing and nothing special in
kernel. i.e. not to merge this.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powernv: Add OCC driver to mmap sensor area

2017-10-01 Thread Stewart Smith

Shilpasri G Bhat <shilpa.b...@linux.vnet.ibm.com> writes:
> This driver provides interface to mmap the OCC sensor area
> to userspace to parse and read OCC inband sensors.

Why?

Is this for debug? If so, the existing exports interface should be used.

If there's actual sensors, we already have two ways of exposing sensors
to Linux: the OPAL_SENSOR API and the IMC API.

Why this method and not use the existing ones?

-- 
Stewart Smith
OPAL Architect, IBM.

[PATCH] drivers: of: increase MAX_RESERVED_REGIONS to 32

2017-09-26 Thread Stewart Smith

There are two types of memory reservations firmware can ask the kernel
to make in the device tree: static and dynamic.
See Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt

If you have greater than 16 entries in /reserved-memory (as we do on
POWER9 systems) you would get this scary looking error message:
 [0.00] OF: reserved mem: not enough space all defined regions.

This is harmless if all your reservations are static (which with OPAL on
POWER9, they are).

It is not harmless if you have any dynamic reservations after the 16th.

In the first pass over the fdt to find reservations, the child nodes of
/reserved-memory are added to a static array in of_reserved_mem.c so that
memory can be reserved in a 2nd pass. The array has 16 entries. This is why,
on my dual socket POWER9 system, I get that error 4 times with 20 static
reservations.

We don't have a problem on ppc though, as in arch/powerpc/kernel/prom.c
we look at the new style /reserved-ranges property to do reservations,
and this logic was introduced in 0962e8004e974 (well before any powernv
system shipped).

A Google search shows up no occurances of that exact error message, so we're
probably safe in that no machine that people use has memory not being reserved
when it should be.

The simple fix is to bump the length of the array to 32 which "should be
enough for everyone(TM)". The simple fix of not recording static allocations
in the array would cause problems for devices with "memory-region" properties.
A more future-proof fix is likely possible, although more invasive and this
simple fix is perfectly suitable in the meantime while a more future-proof
fix is developed.

Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
---
 drivers/of/of_reserved_mem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/of/of_reserved_mem.c b/drivers/of/of_reserved_mem.c
index d507c3569a88..32771c2ced7b 100644
--- a/drivers/of/of_reserved_mem.c
+++ b/drivers/of/of_reserved_mem.c
@@ -25,7 +25,7 @@
 #include 
 #include 
 
-#define MAX_RESERVED_REGIONS   16
+#define MAX_RESERVED_REGIONS   32
 static struct reserved_mem reserved_mem[MAX_RESERVED_REGIONS];
 static int reserved_mem_count;
 
-- 
2.13.5

Re: [PATCH] drivers: of: static DT reservations incorrectly added to dynamic list

2017-09-26 Thread Stewart Smith

Rob Herring <r...@kernel.org> writes:
> On Thu, Sep 14, 2017 at 5:24 AM, Stewart Smith
> <stew...@linux.vnet.ibm.com> wrote:
>> There are two types of memory reservations firmware can ask the kernel
>> to make in the device tree: static and dynamic.
>> See Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt
>>
>> If you have greater than 16 entries in /reserved-memory (as we do on
>> POWER9 systems) you would get this scary looking error message:
>> [0.00] OF: reserved mem: not enough space all defined regions.
>>
>> This is harmless if all your reservations are static (which with OPAL on
>> POWER9, they are).
>>
>> It is not harmless if you have any dynamic reservations after the 16th.
>>
>> In the first pass over the fdt to find reservations, the child nodes of
>> /reserved-memory are added to a static array in of_reserved_mem.c so that
>> memory can be reserved in a 2nd pass. The array has 16 entries. This is why,
>> on my dual socket POWER9 system, I get that error 4 times with 20 static
>> reservations.
>>
>> We don't have a problem on ppc though, as in arch/powerpc/kernel/prom.c
>> we look at the new style /reserved-ranges property to do reservations,
>> and this logic was introduced in 0962e8004e974 (well before any powernv
>> system shipped).
>>
>> Google shows up no occurances of that exact error message, so we're probably
>> safe in that no machine that people use has memory not being reserved when
>> it should be.
>>
>> The fix is simple, as there's a different code path for static and dynamic
>> allocations, we just don't add the region to the list if it's static. Since
>> it's a static *OR* dynamic region, this is a perfectly valid thing to do
>> (although I have not checked every real world device tree on the planet
>> for this condition)
>
> If the region is dynamic (i.e. no reg prop), then we bail from
> __reserved_mem_reserve_reg. So wouldn't this change make the list be
> empty always?

We get the dynamic node in __fdt_scan_reserved_mem() rather
than__reserved_mem_reserve_reg():


static int __init __reserved_mem_reserve_reg(unsigned long node,
 const char *uname)
{
...
prop = of_get_flat_dt_prop(node, "reg", );
if (!prop)
return -ENOENT;
...
}

static int __init __fdt_scan_reserved_mem(unsigned long node, const char *uname,
  int depth, void *data)
{


err = __reserved_mem_reserve_reg(node, uname);
if (err == -ENOENT && of_get_flat_dt_prop(node, "size", NULL))
fdt_reserved_mem_save_node(node, uname, 0, 0);

/* scan next node */
return 0;
}

So that should capture the dynamic reservations (as they're the ones
with the size property) to be handled by fdt_init_reserved_mem() later
in boot.

> Won't we have problems with lookups for devices with a "memory-region"
> property if static allocations are not in the list?

Ahh yep, I see the issue.

The array is being used for two things: reserving the memory and looking it
up during device init (seems like only used on ARM, which is why it
Worked For Me(TM) on POWER :)

It looks a bit more involved making that work, although not impossible.

> I'm inclined to just make the safe and easy change of increasing the
> array to 32 entries. That should be enough for everyone (TM).

that would certainly solve my immediate problem :)

(of course, given a CPU generation or two I'm sure we'd hit it again)

I'll send a patch that does that, and can asynchronously work on a patch
that addresses the device lookup of memory region problem (although
that'll be fairly down on my list of things to look at).

-- 
Stewart Smith
OPAL Architect, IBM.

[PATCH] drivers: of: static DT reservations incorrectly added to dynamic list

2017-09-14 Thread Stewart Smith

There are two types of memory reservations firmware can ask the kernel
to make in the device tree: static and dynamic.
See Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt

If you have greater than 16 entries in /reserved-memory (as we do on
POWER9 systems) you would get this scary looking error message:
[0.00] OF: reserved mem: not enough space all defined regions.

This is harmless if all your reservations are static (which with OPAL on
POWER9, they are).

It is not harmless if you have any dynamic reservations after the 16th.

In the first pass over the fdt to find reservations, the child nodes of
/reserved-memory are added to a static array in of_reserved_mem.c so that
memory can be reserved in a 2nd pass. The array has 16 entries. This is why,
on my dual socket POWER9 system, I get that error 4 times with 20 static
reservations.

We don't have a problem on ppc though, as in arch/powerpc/kernel/prom.c
we look at the new style /reserved-ranges property to do reservations,
and this logic was introduced in 0962e8004e974 (well before any powernv
system shipped).

Google shows up no occurances of that exact error message, so we're probably
safe in that no machine that people use has memory not being reserved when
it should be.

The fix is simple, as there's a different code path for static and dynamic
allocations, we just don't add the region to the list if it's static. Since
it's a static *OR* dynamic region, this is a perfectly valid thing to do
(although I have not checked every real world device tree on the planet
for this condition)

Fixes: 3f0c8206644836e4f10a6b9fc47cda6a9a372f9b
Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
---
NOTE: I've done only fairly limited testing of this on POWER, I
certainly haven't tested on ARM or *anything* with dynamic
allocations. So, testing and comments welcome.
---
 drivers/of/fdt.c | 6 +-
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/of/fdt.c b/drivers/of/fdt.c
index ce30c9a588a4..a9a44099ed69 100644
--- a/drivers/of/fdt.c
+++ b/drivers/of/fdt.c
@@ -587,7 +587,7 @@ static int __init __reserved_mem_reserve_reg(unsigned long 
node,
phys_addr_t base, size;
int len;
const __be32 *prop;
-   int nomap, first = 1;
+   int nomap;
 
prop = of_get_flat_dt_prop(node, "reg", );
if (!prop)
@@ -614,10 +614,6 @@ static int __init __reserved_mem_reserve_reg(unsigned long 
node,
uname, , (unsigned long)size / SZ_1M);
 
len -= t_len;
-   if (first) {
-   fdt_reserved_mem_save_node(node, uname, base, size);
-   first = 0;
-   }
}
return 0;
 }
-- 
2.13.5

Re: [PATCH] powerpc/powernv: Turn on SCSI_AACRAID in powernv_defconfig

2017-09-03 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
>> 2. On a bare metal machine, if you set ipr.fast_reboot=1 on the skiboot
>>kernel, then we should also avoid resetting the ipr adapter, so ipr
>>init on the kernel being kexec booted from skiboot should be extremely 
>> fast. 
>
> OK, I didn't know that was an option, so that might help.
>
>> ...
>> If you've got cases where ipr init is taking a long time, I'd be
>> interested to know what scenarios are the most annoying to see if there
>> is any opportunity to improve.
>
> Yeah booting bare metal is where I see it (not using ipr.fast_reboot).

Hrm... We should probably enable that by default for petitboot then.

It'd at least cut some time off booting straight through to OS.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc/powernv: Turn on SCSI_AACRAID in powernv_defconfig

2017-08-31 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> OK. So maybe when the petitboot kernel moves up to 4.14 they may want to
> flip it back to being a module.

Yeah. We tend to keep all RAID adapters as modules because they almost
all uniquely try to undo all boot time optimization done by anybody,
anywhere, ever.

> I think the majority of machines that have one of these adapters will be
> using it for their root disk, so I think the slow initialisation is just
> something we have to suffer.

It does seem that way currently.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH V8 1/3] powernv: powercap: Add support for powercap framework

2017-07-28 Thread Stewart Smith

Cyril Bur <cyril...@gmail.com> writes:
> This is more for MPE and Stewart:
> If there is some mutual exclusion that needs to happen, it should be
> done in skiboot. We've had problems in the past with double entering
> skiboot and hard to interpret errors ending up in userspace, this
> solves that but it seems more like a coverup than an actual solution.

Yeah, I'd like us to be in a position where we don't force mutual
exclusion on such things out to the OS.

The real issue this papers over is the async token stuff you're
reworking. There's that, plus for this specific thing, you *could* go
and set a differnt powercap 180 times concurrently and we should do the
right thing in skiboot... but then you're sitting in skiboot rather than
sitting in linux being able to go run some other task on the thread.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 2/2] mtd: powernv_flash: Lock around concurrent access to OPAL

2017-07-17 Thread Stewart Smith

Cyril Bur <cyril...@gmail.com> writes:
> OPAL can only manage one flash access at a time and will return an
> OPAL_BUSY error for each concurrent access to the flash. The simplest
> way to prevent this from happening is with a mutex.
>
> Signed-off-by: Cyril Bur <cyril...@gmail.com>
> ---
> This is to address https://github.com/open-power/skiboot/issues/80

part of me wants to say "let's only take the lock if we ever get back
OPAL_BUSY" or something like that (have a DT property to say the flash
supports concurrent ops?)

Although the other part of me says "you're overthinking this, get back
to the work you're meant to be doing" :)

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Skiboot] [RFC PATCH] powerpc/powernv: report error messages from opal

2017-07-06 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Stewart Smith <stew...@linux.vnet.ibm.com> writes:
>> Oliver O'Halloran <ooh...@gmail.com> writes:
>>> diff --git a/arch/powerpc/include/asm/opal-api.h 
>>> b/arch/powerpc/include/asm/opal-api.h
>>> index 0e2e57bcab50..cb9c0e6afb33 100644
>>> --- a/arch/powerpc/include/asm/opal-api.h
>>> +++ b/arch/powerpc/include/asm/opal-api.h
>>> @@ -167,7 +167,8 @@
>>>  #define OPAL_INT_EOI   124
>>>  #define OPAL_INT_SET_MFRR  125
>>>  #define OPAL_PCI_TCE_KILL  126
>>> -#define OPAL_LAST  126
>>> +#define OPAL_SCRAPE_LOG128
>>
>> (another thought, along with the skiboot thoughts), I don't like the
>> SCRAPE_LOG name so much, as it's more of a "hey linux, here's some log
>> messages from firmware, possibly before you were
>> involved"... OPAL_FETCH_LOG ?
>
> I'm not a huge fan of an interrupt followed by an opal call just to
> fetch a single line of log.
>
> Can't we do something more like the existing msglog code, where we have
> a ring buffer and then the interrupt just becomes "hey Linux you should
> look at your ring buffer".

Yeah... that would probably be a bit better...

Although that would mean we can only ever tell the running kernel what's
in the ring buffer. We couldn't work out a way to (after kexec) resend
important bits of info such as "we garded things out, your OCC didn't
start and we trained everything at really slow speeds"


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [RFC PATCH] powerpc/powernv: report error messages from opal

2017-07-05 Thread Stewart Smith

Oliver O'Halloran <ooh...@gmail.com> writes:
> Recent versions of skiboot will raise an OPAL event (read: interrupt)
> when firmware writes an error message to its internal console. In
> conjunction they provide an OPAL call that the kernel can use to extract
> these messages from the OPAL log to allow them to be written into the
> kernel's log buffer where someone will (hopefully) look at them.
>
> For the companion skiboot patches see:
>
>   https://lists.ozlabs.org/pipermail/skiboot/2016-December/005861.html
>
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> ---
>  arch/powerpc/include/asm/opal-api.h|  5 +++-
>  arch/powerpc/include/asm/opal.h|  1 +
>  arch/powerpc/platforms/powernv/opal-msglog.c   | 41 
> ++
>  arch/powerpc/platforms/powernv/opal-wrappers.S |  1 +
>  4 files changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/include/asm/opal-api.h 
> b/arch/powerpc/include/asm/opal-api.h
> index 0e2e57bcab50..cb9c0e6afb33 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -167,7 +167,8 @@
>  #define OPAL_INT_EOI 124
>  #define OPAL_INT_SET_MFRR125
>  #define OPAL_PCI_TCE_KILL126
> -#define OPAL_LAST126
> +#define OPAL_SCRAPE_LOG  128

(another thought, along with the skiboot thoughts), I don't like the
SCRAPE_LOG name so much, as it's more of a "hey linux, here's some log
messages from firmware, possibly before you were
involved"... OPAL_FETCH_LOG ?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powerpc: Invalidate ERAT on powersave wakeup for POWER9

2017-06-22 Thread Stewart Smith

Michael Neuling <mi...@neuling.org> writes:
> On POWER9 the ERAT may be incorrect on wakeup from some stop states
> that lose state. This causes random segvs and illegal instructions
> when these stop states are enabled.
>
> This patch invalidates the ERAT on wakeup on POWER9 to prevent this
> from causing a problem.
>
> Signed-off-by: Michael Neuling <mi...@neuling.org>

CC: stable?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v8 07/10] powerpc/perf: PMU functions for Core IMC and hotplugging

2017-05-17 Thread Stewart Smith

Anju T Sudhakar <a...@linux.vnet.ibm.com> writes:
> --- a/arch/powerpc/include/asm/imc-pmu.h
> +++ b/arch/powerpc/include/asm/imc-pmu.h
> @@ -24,6 +24,7 @@
>   */
>  #define IMC_MAX_CHIPS32
>  #define IMC_MAX_PMUS 32
> +#define IMC_MAX_CORES32
>
>  /*
>   * This macro is used for memory buffer allocation of
> @@ -38,6 +39,11 @@
>  #define IMC_NEST_MAX_PAGES   64
>
>  /*
> + * IMC Core engine expects 8K bytes of memory for counter collection.
> + */
> +#define IMC_CORE_COUNTER_MEM 8192

Any reason for this not to be in the device tree?

This is the size of memory that Linux needs to give OPAL, so the size of
that should probably come from OPAL rather than Linux.

Otherwise, if in the future, we had a counter with an offset greater
than 8192 in, we'd have no way to detect that and we'd fail silently by
overwriting memory.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v8 05/10] powerpc/perf: IMC pmu cpumask and cpuhotplug support

2017-05-11 Thread Stewart Smith

Madhavan Srinivasan <ma...@linux.vnet.ibm.com> writes:
>>  * in patch 9 should opal_imc_counters_init return something other
>>than OPAL_SUCCESS in the case on invalid arguments? Maybe
>>OPAL_PARAMETER? (I think you fix this in a later patch anyway?)
>
> So, init call will return OPAL_PARAMETER for the unsupported
> domains (core and nest are supported). And if the init operation
> fails for any reason, it would return OPAL_HARDWARE. And this is
> documented.

(I'll comment on the skiboot one too), but I think that if the class
exists but init is a no-op, then OPAL_IMC_COUNTERS_INIT should return
OPAL_SUCCESS and just do nothing. This future proofs everything, and the
API is that one *must* call _INIT before start.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v8 02/10] powerpc/powernv: Autoload IMC device driver module

2017-05-11 Thread Stewart Smith

   return -ENODEV;
> +
> + imc_dev = pdev->dev.of_node;
> +
> + /*
> +  * Nest counter data are saved in a reserved memory called HOMER.
> +  * "imc-nest-offset" identifies the counter data location within HOMER.
> +  * size : size of the entire nest-counters region
> +  */
> + if (of_property_read_u32(imc_dev, "imc-nest-offset", _offset))
> + goto err;
> +
> + if (of_property_read_u32(imc_dev, "imc-nest-size", _size))
> + goto err;
> +
> + /* Sanity check */
> + if ((nest_size/PAGE_SIZE) > IMC_NEST_MAX_PAGES)
> + goto err;
> +
> + /* Find the "HOMER region" for each chip */
> + rm_node = of_find_node_by_path("/reserved-memory");
> + if (!rm_node)
> + goto err;
> +
> + /*
> +  * We need to look for the "ibm,homer-image" node in the
> +  * "/reserved-memory" node.
> +  */
> + for (dn = of_find_node_by_name(rm_node, "ibm,homer-image"); dn;
> + dn = of_find_node_by_name(dn, "ibm,homer-image")) {
> +
> + /* Get the chip id to which the above homer region belongs to */
> + if (of_property_read_u32(dn, "ibm,chip-id", _id))
> + goto err;

So, I was thinking on this (and should probably comment on the firmware
side as well).

I'd prefer an OPAL interface where instead of looking up where
ibm,homer-image is, we provide the kernel with a base address and then
have offsets into it.

That way, we don't tie the kernel code to counters that are only in the
HOMER region.


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

2017-04-06 Thread Stewart Smith

Madhavan Srinivasan <ma...@linux.vnet.ibm.com> writes:
> +#define IMC_MAX_CHIPS32
> +#define IMC_MAX_PMUS 32

The max chips and PMUs we'd be able to work out from the device tre
though, right? We could just allocate the correct amount of memory on
boot.

We may hot plug/unplug CPUs, but we're not doing that from a hardware
level, what CPUs you get in the DT on PowerNV on boot is all you're
getting.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v6 03/11] powerpc/powernv: Detect supported IMC units and its events

2017-04-06 Thread Stewart Smith

Madhavan Srinivasan <ma...@linux.vnet.ibm.com> writes:
> --- a/arch/powerpc/platforms/powernv/opal-imc.c
> +++ b/arch/powerpc/platforms/powernv/opal-imc.c
> @@ -33,6 +33,388 @@

> +static void imc_pmu_setup(struct device_node *parent)
> +{
> + struct device_node *child;
> + int pmu_count = 0, rc = 0;
> + const struct property *pp;
> +
> + if (!parent)
> + return;
> +
> + /* Setup all the IMC pmus */
> + for_each_child_of_node(parent, child) {
> + pp = of_get_property(child, "compatible", NULL);
> + if (pp) {
> + /*
> +  * If there is a node with a "compatible" field,
> +  * that's a PMU node
> +  */
> + rc = imc_pmu_create(child, pmu_count);
> + if (rc)
> + return;
> + pmu_count++;
> + }
> + }
> +}

This doesn't strike me as the right kind of structure, the presence of a
compatible property really just says "hey, there's this device and it's
compatible with these ways of accessing it".

I'm guessing the idea behind having imc-nest-offset/size in a top level
node is because it's common to everything under it and the aim is to not
blow up the device tree to be enormous.

So why not go after each ibm,imc-counters-nest compatible node under the
top level ibm,opal-in-memory-counters node? (i'm not convinced that
having ibm,ibmc-counters-nest versus ibm,imc-counters-core and
ibm,imc-counters-thread as I see in the dts is correct though, as
they're all accessed exactly the same way?)

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v6 01/11] powerpc/powernv: Data structure and macros definitions

2017-04-06 Thread Stewart Smith

Madhavan Srinivasan <ma...@linux.vnet.ibm.com> writes:

> From: Hemant Kumar <hem...@linux.vnet.ibm.com>
>
> Create new header file "imc-pmu.h" to add the data structures
> and macros needed for IMC pmu support.
>
> Signed-off-by: Anju T Sudhakar <a...@linux.vnet.ibm.com>
> Signed-off-by: Hemant Kumar <hem...@linux.vnet.ibm.com>
> Signed-off-by: Madhavan Srinivasan <ma...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/imc-pmu.h | 68 
> ++
>  1 file changed, 68 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/imc-pmu.h
>
> diff --git a/arch/powerpc/include/asm/imc-pmu.h 
> b/arch/powerpc/include/asm/imc-pmu.h
> new file mode 100644
> index ..a3d4f1bf9492
> --- /dev/null
> +++ b/arch/powerpc/include/asm/imc-pmu.h
> @@ -0,0 +1,68 @@
> +#ifndef PPC_POWERNV_IMC_PMU_DEF_H
> +#define PPC_POWERNV_IMC_PMU_DEF_H
> +
> +/*
> + * IMC Nest Performance Monitor counter support.
> + *
> + * Copyright (C) 2016 Madhavan Srinivasan, IBM Corporation.
> + *   (C) 2016 Hemant K Shaw, IBM Corporation.
> + *
> + * This program is free software; you can redistribute it and/or modify it
> + * under the terms of the GNU General Public License version 2 as published
> + * by the Free Software Foundation.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define IMC_MAX_CHIPS32
> +#define IMC_MAX_PMUS 32
> +#define IMC_MAX_PMU_NAME_LEN 256

Why do we need a max length? We get the actual lengths from the device
tree, so we know at each point in time what the length of any new string
should be, right?

Otherwise you appear to be, in the general case, using 10x the memory
than you could.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v6 02/11] powerpc/powernv: Autoload IMC device driver module

2017-04-06 Thread Stewart Smith

   _name))
> + continue;
> + if (strncmp("ibm,homer-image", node_name,
> + strlen("ibm,homer-image")))
> + continue;

A better way to do this would be to reference the memory region, like
what's shown in
Documentation/devicetree/bindings/reserved-memory/reserved-memory.txt

just reference the phandle of the memory region.

seeing as these are per chip, why not just have something linking
together chip-id and the IMC layout node?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 4/5] crypto/nx: Add P9 NX support for 842 compression engine.

2017-04-02 Thread Stewart Smith

Haren Myneni <ha...@linux.vnet.ibm.com> writes:
> @@ -656,13 +953,21 @@ static __init int nx842_powernv_init(void)
>   BUILD_BUG_ON(DDE_BUFFER_ALIGN % DDE_BUFFER_SIZE_MULT);
>   BUILD_BUG_ON(DDE_BUFFER_SIZE_MULT % DDE_BUFFER_LAST_MULT);
>
> - for_each_compatible_node(dn, NULL, "ibm,power-nx")
> - nx842_powernv_probe(dn);
> + if (is_vas_available()) {
> + for_each_compatible_node(dn, NULL, "ibm,xscom")
> + nx842_powernv_probe_vas(dn);

I'm not keen on how the device bindings work, instead, I think firmware
should provide a 'ibm,vas' compatible node, rather than simply searching
through all the ibm,xscom nodes.

XSCOMs aren't something that Linux should really know about, it's a
debug interface, and one we use through PRD to do PRD-things, XSCOMs
aren't part of the architecture.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Skiboot] [PATCH][OPAL] cpufeatures: add base and POWER8, POWER9 /cpus/features dt

2017-03-23 Thread Stewart Smith

Nicholas Piggin <npig...@gmail.com> writes:
> On Tue, 21 Mar 2017 15:42:22 +1100
> Stewart Smith <stew...@linux.vnet.ibm.com> wrote:
>
>> Nicholas Piggin <npig...@gmail.com> writes:
>> > With this patch and the Linux one, I can boot (in mambo) a POWER8 or
>> > POWER9 without looking up any cpu tables, and mainly looking at PVR
>> > for MCE and PMU.  
>> 
>> That's pretty neat. I now await the day where somebody produces a chip
>> with uneven feature sets between cores.
>> 
>> > Machine and ISA speicfic features that are not abstracted by firmware
>> > and not captured here will have to be handled on a case by case basis,
>> > using PVR if necessary. Major ones that remain are PMU and machine
>> > check.  
>> 
>> At least for machine check we could probably get something "good enough"
>> without too much effort.
>
> Machine check I'd like to put it in opal with a special case entry from
> the hypervisor. At the moment it just has a lot of implementation specific
> decoding of bits, and we can't really call opal to do anything useful
> with normal call because it re-enters the stack.

Ahh yep. Something along the lines of a machine check specific stack in
OPAL could do it, and we queue up calls etc etc.

It turns out my "not too much effort" metric is possibly different from
other people's :)

>> For PMU, I wonder if we could get something that's suitably common with
>> the IMC work being done so that we could "just work" on new PMUs (albeit
>> calling into OPAL for enable/disable)
>
> I don't know the PMU code, but if we could have some kind of firmware
> fallback like that it would be perfect.
>
> For machine specific implementations that are faster or more capable,
> I guess looking at PVR is not taboo as such. Just that it shouldn't
> be needed to boot and get some reasonable baseline.

Yeah, I'm not that familiar with it either. I need to go look over it
all in my copious amount of free time.

>> > Open question is where and how to develop and document these features?
>> > Not the dt representation created by skiboot, but the exact nature of
>> > each feature. What exact behaviour does a particular feature imply,
>> > etc.  
>> 
>> Part of me seems to think this could be something for the Architecture,
>> and then we just have a 1-to-1 mapping of the arch bits and we're all
>> one big happy family
>> 
>> Benh/Paulus can probably laugh at me suitably hard for suggesting such a
>> thing being possible though :)
>
> I think to a degree we might be moving in that direction. Probably
> can't say much more in public at the moment, but at least this dt
> implementation must be flexible enough to cope with a range of
> approaches we might decide to take (fine/coarse grained, fscr bits
> or not, etc).

Yeah, I think so too.

>> > --- a/hdata/cpu-common.c
>> > +++ b/hdata/cpu-common.c  
>> 
>> > +  { "vector-crypto",
>> > +  CPU_ALL,
>> > +  ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
>> > +  HV_SUPPORT_NONE, OS_SUPPORT_NONE,
>> > +  -1, -1, 38,
>> > +  "vector", },  
>> 
>> Did we want to break this down at all? Specifically maybe just the AES
>> instructions?
>> 
>> AFAIK it's been the only set where there was some amount of discussion
>> about somebody possibly not wanting to include.
>
> I don't have much opinion on it. Userspace has only the one feature bit
> there, so missing part of the instructions would have to disable the
> entire thing.
>
> That's not to say we couldn't add another bit and start getting userspace
> to use it. So if there is some need or strong potential future need,
> we should consider it.

Hopefully we're right on ISA3.00 for new instructions that random folk
may say could be disabled if they fab'd their own chip :)

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Skiboot] [PATCH][OPAL] cpufeatures: add base and POWER8, POWER9 /cpus/features dt

2017-03-20 Thread Stewart Smith

Nicholas Piggin <npig...@gmail.com> writes:
> With this patch and the Linux one, I can boot (in mambo) a POWER8 or
> POWER9 without looking up any cpu tables, and mainly looking at PVR
> for MCE and PMU.

That's pretty neat. I now await the day where somebody produces a chip
with uneven feature sets between cores.

> Machine and ISA speicfic features that are not abstracted by firmware
> and not captured here will have to be handled on a case by case basis,
> using PVR if necessary. Major ones that remain are PMU and machine
> check.

At least for machine check we could probably get something "good enough"
without too much effort.

For PMU, I wonder if we could get something that's suitably common with
the IMC work being done so that we could "just work" on new PMUs (albeit
calling into OPAL for enable/disable)

> Open question is where and how to develop and document these features?
> Not the dt representation created by skiboot, but the exact nature of
> each feature. What exact behaviour does a particular feature imply,
> etc.

Part of me seems to think this could be something for the Architecture,
and then we just have a 1-to-1 mapping of the arch bits and we're all
one big happy family

Benh/Paulus can probably laugh at me suitably hard for suggesting such a
thing being possible though :)

I see the kernel patch has docs for the DT bindings, I'd like to also
keep a copy in the skiboot tree, if only to further my impossible quest
to somewhat document the OPAL ABI.

> diff --git a/core/device.c b/core/device.c
> index 30b31f46..1900ba71 100644
> --- a/core/device.c
> +++ b/core/device.c
> @@ -548,6 +548,13 @@ u32 dt_property_get_cell(const struct dt_property *prop, 
> u32 index)
>   return fdt32_to_cpu(((const u32 *)prop->prop)[index]);
>  }
>  
> +void dt_property_set_cell(struct dt_property *prop, u32 index, u32 val)
> +{
> + assert(prop->len >= (index+1)*sizeof(u32));
> + /* Always aligned, so this works. */
> + ((u32 *)prop->prop)[index] = cpu_to_fdt32(val);
> +}
> +
>  /* First child of this node. */
>  struct dt_node *dt_first(const struct dt_node *root)
>  {
> diff --git a/core/init.c b/core/init.c
> index 58f96f47..938920eb 100644
> --- a/core/init.c
> +++ b/core/init.c
> @@ -703,6 +703,8 @@ static void per_thread_sanity_checks(void)
>  /* Called from head.S, thus no prototype. */
>  void main_cpu_entry(const void *fdt);
>  
> +extern void mambo_add_cpu_features(struct dt_node *root);
> +
>  void __noreturn __nomcount main_cpu_entry(const void *fdt)
>  {
>   /*
> @@ -774,6 +776,7 @@ void __noreturn __nomcount main_cpu_entry(const void *fdt)
>   abort();
>   } else {
>   dt_expand(fdt);
> + mambo_add_cpu_features(dt_root);
>   }

(without checking closely, just going from memory), this would also the
the case on P8 OpenPOWER hardware, as we get the device tree from
Hostboot there too.

>   /* Now that we have a full devicetree, verify that we aren't on fire. */
> diff --git a/hdata/cpu-common.c b/hdata/cpu-common.c
> index aa2752c1..1da1b1cb 100644
> --- a/hdata/cpu-common.c
> +++ b/hdata/cpu-common.c

> + { "vector-crypto",
> + CPU_ALL,
> + ISA_BASE, USABLE_HV|USABLE_OS|USABLE_PR,
> + HV_SUPPORT_NONE, OS_SUPPORT_NONE,
> + -1, -1, 38,
> + "vector", },

Did we want to break this down at all? Specifically maybe just the AES
instructions?

AFAIK it's been the only set where there was some amount of discussion
about somebody possibly not wanting to include.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2017-02-28 Thread Stewart Smith

Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
> Added check for OPAL_WRONG_STATE error code returned from OPAL.
> Currently Linux flashes "unexpected error" over console for this
> error. This will avoid throwing such message and return I/O error
> for such OPAL failures.
>
> Signed-off-by: Vipin K Parashar <vi...@linux.vnet.ibm.com>
> ---
> Changes in v2:
>  - Added log message indicating sleeping/offline core
>for OPAL_WRONG_STATE
>
>  arch/powerpc/platforms/powernv/opal.c | 5 -
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal.c 
> b/arch/powerpc/platforms/powernv/opal.c
> index 86d9fde..8af230e 100644
> --- a/arch/powerpc/platforms/powernv/opal.c
> +++ b/arch/powerpc/platforms/powernv/opal.c
> @@ -869,8 +869,11 @@ int opal_error_code(int rc)
>   case OPAL_UNSUPPORTED:  return -EIO;
>   case OPAL_HARDWARE: return -EIO;
>   case OPAL_INTERNAL_ERROR:   return -EIO;
> + case OPAL_WRONG_STATE:
> + pr_notice("%s: Core sleeping/offline\n", __func__);
> + return -EIO;

Since this is part of opal_error_code() though, this will be printed for
any OPAL call that returns that.

Why not have the sensor code do this:

rc = opal_sensor_read(foo)
if (rc == OPAL_WRONG_STATE)
   return -EIO;
else
   return oal_error_code(rc);

?

>   default:
> - pr_err("%s: unexpected OPAL error %d\n", __func__, rc);
> + pr_err("%s: Unexpected OPAL error %d\n", __func__, rc);

Do we need this?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [Patch v4] powerpc/powernv: add hdat attribute to sysfs

2017-02-28 Thread Stewart Smith

Matt Brown <matthew.brown@gmail.com> writes:
> The HDAT data area is consumed by skiboot and turned into a device-tree.
> In some cases we would like to look directly at the HDAT, so this patch
> adds a sysfs node to allow it to be viewed.  This is not possible through
> /dev/mem as it is reserved memory which is stopped by the /dev/mem filter.
>
> Signed-off-by: Matt Brown <matthew.brown@gmail.com>
> ---
> Changes between v3 and v4:
>   - changed sysfs attribute permissions from 0444 to 0400
>   - fixed makefile to be on same line
>   - fixed authorship/copyright info
>   - re-ordered includes
>   - changed hdat_info struct to a static struct
>
> ---
>  arch/powerpc/include/asm/opal.h|  1 +
>  arch/powerpc/platforms/powernv/Makefile|  2 +-
>  arch/powerpc/platforms/powernv/opal-hdat.c | 60 
> ++
>  arch/powerpc/platforms/powernv/opal.c  |  2 +
>  4 files changed, 64 insertions(+), 1 deletion(-)
>  create mode 100644 arch/powerpc/platforms/powernv/opal-hdat.c
>
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index 5c7db0f..b26944e 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -277,6 +277,7 @@ extern int opal_async_comp_init(void);
>  extern int opal_sensor_init(void);
>  extern int opal_hmi_handler_init(void);
>  extern int opal_event_init(void);
> +extern void opal_hdat_sysfs_init(void);
>
>  extern int opal_machine_check(struct pt_regs *regs);
>  extern bool opal_mce_check_early_recovery(struct pt_regs *regs);
> diff --git a/arch/powerpc/platforms/powernv/Makefile 
> b/arch/powerpc/platforms/powernv/Makefile
> index b5d98cb..3826b70 100644
> --- a/arch/powerpc/platforms/powernv/Makefile
> +++ b/arch/powerpc/platforms/powernv/Makefile
> @@ -2,7 +2,7 @@ obj-y += setup.o opal-wrappers.o opal.o 
> opal-async.o idle.o
>  obj-y+= opal-rtc.o opal-nvram.o opal-lpc.o 
> opal-flash.o
>  obj-y+= rng.o opal-elog.o opal-dump.o 
> opal-sysparam.o opal-sensor.o
>  obj-y+= opal-msglog.o opal-hmi.o opal-power.o 
> opal-irqchip.o
> -obj-y+= opal-kmsg.o
> +obj-y+= opal-kmsg.o opal-hdat.o
>
>  obj-$(CONFIG_SMP)+= smp.o subcore.o subcore-asm.o
>  obj-$(CONFIG_PCI)+= pci.o pci-ioda.o npu-dma.o
> diff --git a/arch/powerpc/platforms/powernv/opal-hdat.c 
> b/arch/powerpc/platforms/powernv/opal-hdat.c
> new file mode 100644
> index 000..19647fd
> --- /dev/null
> +++ b/arch/powerpc/platforms/powernv/opal-hdat.c
> @@ -0,0 +1,60 @@
> +/*
> + * PowerNV OPAL HDAT interface
> + *
> + * Copyright 2017, Matt Brown, IBM Corp.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static struct {
> + char *base;
> + u64 size;
> +} hdat_info;
> +
> +/* Read function for HDAT attribute in sysfs */
> +static ssize_t hdat_read(struct file *file, struct kobject *kobj,
> +  struct bin_attribute *bin_attr, char *to,
> +  loff_t pos, size_t count)
> +{
> + if (!hdat_info.base)
> + return -ENODEV;
> +
> + return memory_read_from_buffer(to, count, , hdat_info.base,
> + hdat_info.size);
> +}
> +
> +/* HDAT attribute for sysfs */
> +static struct bin_attribute hdat_attr = {
> + .attr = {.name = "hdat", .mode = 0400},
> + .read = hdat_read
> +};

Why not BIN_ATTR_RO (like opal_export_symmap() does) ?

(I have comments/thoughts on the OPAL side as well)

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2017-02-22 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:

> Stewart Smith <stew...@linux.vnet.ibm.com> writes:
>
>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>>>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>>>
>>>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>>>
>>>>> done to read any core FIR which is sleeping, offline.
>>>> OK.
>>>>
>>>> Do we know why Linux is causing that to happen?
>>>
>>> This issue is originally seen upon running STAF (Software Test
>>> Automation Framework) stress tests and off-lining some cores
>>> with stress tests running.
>>>
>>> It can also be re-created after off-lining few cores and following
>>> one of below methods.
>>> 1. Executing Linux "sensors" command
>>> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
>>> where X is offline CPU.
>>>
>>> Its "opal_get_sensor_data" Linux API that that triggers
>>> OPAL call "opal_sensor_read", performing XSCOM ops here.
>>> If core is found sleeping/offline Linux throws up
>>> "opal_error_code: Unexpected OPAL error" error onto console.
>>>
>>> Currently Linux isn't aware about OPAL_WRONG_STATE return code
>>> from OPAL. Thus it prints "Unexpected OPAL error" message, same
>>> as it would log for any unknown OPAL return codes.
>>>
>>> Seeing this error over console has been a concern for Test and
>>> would puzzle real user as well. This patch makes Linux aware about
>>> OPAL_WRONG_STATE return code from OPAL and stops printing
>>> "Unexpected OPAL error" message onto console for OPAL fails
>>> with OPAL_WRONG_STATE
>>
>> Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
>> return the xscom_read return code in event of error.
>>
>> I would argue that converting to EIO in that instance is probably
>> correct... or EAGAIN? EAGAIN may be more correct in the situation where
>> the core is just sleeping.
>>
>> What kind of offlining are you doing?
>>
>> Arguably, the correct behaviour would be to remove said sensors when the
>> core is offline.
>
> Right, that would be ideal. There appear to be at least two other hwmon
> drivers that are CPU hotplug aware (coretemp and via-cputemp).
>
> But perhaps it's not possible to work out which sensors are attached to
> which CPU etc., I haven't looked in detail.

Each core-temp@ sensor has a ibm,pir property, so linking back to what
core shouldn't be too hard. For mem-temp@ sensors, we have the chip-id.

> In that case changing just opal_get_sensor_data() to handle
> OPAL_WRONG_STATE would be OK, with a comment explaining that we might be
> asked to read a sensor on an offline CPU and we aren't able to detect
> that.

Agree.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [RFC] Remove memory from nodes for memtrace.

2017-02-22 Thread Stewart Smith

Rashmica Gupta <rashmic...@gmail.com> writes:
> +}
> +machine_device_initcall(powernv, memtrace_init);
> +
> +
> +
> +/* XXX FIXME DEBUG CRAP */
> +machine_device_initcall(pseries, memtrace_init);

Should the fixme be there?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2017-02-15 Thread Stewart Smith

Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
> On Monday 13 February 2017 06:13 AM, Michael Ellerman wrote:
>> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>>
>>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>>
>>> done to read any core FIR which is sleeping, offline.
>> OK.
>>
>> Do we know why Linux is causing that to happen?
>
> This issue is originally seen upon running STAF (Software Test
> Automation Framework) stress tests and off-lining some cores
> with stress tests running.
>
> It can also be re-created after off-lining few cores and following
> one of below methods.
> 1. Executing Linux "sensors" command
> 2. Reading contents of file /sys/class/hwmon/hwmon0/tempX_input,
> where X is offline CPU.
>
> Its "opal_get_sensor_data" Linux API that that triggers
> OPAL call "opal_sensor_read", performing XSCOM ops here.
> If core is found sleeping/offline Linux throws up
> "opal_error_code: Unexpected OPAL error" error onto console.
>
> Currently Linux isn't aware about OPAL_WRONG_STATE return code
> from OPAL. Thus it prints "Unexpected OPAL error" message, same
> as it would log for any unknown OPAL return codes.
>
> Seeing this error over console has been a concern for Test and
> would puzzle real user as well. This patch makes Linux aware about
> OPAL_WRONG_STATE return code from OPAL and stops printing
> "Unexpected OPAL error" message onto console for OPAL fails
> with OPAL_WRONG_STATE

Ahh... so this is a DTS sensor, which indeed is just XSCOMs and we
return the xscom_read return code in event of error.

I would argue that converting to EIO in that instance is probably
correct... or EAGAIN? EAGAIN may be more correct in the situation where
the core is just sleeping.

What kind of offlining are you doing?

Arguably, the correct behaviour would be to remove said sensors when the
core is offline.

>> It's also returned from many of the XIVE routines if we're in the wrong
>> xive mode, all of which would indicate a fairly bad Linux bug.
>>
>> Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
>> explicitly so we could differentiate from other errors:
>>
>>  commit 9c2d82394fd2303847cac4a665dee62556ca528a
>>  Author: Russell Currey <rus...@russell.cc>
>>  AuthorDate: Mon Mar 21 12:00:00 2016 +1100
>>
>>  xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
>>  
>>  xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
>>  OPAL_HARDWARE if they didn't.  This doesn't provide information about 
>> why
>>  the operation failed, such as if the CPU happens to be asleep.
>>  
>>  This is specifically useful in error scanning, so if every CPU is being
>>  scanned for errors, sleeping CPUs likely aren't the cause of failures.
>>  
>>  So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
>>  sleeping.
>>  
>>  Signed-off-by: Russell Currey <rus...@russell.cc>
>>  Reviewed-by: Alistair Popple <alist...@popple.id.au>
>>  Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
>>
>>
>>
>> So I'm still not convinced that quietly swallowing this error and
>> mapping it to -EIO along with several of the other error codes is the
>> right thing to do.
>
> How about returning -ENXIO upon receiving OPAL_WRONG_STATE ?

I think ENXIO would be incorrect, as it is "No such device or address"
when that almost certainly isn't the case. It's a solvable error -
there's just some XSCOMS you can't do at certain times.


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCHv3 4/4] MAINTAINERS: Remove powerpc's opal match

2017-02-15 Thread Stewart Smith

Jon Derrick <jonathan.derr...@intel.com> writes:
> PPC's 'opal' match pattern also matches block/sed-opal.c, where it looks
> like the 'arch/powerpc' file pattern should be enough to match powerpc
> opal code by itself. Remove the opal regex pattern from powerpc.

This patch will end up missing some code, what about this instead:


Remove OPAL regex in powerpc to avoid false match

Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>

---
 MAINTAINERS |8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3960e7f..25ed25a 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7393,18 +7393,24 @@ L:  linuxppc-dev@lists.ozlabs.org
 Q: http://patchwork.ozlabs.org/project/linuxppc-dev/list/
 T: git git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git
 S: Supported
+F: Documentation/ABI/stable/sysfs-firmware-opal-*
+F: Documentation/devicetree/bindings/powerpc/opal/
+F: Documentation/devicetree/bindings/rtc/rtc-opal.txt
+F: Documentation/devicetree/bindings/i2c/i2c-opal.txt
 F: Documentation/powerpc/
 F: arch/powerpc/
 F: drivers/char/tpm/tpm_ibmvtpm*
 F: drivers/crypto/nx/
 F: drivers/crypto/vmx/
+F: drivers/i2c/busses/ic2-opal.c
 F: drivers/net/ethernet/ibm/ibmveth.*
 F: drivers/net/ethernet/ibm/ibmvnic.*
 F: drivers/pci/hotplug/pnv_php.c
 F: drivers/pci/hotplug/rpa*
+F: drivers/rtc/rtc-opal.c
 F: drivers/scsi/ibmvscsi/
+F: drivers/tty/hvc/hvc_opal.c
 F: tools/testing/selftests/powerpc
-N: opal
 N: /pmac
 N: powermac
 N: powernv


-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 2/2] powerpc/powernv/opal-dump : Use IRQ_HANDLED instead of numbers in interrupt handler

2017-02-14 Thread Stewart Smith

Mukesh Ojha <mukes...@linux.vnet.ibm.com> writes:

> Converts all the return explicit number to a more proper IRQ_HANDLED,
> which looks proper incase of interrupt handler returning case.
>
> Signed-off-by: Mukesh Ojha <mukes...@linux.vnet.ibm.com>
> Reviewed-by: Vasant Hegde <hegdevas...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/opal-dump.c | 9 +++--
>  1 file changed, 3 insertions(+), 6 deletions(-)

Should also have:
Fixes: 8034f715f ("powernv/opal-dump: Convert to irq domain")

?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 1/2] powerpc/powernv/opal-dump : Handles opal_dump_info properly

2017-02-14 Thread Stewart Smith

Mukesh Ojha <mukes...@linux.vnet.ibm.com> writes:
> Moves the return value check of 'opal_dump_info' to a proper place which
> was previously unnecessarily filling all the dump info even on failure.
>
> Signed-off-by: Mukesh Ojha <mukes...@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/opal-dump.c | 9 ++---
>  1 file changed, 6 insertions(+), 3 deletions(-)


Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powernv/opal: Handle OPAL_WRONG_STATE error from OPAL fails

2017-02-14 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Vipin K Parashar <vi...@linux.vnet.ibm.com> writes:
>
>> OPAL returns OPAL_WRONG_STATE for XSCOM operations
>>
>> done to read any core FIR which is sleeping, offline.
>
> OK.
>
> Do we know why Linux is causing that to happen?
>
> It's also returned from many of the XIVE routines if we're in the wrong
> xive mode, all of which would indicate a fairly bad Linux bug.
>
> Also the skiboot patch which added WRONG_STATE for XSCOM ops did so
> explicitly so we could differentiate from other errors:
>
> commit 9c2d82394fd2303847cac4a665dee62556ca528a
> Author: Russell Currey <rus...@russell.cc>
> AuthorDate: Mon Mar 21 12:00:00 2016 +1100
>
> xscom: Return OPAL_WRONG_STATE on XSCOM ops if CPU is asleep
> 
> xscom_read and xscom_write return OPAL_SUCCESS if they worked, and
> OPAL_HARDWARE if they didn't.  This doesn't provide information about why
> the operation failed, such as if the CPU happens to be asleep.
> 
> This is specifically useful in error scanning, so if every CPU is being
> scanned for errors, sleeping CPUs likely aren't the cause of failures.
> 
> So, return OPAL_WRONG_STATE in xscom_read and xscom_write if the CPU is
> sleeping.
> 
> Signed-off-by: Russell Currey <rus...@russell.cc>
> Reviewed-by: Alistair Popple <alist...@popple.id.au>
> Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
>
>
>
> So I'm still not convinced that quietly swallowing this error and
> mapping it to -EIO along with several of the other error codes is the
> right thing to do.

FWIW I agree - pretty limited cases where it should just be converted
into -EIO and passed on - probably *just* the debugfs interface to be honest.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH RFC] powerpc/powernv: sysfs entry to force full IPL reboot

2016-12-22 Thread Stewart Smith

Andrew Donnellan <andrew.donnel...@au1.ibm.com> writes:
> skiboot now supports "fast reboot", a reboot procedure where skiboot
> reinitialises hardware and loads a new kernel without re-IPLing the
> machine. At present, fast reboot support is still experimental and is not
> enabled by default, however it is intended that it will be enabled by
> default in a near-future release.
>
> There may be some circumstances where the user wants to force a full IPL
> reboot rather than using fast reboot. Add support for the
> OPAL_REBOOT_FULL_IPL reboot type, enabled by writing 1 to
> /sys/firmware/opal/force_full_ipl_reboot. On versions of skiboot that
> implement the OPAL_REBOOT_FULL_IPL reboot type, this will force an IPL. On
> versions that do not, print an error message on reboot and proceed with a
> regular reboot (which could be a full IPL or a fast reboot).
>
> Cc: Stewart Smith <stew...@linux.vnet.ibm.com>
> Cc: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Signed-off-by: Andrew Donnellan <andrew.donnel...@au1.ibm.com>
>
> ---
>
> Corresponding skiboot patch: http://patchwork.ozlabs.org/patch/697601/

FWIW I've just merged the skiboot patch.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH RFC] powerpc/powernv: sysfs entry to force full IPL reboot

2016-12-22 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Andrew Donnellan <andrew.donnel...@au1.ibm.com> writes:
>
>> On 23/11/16 12:37, Andrew Donnellan wrote:
>>>> There's existing logic in kernel/reboot.c to handle a reboot= command
>>>> line parameter, which can set the reboot_mode, so my preference would be
>>>> that we use that.
>>>>
>>>> Currently we completely ignore the reboot_mode, so there's no backward
>>>> compatibility issue.
>>>>
>>>> It looks like the default is REBOOT_COLD, whatever that means. So I
>>>> think we could define that on powernv REBOOT_HARD means "do a full IPL".
>>>
>>> Sounds good.
>>
>> Thinking about this more - you can't change that at runtime. :(
>
> Yeah. Is that a problem? From my POV this is basically a "my firmware is
> broken" workaround, so you're going to want to set it permanently until
> you update your firmware.

The only real world application I can see wanting to do that is perhaps
HTX - where it chooses what kind of bootme run it wants. So, well, for
certain definitions of "real world".

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 2/2] powernv: Pass PSSCR value and mask to power9_idle_stop

2016-10-11 Thread Stewart Smith

Gautham R Shenoy <e...@linux.vnet.ibm.com> writes:
> On Tue, Oct 04, 2016 at 10:33:27PM +1100, Balbir Singh wrote:
>> 
>> 
>> On 04/10/16 21:32, Michael Ellerman wrote:
>> > "Gautham R. Shenoy" <e...@linux.vnet.ibm.com> writes:
>> > 
>> >> From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
>> >>
>> >> The power9_idle_stop method currently takes only the requested stop
>> >> level as a parameter and picks up the rest of the PSSCR bits from a
>> >> hand-coded macro. This is not a very flexible design, especially when
>> >> the firmware has the capability to communicate the psscr value and the
>> >> mask associated with a particular stop state via device tree.
>> >>
>> >> This patch modifies the power9_idle_stop API to take as parameters the
>> >> PSSCR value and the PSSCR mask corresponding to the stop state that
>> >> needs to be set. These PSSCR value and mask are respectively obtained
>> >> by parsing the "ibm,cpu-idle-state-psscr" and
>> >> "ibm,cpu-idle-state-psscr-mask" fields from the device tree.
>> >>
>> >> In addition to this, the patch adds support for handling stop states
>> >> for which ESL and EC bits in the PSSCR are zero. As per the
>> >> architecture, a wakeup from these stop states resumes execution from
>> >> the subsequent instruction as opposed to waking up at the System
>> >> Vector.
>> > 
>> > That looks good.
>> > 
>> >> This patch depends on the following skiboot patch that exports the
>> >> PSSCR values and the mask for all the stop states:
>> >> https://lists.ozlabs.org/pipermail/skiboot/2016-September/004869.html
>> > 
>> > But we can't depend on a skiboot patch. The kernel has to cope with
>> > running on an old skiboot.
>>
>
> Hmm.. We can still do that. The older skiboot only provides the RL
> field of the PSSCR value for each stop state and the corresponding
> PSSCR mask is set to 0xF in the older skiboot for all the stop states.
>
> We can insist that the future skiboot sets the ESL, EC, PSLL, TR, MTL
> and the the RL fields of the PSSCR for any exported stop state. This
> should be reflected in the psscr_mask of that stop state.  Thus, the
> psscr_mask of any stop state proposed in the future will have:
> (PSSCR_ESL_MASK | PSCCR_EC_MASK | PSCCR_PSLL_MASK | PSSCR_TR_MASK |
> PSSCR_MTL_MASK | PSSCR_RL_MASK) bits set in the skiboot.
>
> To handle the older firmware, we can do something like the following
> during the discovery of the stop states to mimic the behaviour present
> in the 4.8 kernel running on older firmware.
>
> === drivers/cpuidle/cpuidle-powernv.c ===
> /*
>  * By default we set the ESL and EC bits in the PSSCR.
>  * The MTL and PSLL are set to the maximum value possible as per the
>  * ISA, i.e 15.
>  * The Transition Rate is set to the Maximum value 3.
>  */
> #define DEFAULT_PSSCR_VAL  PSSCR_ESL_MASK |   \
>  PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>  PSSCR_TR_MASK | PSSCR_MTL_MASK
>
> #define DEFAULT_PSSCR_MASK PSSCR_ESL_MASK |   \
>  PSCCR_EC_MASK | PSCCR_PSLL_MASK |\
>  PSSCR_TR_MASK | PSSCR_MTL_MASK | \
>  PSSCR_RL_MASK
>
>
> static int powernv_add_idle_states(void)
> {
>   .
>   .
>   .
>   for (i = 0; i < dt_idle_states; i++) {
>   u64 val, mask;
>   .   
>   .
>   .
>   val = (DEFAULT_PSSCR_VAL & ~psscr_mask[i]) | psscr_val[i];
>   mask = DEFAULT_PSSCR_MASK | psscr_mask[i];
>   stop_psscr_table[nr_idle_states].val = val;
>   stop_psscr_table[nr_idle_states].mask = mask;
>   }
> } 
> 
>
>
> Is this approach ok ?

What if we just treat the 0xF state from firmware as special and set it
to DEFAULT_PSSCR_MASK in that case? That deals with old skiboot, new
kernel, and sets a pretty small special case that's easy to track into
the future as something we should watch out for.

Additionally, if we make skiboot set sane values in ~DEFAULT_PSSCR_MASK
for valid fields in PSSCR on boot/(also kexec?), then
we should end up in a situation where everything works with everything
(even if you don't get the best power saving). Specifically, new
skiboot, old kernel... but it looks like there's nothing currently
missing there

Should this patch also have Fixes: 3005c597ba4 and CC to stable?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH] powernv: Search for new flash DT node location

2016-09-26 Thread Stewart Smith

Michael Ellerman <m...@ellerman.id.au> writes:
> Jack Miller <j...@codezen.org> writes:
>
>> On Wed, Aug 03, 2016 at 05:16:34PM +1000, Michael Ellerman wrote:
>>> We could instead just search for all nodes that are compatible with
>>> "ibm,opal-flash". We do that for i2c, see opal_i2c_create_devs().
>>> 
>>> Is there a particular reason not to do that?
>>
>> I'm actually surprised that this is preferred. Jeremy mentioned something
>> similar, but I guess I just don't like the idea of finding devices in weird
>> places in the tree.
>
> But where is "weird". Arguably "/opal/flash" is weird. What does it
> mean? There's a bus called "opal" and a device on it called "flash"? No.
>
> Point being the structure is fairly arbitrary, or at least debatable, so
> tying the code 100% to the structure is inflexible. As we have discovered.
>
> Our other option is to tell skiboot to get stuffed, and leave the flash
> node where it was on P8.
>
>> Then again, if we can't trust the DT we're in bigger
>> trouble than erroneous flash nodes =).
>
> Quite :)
>
>> If we really just want to find compatible nodes anywhere, let's simplify i2c
>> and pdev_init into one function and make that behavior consistent with this
>> new patch.
>
> That seems OK to me.
>
> We should get an ack from Stewart though for the other node types.

For finding nodes based on compatible no matter where they are in the tree,

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>

(and yes, includes other nodes too)

The exact location then isn't too important, and having a /flash that's
ibm,opal-flash and allows for some other driver to bind to it I think is
also something we shouldn't rule out.

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH 1/4] dt-bindings: add doc for ibm,hotplug-aperture

2016-08-10 Thread Stewart Smith

Balbir Singh <bsinghar...@gmail.com> writes:
> On 09/08/16 04:27, Reza Arbab wrote:
>> Signed-off-by: Reza Arbab <ar...@linux.vnet.ibm.com>
>> ---
>>  .../bindings/powerpc/opal/hotplug-aperture.txt | 26 
>> ++
>>  1 file changed, 26 insertions(+)
>>  create mode 100644
>> Documentation/devicetree/bindings/powerpc/opal/hotplug-aperture.txt

Forgive me for being absent on the whole discussion here, but is this an
OPAL specific binding? If so, shouldn't the docs also appear in the
skiboot tree?

-- 
Stewart Smith
OPAL Architect, IBM.

Re: [PATCH v2 3/3] powernv: Fix MCE handler to avoid trashing CR0/CR1 registers.

2016-08-03 Thread Stewart Smith

Mahesh J Salgaonkar <mah...@linux.vnet.ibm.com> writes:
> From: Mahesh Salgaonkar <mah...@linux.vnet.ibm.com>
>
> The current implementation of MCE early handling modifies CR0/1 registers
> without saving its old values. Fix this by moving early check for
> powersaving mode to machine_check_handle_early().

>From (internal bug report) it seems as though in a test where one
injects continuous SLB Multi Hit errors, this bug could lead to rebooting
"due to to Platform error" rather than continuing to recover
successfully. It might be a good idea to mention that in commit message
here.

Also, should this go to stable?

-- 
Stewart Smith
OPAL Architect, IBM.

[PATCH] rtc-opal: Fix handling of firmware error codes, prevent busy loops

2016-08-01 Thread Stewart Smith

According to the OPAL docs:
https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-read-3.txt
https://github.com/open-power/skiboot/blob/skiboot-5.2.5/doc/opal-api/opal-rtc-write-4.txt
OPAL_HARDWARE may be returned from OPAL_RTC_READ or OPAL_RTC_WRITE and this
indicates either a transient or permanent error.

Prior to this patch, Linux was not dealing with OPAL_HARDWARE being a
permanent error particularly well, in that you could end up in a busy
loop.

This was not too hard to trigger on an AMI BMC based OpenPOWER machine
doing a continuous "ipmitool mc reset cold" to the BMC, the result of
that being that we'd get stuck in an infinite loop in opal_get_rtc_time.

We now retry a few times before returning the error higher up the stack.

Cc: sta...@vger.kernel.org
Signed-off-by: Stewart Smith <stew...@linux.vnet.ibm.com>
---
 drivers/rtc/rtc-opal.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/rtc/rtc-opal.c b/drivers/rtc/rtc-opal.c
index 9c18d6fd8107..fab19e3e2fba 100644
--- a/drivers/rtc/rtc-opal.c
+++ b/drivers/rtc/rtc-opal.c
@@ -58,6 +58,7 @@ static void tm_to_opal(struct rtc_time *tm, u32 *y_m_d, u64 
*h_m_s_ms)
 static int opal_get_rtc_time(struct device *dev, struct rtc_time *tm)
 {
long rc = OPAL_BUSY;
+   int retries = 10;
u32 y_m_d;
u64 h_m_s_ms;
__be32 __y_m_d;
@@ -67,8 +68,11 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
rc = opal_rtc_read(&__y_m_d, &__h_m_s_ms);
if (rc == OPAL_BUSY_EVENT)
opal_poll_events(NULL);
-   else
+   else if (retries-- && (rc == OPAL_HARDWARE
+  || rc == OPAL_INTERNAL_ERROR))
msleep(10);
+   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
+   break;
}
 
if (rc != OPAL_SUCCESS)
@@ -84,6 +88,7 @@ static int opal_get_rtc_time(struct device *dev, struct 
rtc_time *tm)
 static int opal_set_rtc_time(struct device *dev, struct rtc_time *tm)
 {
long rc = OPAL_BUSY;
+   int retries = 10;
u32 y_m_d = 0;
u64 h_m_s_ms = 0;
 
@@ -92,8 +97,11 @@ static int opal_set_rtc_time(struct device *dev, struct 
rtc_time *tm)
rc = opal_rtc_write(y_m_d, h_m_s_ms);
if (rc == OPAL_BUSY_EVENT)
opal_poll_events(NULL);
-   else
+   else if (retries-- && (rc == OPAL_HARDWARE
+  || rc == OPAL_INTERNAL_ERROR))
msleep(10);
+   else if (rc != OPAL_BUSY && rc != OPAL_BUSY_EVENT)
+   break;
}
 
return rc == OPAL_SUCCESS ? 0 : -EIO;
-- 
2.1.4

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-13 Thread Stewart Smith

Arnd Bergmann <a...@arndb.de> writes:
> On Wednesday, July 13, 2016 10:36:14 AM CEST Dave Young wrote:
>> On 07/12/16 at 03:50pm, Mark Rutland wrote:
>> > On Tue, Jul 12, 2016 at 04:24:10PM +0200, Arnd Bergmann wrote:
>> > > On Tuesday, July 12, 2016 10:18:11 AM CEST Vivek Goyal wrote:
>> > 
>> > /proc/devicetree (aka /sys/firmware/devicetree) is a filesystem derived
>> > from the raw DTB (which is exposed at /sys/firmware/fdt).
>> > 
>> > The blob that was handed to the kernel at boot time is exposed at
>> > /sys/firmware/fdt.
>> 
>> I believe the blob can be read and passed to kexec kernel in kernel code 
>> without
>> the extra fd.
>> 
>> But consider we can kexec to a different kernel and a different initrd so 
>> there
>> will be use cases to pass a total different dtb as well. From my 
>> understanding
>> it is reasonable but yes I think we should think carefully about the design.
>
> Ok, I can see four interesting use cases here:
>
> - Using the dtb that the kernel has saved at boot time. Ideally this should 
> not
>   require an additional step of signing it, since the running kernel already
>   trusts it.

- using current view of the hardware, flattened into a new dtb.
  This should already be trusted, as it's what we're running now (boot +
  runtime changes)

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-13 Thread Stewart Smith

Ard Biesheuvel <ard.biesheu...@linaro.org> writes:
> On 13 July 2016 at 09:36, Russell King - ARM Linux
> <li...@armlinux.org.uk> wrote:
>> On Wed, Jul 13, 2016 at 02:59:51PM +1000, Stewart Smith wrote:
>>> Russell King - ARM Linux <li...@armlinux.org.uk> writes:
>>> > On Tue, Jul 12, 2016 at 10:58:05PM +0200, Petr Tesarik wrote:
>>> >> I'm not an expert on DTB, so I can't provide an example of code
>>> >> execution, but you have already mentioned the /chosen/linux,stdout-path
>>> >> property. If an attacker redirects the bootloader to an insecure
>>> >> console, they may get access to the system that would otherwise be
>>> >> impossible.
>>> >
>>> > I fail to see how kexec connects with the boot loader - the DTB image
>>> > that's being talked about is one which is passed from the currently
>>> > running kernel to the to-be-kexec'd kernel.  For ARM (and I suspect
>>> > also ARM64) that's a direct call chain which doesn't involve any
>>> > boot loader or firmware, and certainly none that would involve the
>>> > passed DTB image.
>>>
>>> For OpenPOWER machines, kexec is the bootloader. Our bootloader is a
>>> linux kernel and initramfs with a UI (petitboot) - this means we never
>>> have to write a device driver twice: write a kernel one and you're done
>>> (for booting from the device and using it in your OS).
>>
>> I think you misunderstood my point.
>>
>> On ARM, we do not go:
>>
>> kernel (kexec'd from) -> boot loader -> kernel (kexec'd to)
>>
>> but we go:
>>
>> kernel (kexec'd from) -> kernel (kexec'd to)
>>
>> There's no intermediate step involving any bootloader.
>>
>> Hence, my point is that the dtb loaded by kexec is _only_ used by the
>> kernel which is being kexec'd to, not by the bootloader, nor indeed
>> the kernel which it is loaded into.
>>
>> Moreover, if you read the bit that I quoted (which is what I was
>> replying to), you'll notice that it is talking about the DTB loaded
>> by kexec somehow causing the _bootloader_ to be redirected to an
>> alternative console.  This point is wholely false on ARM.
>>
>
> The particular example may not apply, but the argument that the DTB
> -as a description of the hardware topology- needs to be signed if the
> kernel is also signed is valid. We do the same in the UEFI stub, i.e.,
> it normally takes a dtb= argument to allow the DTB to be overridden,
> but this feature is disabled when Secure Boot is in effect. By the
> same reasoning, if any kind of kexec kernel image validation is in
> effect, we should either validate the DTB image as well, or disallow
> external DTBs and only perform kexec with the kernel's current DTB
> (the blob it was booted with, not the unflattened data structure)

DTB booted with != current description of hardware

We could have had: PCI hotplug, CPU/memory/cache offlined due to
hardware error, change in available pstates / CPU frequencies.

There is merit in having a signed dtb if you're booting a signed kernel
in a secure boot scenario. However, we still need to set up /chosen/ and
we still need a way to do something like the offb hack.

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-13 Thread Stewart Smith

Russell King - ARM Linux <li...@armlinux.org.uk> writes:
> On Wed, Jul 13, 2016 at 02:59:51PM +1000, Stewart Smith wrote:
>> Russell King - ARM Linux <li...@armlinux.org.uk> writes:
>> > On Tue, Jul 12, 2016 at 10:58:05PM +0200, Petr Tesarik wrote:
>> >> I'm not an expert on DTB, so I can't provide an example of code
>> >> execution, but you have already mentioned the /chosen/linux,stdout-path
>> >> property. If an attacker redirects the bootloader to an insecure
>> >> console, they may get access to the system that would otherwise be
>> >> impossible.
>> >
>> > I fail to see how kexec connects with the boot loader - the DTB image
>> > that's being talked about is one which is passed from the currently
>> > running kernel to the to-be-kexec'd kernel.  For ARM (and I suspect
>> > also ARM64) that's a direct call chain which doesn't involve any
>> > boot loader or firmware, and certainly none that would involve the
>> > passed DTB image.
>> 
>> For OpenPOWER machines, kexec is the bootloader. Our bootloader is a
>> linux kernel and initramfs with a UI (petitboot) - this means we never
>> have to write a device driver twice: write a kernel one and you're done
>> (for booting from the device and using it in your OS).
>
> I think you misunderstood my point.
>
> On ARM, we do not go:
>
>   kernel (kexec'd from) -> boot loader -> kernel (kexec'd to)
>
> but we go:
>
>   kernel (kexec'd from) -> kernel (kexec'd to)
>
> There's no intermediate step involving any bootloader.
>
> Hence, my point is that the dtb loaded by kexec is _only_ used by the
> kernel which is being kexec'd to, not by the bootloader, nor indeed
> the kernel which it is loaded into.
>
> Moreover, if you read the bit that I quoted (which is what I was
> replying to), you'll notice that it is talking about the DTB loaded
> by kexec somehow causing the _bootloader_ to be redirected to an
> alternative console.  This point is wholely false on ARM.

Ahh.. I missed the bootloader bit there.

In which case, we're the same on OpenPOWER, there is no intermediate
bootloader - in our case we have linux (with kexec) taking on what uboot
or grub is typically used for on other platforms.


-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Stewart Smith

Russell King - ARM Linux <li...@armlinux.org.uk> writes:
> On Tue, Jul 12, 2016 at 10:58:05PM +0200, Petr Tesarik wrote:
>> I'm not an expert on DTB, so I can't provide an example of code
>> execution, but you have already mentioned the /chosen/linux,stdout-path
>> property. If an attacker redirects the bootloader to an insecure
>> console, they may get access to the system that would otherwise be
>> impossible.
>
> I fail to see how kexec connects with the boot loader - the DTB image
> that's being talked about is one which is passed from the currently
> running kernel to the to-be-kexec'd kernel.  For ARM (and I suspect
> also ARM64) that's a direct call chain which doesn't involve any
> boot loader or firmware, and certainly none that would involve the
> passed DTB image.

For OpenPOWER machines, kexec is the bootloader. Our bootloader is a
linux kernel and initramfs with a UI (petitboot) - this means we never
have to write a device driver twice: write a kernel one and you're done
(for booting from the device and using it in your OS).

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 1/2] crypto: vmx - Adding asm subroutines for XTS

2016-07-12 Thread Stewart Smith

Stephen Rothwell <s...@canb.auug.org.au> writes:
> On Mon, 11 Jul 2016 16:07:39 -0300 Paulo Flabiano Smorigo 
> <pfsmor...@linux.vnet.ibm.com> wrote:
>>
>> diff --git a/drivers/crypto/vmx/aesp8-ppc.pl 
>> b/drivers/crypto/vmx/aesp8-ppc.pl
>> index 2280539..813ffcc 100644
>> --- a/drivers/crypto/vmx/aesp8-ppc.pl
>> +++ b/drivers/crypto/vmx/aesp8-ppc.pl
>> @@ -1,4 +1,11 @@
>> -#!/usr/bin/env perl
>> +#! /usr/bin/env perl
>> +# Copyright 2014-2016 The OpenSSL Project Authors. All Rights Reserved.
>> +#
>> +# Licensed under the OpenSSL license (the "License").  You may not use
>> +# this file except in compliance with the License.  You can obtain a copy
>> +# in the file LICENSE in the source distribution or at
>> +# https://www.openssl.org/source/license.html
>
> So, I assume that this license is compatible with the GPLv2?

https://people.gnome.org/~markmc/openssl-and-the-gpl.html has an
explanation and points to:
https://www.openssl.org/docs/faq.html#LEGAL2

which makes it anything but clearer.

it appears the answer is "probably not, unless you have an explicit
exemption in your license"

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Stewart Smith

Vivek Goyal <vgo...@redhat.com> writes:
> On Tue, Jul 12, 2016 at 10:58:09AM -0300, Thiago Jung Bauermann wrote:
>> Hello Eric,
>> 
>> Am Dienstag, 12 Juli 2016, 08:25:48 schrieb Eric W. Biederman:
>> > AKASHI Takahiro <takahiro.aka...@linaro.org> writes:
>> > > Device tree blob must be passed to a second kernel on DTB-capable
>> > > archs, like powerpc and arm64, but the current kernel interface
>> > > lacks this support.
>> > > 
>> > > This patch extends kexec_file_load system call by adding an extra
>> > > argument to this syscall so that an arbitrary number of file descriptors
>> > > can be handed out from user space to the kernel.
>> > > 
>> > > See the background [1].
>> > > 
>> > > Please note that the new interface looks quite similar to the current
>> > > system call, but that it won't always mean that it provides the "binary
>> > > compatibility."
>> > > 
>> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
>> > 
>> > So this design is wrong.  The kernel already has the device tree blob,
>> > you should not be extracting it from the kernel munging it, and then
>> > reinserting it in the kernel if you want signatures and everything to
>> > pass.
>> > 
>> > What x86 does is pass it's equivalent of the device tree blob from one
>> > kernel to another directly and behind the scenes.  It does not go
>> > through userspace for this.
>> > 
>> > Until a persuasive case can be made for going around the kernel and
>> > probably adding a feature (like code execution) that can be used to
>> > defeat the signature scheme I am going to nack this.
>> 
>> There are situations where userspace needs to change things in the device 
>> tree to be used by the next kernel.
>> 
>> For example, Petitboot (the boot loader used in OpenPOWER machines) is a 
>> userspace application running in an intermediary Linux instance and uses 
>> kexec to load the target OS. It has to modify the device tree that will be 
>> used by the next kernel so that the next kernel uses the same console that 
>> petitboot was configured to use (i.e., set the /chosen/linux,stdout-path 
>> property). It also modifies the device tree to allow the kernel to inherit 
>> Petitboot's Openfirmware framebuffer.
>
> Can some of this be done with the help of kernel command line options for
> second kernel?

how would this be any more secure?

Passing in an address for a framebuffer via command line option means
you could scribble over any bit of memory, which is the same kind of
damage you could do by modifying the device tree.

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [RFC 0/3] extend kexec_file_load system call

2016-07-12 Thread Stewart Smith

Petr Tesarik <ptesa...@suse.cz> writes:
> On Tue, 12 Jul 2016 13:25:11 -0300
> Thiago Jung Bauermann <bauer...@linux.vnet.ibm.com> wrote:
>
>> Hi Eric,
>> 
>> I'm trying to understand your concerns leading to your nack. I hope you 
>> don't mind expanding your thoughts on them a bit.
>> 
>> Am Dienstag, 12 Juli 2016, 08:25:48 schrieb Eric W. Biederman:
>> > AKASHI Takahiro <takahiro.aka...@linaro.org> writes:
>> > > Device tree blob must be passed to a second kernel on DTB-capable
>> > > archs, like powerpc and arm64, but the current kernel interface
>> > > lacks this support.
>> > > 
>> > > This patch extends kexec_file_load system call by adding an extra
>> > > argument to this syscall so that an arbitrary number of file descriptors
>> > > can be handed out from user space to the kernel.
>> > > 
>> > > See the background [1].
>> > > 
>> > > Please note that the new interface looks quite similar to the current
>> > > system call, but that it won't always mean that it provides the "binary
>> > > compatibility."
>> > > 
>> > > [1] http://lists.infradead.org/pipermail/kexec/2016-June/016276.html
>> > 
>> > So this design is wrong.  The kernel already has the device tree blob,
>> > you should not be extracting it from the kernel munging it, and then
>> > reinserting it in the kernel if you want signatures and everything to
>> > pass.
>> 
>> I don't understand how the kernel signature will be invalidated. 
>> 
>> There are some types of boot images that can embed a device tree blob in 
>> them, but the kernel can also be handed a separate device tree blob from 
>> firmware, the boot loader, or kexec. This latter case is what we are 
>> discussing, so we are not talking about modifying an embedded blob in the 
>> kernel image.
>> 
>> > What x86 does is pass it's equivalent of the device tree blob from one
>> > kernel to another directly and behind the scenes.  It does not go
>> > through userspace for this.
>> > 
>> > Until a persuasive case can be made for going around the kernel and
>> > probably adding a feature (like code execution) that can be used to
>> > defeat the signature scheme I am going to nack this.
>> 
>> I also don't understand what you mean by code execution. How does passing a 
>> device tree blob via kexec enables code execution? How can the signature 
>> scheme be defeated?
>
> I'm not an expert on DTB, so I can't provide an example of code
> execution, but you have already mentioned the /chosen/linux,stdout-path
> property. If an attacker redirects the bootloader to an insecure
> console, they may get access to the system that would otherwise be
> impossible.

In this case, the user is sitting at the (or one of the) console(s) of
the machine. There could be petitboot UIs running on the VGA display,
IPMI serial over lan, local serial port. The logic behind setting
/chosen/linux,stdout-path is (currently) mostly to set it for the kernel
to what the user is interacting with. i.e. if you select an OS installer
to boot from the VGA console, you get a graphical installer running and
if you selected it from a text console, you get a text installer running
(on the appropriate console).

So the bootloader (petitboot) needs to work out which console is being
interacted with in order to set up /chosen/linux,stdout-path correctly.

This specific option could be passed as a kernel command line to the
next kernel, yes. However, isn't the kernel command line also an attack
vector? Is *every* command line option safe?

> In general, tampering with the hardware inventory of a machine opens up
> a security hole, and one must be very cautious which modifications are
> allowed. You're giving this power to an (unsigned, hence untrusted)
> userspace application; Eric argues that only the kernel should have
> this power.

In the case of petitboot on OpenPOWER, this (will) be a signed and
trusted kernel and userspace and verified by a previous bit of firmware.

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 01/14] powerpc/powernv: Add XICS emulation APIs

2016-07-12 Thread Stewart Smith

Benjamin Herrenschmidt <b...@kernel.crashing.org> writes:
> OPAL provides an emulated XICS interrupt controller to
> use as a fallback on newer processors that don't have a
> XICS. It's meant as a way to provide backward compatibility
> with future processors. Add the corresponding interfaces.
>
> Signed-off-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>

These OPAL calls have the correct OPAL API numbers and they've been
assigned upstream in skiboot.

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V5 2/3] powerpc/opal: Add inline function to get rc from an ASYNC_COMP opal_msg

2016-06-29 Thread Stewart Smith

Suraj Jitindar Singh <sjitindarsi...@gmail.com> writes:
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -276,6 +276,14 @@ extern int opal_error_code(int rc);
>  
>  ssize_t opal_msglog_copy(char *to, loff_t pos, size_t count);
>  
> +static inline int opal_get_async_rc(struct opal_msg msg)
> +{
> + if (msg.msg_type != OPAL_MSG_ASYNC_COMP)
> + return OPAL_PARAMETER;

Should instead be WARN_ON or BUG_ON ? Is there *ever* a situation where
calling opal_get_async_rc on a non-OPAL_MSG_ASYNC_COMP is a not both a
dumb idea and a bug?

otherwise (including if above change is made)
Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>


-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2] powerpc/boot: Add OPAL console to epapr wrappers

2016-06-26 Thread Stewart Smith

Oliver O'Halloran <ooh...@gmail.com> writes:
> This patch adds an OPAL console backend to the powerpc boot wrapper so
> that decompression failures inside the wrapper can be reported to the
> user. This is important since it typically indicates data corruption in
> the firmware and other nasty things.
>
> Currently this only works when building a little endian kernel. When
> compiling a 64 bit BE kernel the wrapper is always build 32 bit to be
> compatible with some 32 bit firmwares. BE support will be added at a
> later date. Another limitation of this is that only the "raw" type of
> OPAL console is supported, however machines that provide a hvsi console
> also provide a raw console so this is not an issue in practice.
>
> Actually-written-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
> Signed-off-by: Oliver O'Halloran <ooh...@gmail.com>
> Cc: Stewart Smith <stew...@linux.vnet.ibm.com>
> Cc: sta...@vger.kernel.org

(with addition of (C) headers suggested below)

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>
Tested-by: Stewart Smith <stew...@linux.vnet.ibm.com>

One thing to think of is if we really need anything printed except the
error message. Out of these three lines of output (for a corrupted
zImage.epapr), only the last line gives any real information.

5343492: (5343490): zImage starting: loaded at 0x2001 (sp: 
0x20ee2ed8)
5507562: (5507560): Allocating 0x159efd4 bytes for kernel ...
5557104: (5557102): gunzipping (0x <- 
0x2001e000:0x20ee05c0)...inflate returned -3 msg: invalid block 
type

Although this should likely be addressed in another patch.

> ---
>  arch/powerpc/boot/Makefile |  4 +-
>  arch/powerpc/boot/opal-calls.S | 49 +++
>  arch/powerpc/boot/opal.c   | 88 
> ++
>  arch/powerpc/boot/ops.h|  1 +
>  arch/powerpc/boot/ppc_asm.h|  4 ++
>  arch/powerpc/boot/serial.c |  2 +
>  arch/powerpc/boot/types.h  | 12 ++
>  7 files changed, 158 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/boot/opal-calls.S
>  create mode 100644 arch/powerpc/boot/opal.c
>
> diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
> index 8fe78a3efc92..00cf88aa9a23 100644
> --- a/arch/powerpc/boot/Makefile
> +++ b/arch/powerpc/boot/Makefile
> @@ -70,7 +70,7 @@ $(addprefix $(obj)/,$(zlib) cuboot-c2k.o gunzip_util.o 
> main.o): \
>  libfdt   := fdt.c fdt_ro.c fdt_wip.c fdt_sw.c fdt_rw.c fdt_strerror.c
>  libfdtheader := fdt.h libfdt.h libfdt_internal.h
>
> -$(addprefix $(obj)/,$(libfdt) libfdt-wrapper.o simpleboot.o epapr.o): \
> +$(addprefix $(obj)/,$(libfdt) libfdt-wrapper.o simpleboot.o epapr.o opal.o): 
> \
>   $(addprefix $(obj)/,$(libfdtheader))
>
>  src-wlib-y := string.S crt0.S crtsavres.S stdio.c main.c \
> @@ -78,7 +78,7 @@ src-wlib-y := string.S crt0.S crtsavres.S stdio.c main.c \
>   ns16550.c serial.c simple_alloc.c div64.S util.S \
>   gunzip_util.c elf_util.c $(zlib) devtree.c stdlib.c \
>   oflib.c ofconsole.c cuboot.c mpsc.c cpm-serial.c \
> - uartlite.c mpc52xx-psc.c
> + uartlite.c mpc52xx-psc.c opal.c opal-calls.S
>  src-wlib-$(CONFIG_40x) += 4xx.c planetcore.c
>  src-wlib-$(CONFIG_44x) += 4xx.c ebony.c bamboo.c
>  src-wlib-$(CONFIG_8xx) += mpc8xx.c planetcore.c fsl-soc.c
> diff --git a/arch/powerpc/boot/opal-calls.S b/arch/powerpc/boot/opal-calls.S
> new file mode 100644
> index ..1f3c097e1552
> --- /dev/null
> +++ b/arch/powerpc/boot/opal-calls.S
> @@ -0,0 +1,49 @@

Needs (C) header.

> +#include "ppc_asm.h"
> +#include "../include/asm/opal-api.h"
> +
> + .text
> +
> +#define OPAL_CALL(name, token)   \
> + .globl name;\
> +name:\
> + li  r0, token;  \
> + b   opal_call;
> +
> +opal_call:
> + mflrr11
> + std r11,16(r1)
> + mfcrr12
> + stw r12,8(r1)
> + mr  r13,r2
> +
> + /* Set opal return address */
> + ld  r11,opal_return@got(r2)
> + mtlrr11
> + mfmsr   r12
> +
> + /* switch to BE when we enter OPAL */
> + li  r11,MSR_LE
> + andcr12,r12,r11
> + mtspr   SPRN_HSRR1,r12
> +
> + /* load the opal call entry point and base */
> + ld  r11,opal@got(r2)
> + ld  r12,8(r11)
> + ld  r2,0(r11)
> + mtspr   SPRN_HSRR0,r12
> + hrfid
> +
> +opal_return:
> + FIXUP_ENDIAN
> + mr  r

Re: [PATCH, RFC] cxl: Add support for CAPP DMA mode

2016-06-08 Thread Stewart Smith

Ian Munsie <imun...@au1.ibm.com> writes:

> From: Ian Munsie <imun...@au1.ibm.com>
>
> This adds support for using CAPP DMA mode, which is required for XSL
> based cards such as the Mellanox CX4 to function.
>
> This is currently an RFC as it depends on the corresponding support to
> be merged into skiboot first, which was submitted here:
> http://patchwork.ozlabs.org/patch/625582/

Merged to skiboot as of 5477148a439fda9fb55ea4a828c958fcdcc10f2e - which
will make it into skiboot 5.3.x

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH, RFC] cxl: Add support for CAPP DMA mode

2016-06-08 Thread Stewart Smith

Ian Munsie <imun...@au1.ibm.com> writes:
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c 
> b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 3a5ea82..5a42e98 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -2793,7 +2793,9 @@ int pnv_phb_to_cxl_mode(struct pci_dev *dev, uint64_t 
> mode)
>   pe_info(pe, "Switching PHB to CXL\n");
>  
>   rc = opal_pci_set_phb_cxl_mode(phb->opal_id, mode, pe->pe_number);
> - if (rc)
> + if (rc == OPAL_UNSUPPORTED)
> + dev_err(>dev, "Required cxl mode not supported by firmware 
> - update skiboot\n");
> + else if (rc)
>   dev_err(>dev, "opal_pci_set_phb_cxl_mode failed:
> %i\n", rc);

Could mention version required, which would be skiboot 5.3.x or higher.

This could be something we start doing - there's enough random bits of
functionality we could tell the user exactly what they have to upgrade
to to have work.

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: powerpc/nvram: Fix an incorrect partition merge

2016-06-06 Thread Stewart Smith

xinhui <xinhui@linux.vnet.ibm.com> writes:
>> Has it always been broken?
>>
>
> no. after nvram partition corruption hit, all nvram partitions will be
> erased and re-alloc after the second machine reboot.
> I don't know who does it but i guess it is the firmware. :)

It is. PAPR says we format NVRAM when it's corrupted. This is also true
for NVRAM for PowerNV (not just pseries guest).

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH 0030/1529] Fix typo

2016-05-22 Thread Stewart Smith

Andrea Gelmini <andrea.gelm...@gelma.net> writes:
> Signed-off-by: Andrea Gelmini <andrea.gelm...@gelma.net>
> ---
>  Documentation/devicetree/bindings/rtc/rtc-opal.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/Documentation/devicetree/bindings/rtc/rtc-opal.txt 
> b/Documentation/devicetree/bindings/rtc/rtc-opal.txt
> index a1734e5..2340938c 100644
> --- a/Documentation/devicetree/bindings/rtc/rtc-opal.txt
> +++ b/Documentation/devicetree/bindings/rtc/rtc-opal.txt
> @@ -2,7 +2,7 @@ IBM OPAL real-time clock
>  
>  
>  Required properties:
> -- comapatible: Should be "ibm,opal-rtc"
> +- compatible: Should be "ibm,opal-rtc"
>  
>  Optional properties:
>  - wakeup-source: Decides if the wakeup is supported or not

Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH V2 1/2] devicetree/bindings: Add binding for operator panel on FSP machines

2016-04-26 Thread Stewart Smith

Suraj Jitindar Singh <sjitindarsi...@gmail.com> writes:
> Add a binding to Documentation/devicetree/bindings/powerpc/opal
> (oppanel-opal.txt) for the operator panel which is present on IBM
> pseries machines with FSPs.

It's not pseries (as that implies PowerVM / PAPR) - while here we're all
about OPAL.

With a slight change to the commit message,
Acked-by: Stewart Smith <stew...@linux.vnet.ibm.com>


-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v2 2/2] cpufreq: powernv: Ramp-down global pstate slower than local-pstate

2016-04-20 Thread Stewart Smith

Akshay Adiga <akshay.ad...@linux.vnet.ibm.com> writes:
> Iozone results show fairly consistent performance boost.
> YCSB on redis shows improved Max latencies in most cases.

What about power consumption?

> Iozone write/rewite test were made with filesizes 200704Kb and 401408Kb
> with different record sizes . The following table shows IOoperations/sec
> with and without patch.

> Iozone Results ( in op/sec) ( mean over 3 iterations )

What's the variance between runs?

> Tested with YCSB workload (50% update + 50% read) over redis for 1 million
> records and 1 million operation. Each test was carried out with target
> operations per second and persistence disabled.
>
> Max-latency (in us)( mean over 5 iterations )

What's the variance between runs?

std dev? 95th percentile?

> ---
> op/sOperation   with patch  without patch   %change
> ---
> 15000   Read61480.6 50261.4 22.32

This seems fairly significant regression. Any idea why at 15K op/s
there's such a regression?

> --- a/drivers/cpufreq/powernv-cpufreq.c
> +++ b/drivers/cpufreq/powernv-cpufreq.c
[ 15 more citation lines. Click/Enter to show. ]
> @@ -36,12 +36,56 @@
>  #include 
>  #include  /* Required for cpu_sibling_mask() in UP configs */
>  #include 
> +#include 
>  
>  #define POWERNV_MAX_PSTATES  256
>  #define PMSR_PSAFE_ENABLE(1UL << 30)
>  #define PMSR_SPR_EM_DISABLE  (1UL << 31)
>  #define PMSR_MAX(x)  ((x >> 32) & 0xFF)
>  
> +#define MAX_RAMP_DOWN_TIME   5120
> +/*
> + * On an idle system we want the global pstate to ramp-down from max value
> to
> + * min over a span of ~5 secs. Also we want it to initially ramp-down
> slowly and
> + * then ramp-down rapidly later on.

Where does 5 seconds come from?

Why 5 and not 10, or not 2? Is there some time period inherit in
hardware or software that this is computed from?

> +/* Interval after which the timer is queued to bring down global pstate */
> +#define GPSTATE_TIMER_INTERVAL   2000

in ms?

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH] tty/hvc: Use IRQF_SHARED for hvc consoles

2016-03-22 Thread Stewart Smith

Samuel Mendoza-Jonas <s...@mendozajonas.com> writes:
> Commit 2def86a7200c
> ("hvc: Convert to using interrupts instead of opal events")
> enabled the use of interrupts in the hvc_driver for OPAL platforms.
> However on machines with more than one hvc console, any console after
> the first will fail to register an interrupt handler in
> notifier_add_irq() since all consoles share the same IRQ number but do
> not set the IRQF_SHARED flag:
>
> [   51.179907] genirq: Flags mismatch irq 31.  (hvc_console) vs.
>  (hvc_console)
> [   51.180010] hvc_open: request_irq failed with rc -16.
>
> This error propagates up to hvc_open() and the console is closed, but
> OPAL will still generate interrupts that are not handled, leading to
> rcu_sched stall warnings.
>
> Set IRQF_SHARED when calling request_irq, allowing additional consoles
> to start properly.
>
> Signed-off-by: Samuel Mendoza-Jonas <s...@mendozajonas.com>
> Cc: <sta...@vger.kernel.org> # 4.1.x-

Tested on 4.4.6 - seemed to stop (some of) the problems I was having
when using it as a kernel for the bootloader on a FSP based POWER8
system.

Tested-by: Stewart Smith <stew...@linux.vnet.ibm.com>

-- 
Stewart Smith
OPAL Architect, IBM.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

1 2 3 >

1 - 100 of 250 matches

Mail list logo