Re: [RFC PATCH 10/43] powerpc/64s: Always set PMU control registers to frozen/disabled when not in use

2021-07-15 Thread Madhavan Srinivasan



On 7/14/21 6:09 PM, Nicholas Piggin wrote:

Excerpts from Nicholas Piggin's message of July 12, 2021 12:41 pm:

Excerpts from Athira Rajeev's message of July 10, 2021 12:50 pm:



On 22-Jun-2021, at 4:27 PM, Nicholas Piggin  wrote:

KVM PMU management code looks for particular frozen/disabled bits in
the PMU registers so it knows whether it must clear them when coming
out of a guest or not. Setting this up helps KVM make these optimisations
without getting confused. Longer term the better approach might be to
move guest/host PMU switching to the perf subsystem.
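
For illustration, the kind of optimisation this enables on the KVM side looks
roughly like the following (a sketch of the idea only, not code from this
series; the helper name is made up):

	/*
	 * With MMCR0_FC guaranteed as the "not in use" value, the guest
	 * entry/exit path can cheaply tell whether host PMU state needs
	 * to be saved or cleared at all.
	 */
	static bool host_pmu_in_use(void)
	{
		return !(mfspr(SPRN_MMCR0) & MMCR0_FC);
	}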

Signed-off-by: Nicholas Piggin 
---
arch/powerpc/kernel/cpu_setup_power.c | 4 ++--
arch/powerpc/kernel/dt_cpu_ftrs.c | 6 +++---
arch/powerpc/kvm/book3s_hv.c  | 5 +
arch/powerpc/perf/core-book3s.c   | 7 +++
4 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/cpu_setup_power.c 
b/arch/powerpc/kernel/cpu_setup_power.c
index a29dc8326622..3dc61e203f37 100644
--- a/arch/powerpc/kernel/cpu_setup_power.c
+++ b/arch/powerpc/kernel/cpu_setup_power.c
@@ -109,7 +109,7 @@ static void init_PMU_HV_ISA207(void)
static void init_PMU(void)
{
mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
}
@@ -123,7 +123,7 @@ static void init_PMU_ISA31(void)
{
mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
}

/*
diff --git a/arch/powerpc/kernel/dt_cpu_ftrs.c 
b/arch/powerpc/kernel/dt_cpu_ftrs.c
index 0a6b36b4bda8..06a089fbeaa7 100644
--- a/arch/powerpc/kernel/dt_cpu_ftrs.c
+++ b/arch/powerpc/kernel/dt_cpu_ftrs.c
@@ -353,7 +353,7 @@ static void init_pmu_power8(void)
}

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
mtspr(SPRN_MMCRS, 0);
@@ -392,7 +392,7 @@ static void init_pmu_power9(void)
mtspr(SPRN_MMCRC, 0);

mtspr(SPRN_MMCRA, 0);
-   mtspr(SPRN_MMCR0, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
mtspr(SPRN_MMCR1, 0);
mtspr(SPRN_MMCR2, 0);
}
@@ -428,7 +428,7 @@ static void init_pmu_power10(void)

mtspr(SPRN_MMCR3, 0);
mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
-   mtspr(SPRN_MMCR0, MMCR0_PMCCEXT);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
}

static int __init feat_enable_pmu_power10(struct dt_cpu_feature *f)
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1f30f98b09d1..f7349d150828 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2593,6 +2593,11 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
#endif
#endif
vcpu->arch.mmcr[0] = MMCR0_FC;
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   vcpu->arch.mmcr[0] |= MMCR0_PMCCEXT;
+   vcpu->arch.mmcra = MMCRA_BHRB_DISABLE;
+   }
+
vcpu->arch.ctrl = CTRL_RUNLATCH;
/* default to host PVR, since we can't spoof it */
kvmppc_set_pvr_hv(vcpu, mfspr(SPRN_PVR));
diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
index 51622411a7cc..e33b29ec1a65 100644
--- a/arch/powerpc/perf/core-book3s.c
+++ b/arch/powerpc/perf/core-book3s.c
@@ -1361,6 +1361,13 @@ static void power_pmu_enable(struct pmu *pmu)
goto out;

if (cpuhw->n_events == 0) {
+   if (cpu_has_feature(CPU_FTR_ARCH_31)) {
+   mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
+   mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);
+   } else {
+   mtspr(SPRN_MMCRA, 0);
+   mtspr(SPRN_MMCR0, MMCR0_FC);
+   }


Hi Nick,

We are already setting these bits in the power_pmu_disable() function, and disable
will be called before any event gets deleted/stopped. Can you please help us
understand why this is needed in the power_pmu_enable() path as well?

I'll have to go back and check what I needed it for.

Okay, MMCRA is getting MMCRA_SDAR_MODE_DCACHE set on POWER9, by the looks.

That's not necessarily a problem, but KVM sets MMCRA to 0 to disable
SDAR updates. So KVM and perf don't agree on what the "correct" value
for disabled is. Which could be a problem with POWER10 not setting BHRB
disable before my series.
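
To make the disagreement concrete, the two notions of "PMU not in use" are
roughly (a sketch based on the hunks above, not code from the series):

	/* perf, ISA v3.1, when no events are scheduled: */
	mtspr(SPRN_MMCRA, MMCRA_BHRB_DISABLE);
	mtspr(SPRN_MMCR0, MMCR0_FC | MMCR0_PMCCEXT);

	/* KVM's guest-exit path, to stop SDAR updates: */
	mtspr(SPRN_MMCRA, 0);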


Nick,

BHRB disable is only known to the P10 guest kernel; a P9 guest kernel does
not know about it. And for the P10 host and guest, I guess we set
BHRB_DISABLE in the cpu_setup code and also in power_pmu_disable().

So are you seeing the P10 guest having MMCRA as zero during context switch?

Maddy


I'll get rid of this hunk for now, I expect things won't be exactly clean
or consistent until the KVM host PMU code is moved into perf/ though.

Thanks,
Nick


[PATCH 2/2] KVM: PPC: Fix kvm_arch_vcpu_ioctl vcpu_load leak

2021-07-15 Thread Nicholas Piggin
vcpu_put is not called if the user copy fails. This can result in preempt
notifier corruption and crashes, among other issues.

Reported-by: Alexey Kardashevskiy 
Fixes: b3cebfe8c1ca ("KVM: PPC: Move vcpu_load/vcpu_put down to each ioctl case 
in kvm_arch_vcpu_ioctl")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/powerpc.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index be33b5321a76..b4e6f70b97b9 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -2048,9 +2048,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
{
struct kvm_enable_cap cap;
r = -EFAULT;
-   vcpu_load(vcpu);
if (copy_from_user(&cap, argp, sizeof(cap)))
goto out;
+   vcpu_load(vcpu);
r = kvm_vcpu_ioctl_enable_cap(vcpu, &cap);
vcpu_put(vcpu);
break;
@@ -2074,9 +2074,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
case KVM_DIRTY_TLB: {
struct kvm_dirty_tlb dirty;
r = -EFAULT;
-   vcpu_load(vcpu);
if (copy_from_user(&dirty, argp, sizeof(dirty)))
goto out;
+   vcpu_load(vcpu);
r = kvm_vcpu_ioctl_dirty_tlb(vcpu, &dirty);
vcpu_put(vcpu);
break;
-- 
2.23.0



[PATCH 1/2] KVM: PPC: Book3S: Fix CONFIG_TRANSACTIONAL_MEM=n crash

2021-07-15 Thread Nicholas Piggin
When running CPU_FTR_P9_TM_HV_ASSIST, HFSCR[TM] is set for the guest
even if the host has CONFIG_TRANSACTIONAL_MEM=n, which causes it to be
unprepared to handle guest exits while transactional.

Normal guests don't have a problem because the HTM capability will not
be advertised, but a rogue or buggy one could crash the host.

Reported-by: Alexey Kardashevskiy 
Fixes: 4bb3c7a0208f ("KVM: PPC: Book3S HV: Work around transactional memory 
bugs in POWER9")
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 1d1fcc290fca..085fb8ecbf68 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2697,8 +2697,10 @@ static int kvmppc_core_vcpu_create_hv(struct kvm_vcpu 
*vcpu)
HFSCR_DSCR | HFSCR_VECVSX | HFSCR_FP | HFSCR_PREFIX;
if (cpu_has_feature(CPU_FTR_HVMODE)) {
vcpu->arch.hfscr &= mfspr(SPRN_HFSCR);
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
if (cpu_has_feature(CPU_FTR_P9_TM_HV_ASSIST))
vcpu->arch.hfscr |= HFSCR_TM;
+#endif
}
if (cpu_has_feature(CPU_FTR_TM_COMP))
vcpu->arch.hfscr |= HFSCR_TM;
-- 
2.23.0



[powerpc:fixes-test] BUILD SUCCESS e44fbdb68049539de9923ce4bad2d277aef54892

2021-07-15 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
fixes-test
branch HEAD: e44fbdb68049539de9923ce4bad2d277aef54892  KVM: PPC: Book3S HV P9: 
Fix guest TM support

elapsed time: 731m

configs tested: 58
configs skipped: 106

The following configs have been built successfully.
More configs may be tested in the coming days.

gcc tested configs:
sh   se7705_defconfig
arc  axs101_defconfig
sh  rts7751r2d1_defconfig
arm  gemini_defconfig
powerpc asp8347_defconfig
riscvnommu_virt_defconfig
pariscgeneric-32bit_defconfig
mips mpc30x_defconfig
x86_64  defconfig
arm  pxa910_defconfig
xtensa  cadence_csp_defconfig
arm  tct_hammer_defconfig
arm   sama5_defconfig
mips cu1000-neo_defconfig
h8300   defconfig
arm  alldefconfig
x86_64allnoconfig
m68k allmodconfig
m68kdefconfig
m68k allyesconfig
nios2   defconfig
arc  allyesconfig
nds32 allnoconfig
nds32   defconfig
nios2allyesconfig
cskydefconfig
alpha   defconfig
alphaallyesconfig
xtensa   allyesconfig
h8300allyesconfig
arc defconfig
sh   allmodconfig
parisc  defconfig
s390 allyesconfig
s390 allmodconfig
parisc   allyesconfig
s390defconfig
powerpc  allyesconfig
powerpc  allmodconfig
powerpc   allnoconfig
i386 randconfig-a005-20210715
i386 randconfig-a006-20210715
i386 randconfig-a004-20210715
i386 randconfig-a001-20210715
i386 randconfig-a003-20210715
i386 randconfig-a002-20210715
x86_64   randconfig-a013-20210715
x86_64   randconfig-a012-20210715
x86_64   randconfig-a015-20210715
x86_64   randconfig-a014-20210715
x86_64   randconfig-a016-20210715
x86_64   randconfig-a011-20210715
i386 randconfig-a014-20210715
i386 randconfig-a015-20210715
i386 randconfig-a011-20210715
i386 randconfig-a013-20210715
i386 randconfig-a012-20210715
i386 randconfig-a016-20210715

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-...@lists.01.org


Re: [PATCH v2 4/4] bus: Make remove callback return void

2021-07-15 Thread Thomas Bogendoerfer
On Tue, Jul 06, 2021 at 05:48:03PM +0200, Uwe Kleine-König wrote:
> The driver core ignores the return value of this callback because there
> is only little it can do when a device disappears.
> 
> This is the final bit of a long lasting cleanup quest where several
> buses were converted to also return void from their remove callback.
> Additionally some resource leaks were fixed that were caused by drivers
> returning an error code in the expectation that the driver won't go
> away.
> 
> With struct bus_type::remove returning void, newly implemented buses can no
> longer return an error code that would be ignored, and so wrong expectations
> are not set up for driver authors.
> 
> Acked-by: Russell King (Oracle)  (For ARM, Amba 
> and related parts)
> Acked-by: Mark Brown 
> Acked-by: Chen-Yu Tsai  (for drivers/bus/sunxi-rsb.c)
> Acked-by: Pali Rohár 
> Acked-by: Mauro Carvalho Chehab  (for drivers/media)
> Acked-by: Hans de Goede  (For drivers/platform)
> Acked-by: Alexandre Belloni 
> Acked-By: Vinod Koul 
> Acked-by: Juergen Gross  (For Xen)
> Acked-by: Lee Jones  (For drivers/mfd)
> Acked-by: Johannes Thumshirn  (For drivers/mcb)
> Acked-by: Johan Hovold 
> Acked-by: Srinivas Kandagatla  (For 
> drivers/slimbus)
> Acked-by: Kirti Wankhede  (For drivers/vfio)
> Acked-by: Maximilian Luz 
> Acked-by: Heikki Krogerus  (For ulpi and 
> typec)
> Acked-by: Samuel Iglesias Gonsálvez  (For ipack)
> Reviewed-by: Tom Rix  (For fpga)
> Acked-by: Geoff Levand  (For ps3)
> Signed-off-by: Uwe Kleine-König 
> ---
> [...] 
>  arch/mips/sgi-ip22/ip22-gio.c | 3 +--

Acked-by: Thomas Bogendoerfer 

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.[ RFC1925, 2.3 ]


Re: [PATCH v3 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-15 Thread Fabiano Rosas
Pratik Sampat  writes:

> Hello,
>
> On 12/07/21 9:13 pm, Fabiano Rosas wrote:
>> "Pratik R. Sampat"  writes:
>>
>> Hi, have you seen Documentation/core-api/kobject.rst, particularly the
>> part that says:
>>
>> "When you see a sysfs directory full of other directories, generally each
>> of those directories corresponds to a kobject in the same kset."
>>
>> Taking a look at samples/kobject/kset-example.c, it seems to provide an
>> overall structure that is closer to what other modules do when creating
>> sysfs entries. It uses less dynamic allocations and deals a bit better
>> with cleaning up the state afterwards.
>>
> Thank you for pointing me towards this example, the kset approach is
> interesting and the example indeed does handle cleanups better.
>
> Currently, we use "machine_device_initcall()" to register this
> functionality, do you suggest I convert this into a tristate module
> instead where I can include a "module_exit" for cleanups?

Ugh.. I was hoping we could get away with having all cleanups done at
kobject release time. But now I see that it is not called unless we
decrement the reference count. Nevermind then.

>>> +   ret = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, ESI_FLAGS_ALL, 0,
>>> +virt_to_phys(esi_buf), MAX_BUF_SZ);
>>> +   esi_hdr = (struct h_energy_scale_info_hdr *) esi_buf;
>>> +   if (ret != H_SUCCESS || esi_hdr->data_header_version != ESI_VERSION) {
>> I really dislike this. If you want to bail due to version change, then
>> at least include in the ABI document that we might not give the
>> userspace any data at all.
>
> My only concern with having a version check is that the attribute list
> can change, and the attributes themselves may change as well.
> If that is the case, then in a newer version, if we do not bail out, we
> may parse data into our structs incorrectly.

Sure, that is a valid concern. But the documentation for the header
version field says:

  "Version of the Header. The header will be always backward compatible,
  and changes will not impact the Array of attributes. Current version =
  0x01"

I guess this is a bit vague still, but I understood that:

1- header elements continue to exist at the same position;
2- the format of the array of attributes will not change.

Are you saying that my interpretation above is not correct or that you
don't trust the HV to enforce it?

> My argument hinges only on the point that we should likely give no data at
> all instead of junk or incorrect data.

I agree. I just don't think it would be possible to end up with
incorrect data, unless the HV has a bug.

> Maybe I could make this check after the return check and give out a
> version mismatch message like the following?
> pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO VER MISMATCH - EXP: 0x%x, REC: 
> 0x%x",
>  ESI_VERSION, esi_hdr->data_header_version);

Yes, this will help with debug if we ever end up in this situation.

>>> +   pr_warn("hcall failed: H_GET_ENERGY_SCALE_INFO");
>>> +   goto out;
>>> +   }
>>> +
>>> +   num_attrs = be64_to_cpu(esi_hdr->num_attrs);
>>> +   /*
>>> +* Typecast the energy buffer to the attribute structure at the offset
>>> +* specified in the buffer
>>> +*/
>> I think the code is now simple enough that this comment could be
>> removed.
>
> ack
>
>>> +   esi_attrs = (struct energy_scale_attribute *)
>>> +   (esi_buf + be64_to_cpu(esi_hdr->array_offset));
>>> +
>>> +   pgs = kcalloc(num_attrs, sizeof(*pgs), GFP_KERNEL);
>> This is never freed.
>>
>>> +   if (!pgs)
>>> +   goto out_pgs;
>>> +
>>> +   papr_kobj = kobject_create_and_add("papr", firmware_kobj);
>>> +   if (!papr_kobj) {
>>> +   pr_warn("kobject_create_and_add papr failed\n");
>>> +   goto out_kobj;
>>> +   }
>>> +
>>> +   esi_kobj = kobject_create_and_add("energy_scale_info", papr_kobj);
>>> +   if (!esi_kobj) {
>>> +   pr_warn("kobject_create_and_add energy_scale_info failed\n");
>>> +   goto out_ekobj;
>>> +   }
>>> +
>>> +   for (idx = 0; idx < num_attrs; idx++) {
>>> +   char buf[4];
>>> +   bool show_val_desc = true;
>>> +
>>> +   pgs[idx].pgattrs = kcalloc(MAX_ATTRS,
>>> +  sizeof(*pgs[idx].pgattrs),
>>> +  GFP_KERNEL);
>>> +   if (!pgs[idx].pgattrs)
>>> +   goto out_kobj;
>>> +
>>> +   pgs[idx].pg.attrs = kcalloc(MAX_ATTRS + 1,
>>> +   sizeof(*pgs[idx].pg.attrs),
>>> +   GFP_KERNEL);
>> I think the kobject code expects this to be statically allocated, so
>> you'd need to override the release function in some way to be able to
>> free this.
>
> Right, this and pgs are both never freed, because my understanding was
> that, as this functionality is invoked from machine_init, I'd expect it
> to stay until shutdown.

Yep, I thought the kset code would improve this, but I misread 

[PATCH printk v4 3/6] printk: remove safe buffers

2021-07-15 Thread John Ogness
With @logbuf_lock removed, the high level printk functions for
storing messages are lockless. Messages can be stored from any
context, so there is no need for the NMI and safe buffers anymore.
Remove the NMI and safe buffers.

Although the safe buffers are removed, the NMI and safe context
tracking is still in place. In these contexts, store the message
immediately but still use irq_work to defer the console printing.

Since printk recursion tracking is in place, safe context tracking
for most of printk is not needed. Remove it. Only safe context
tracking relating to the console and console_owner locks is left
in place. This is because the console and console_owner locks are
needed for the actual printing.

Signed-off-by: John Ogness 
---
 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 include/linux/printk.h |  10 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  17 --
 kernel/printk/printk.c | 120 +---
 kernel/printk/printk_safe.c| 335 +
 lib/nmi_backtrace.c|   6 -
 9 files changed, 48 insertions(+), 450 deletions(-)

diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index b4ab95c9e94a..2522800217d1 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -170,7 +170,6 @@ extern void panic_flush_kmsg_start(void)
 
 extern void panic_flush_kmsg_end(void)
 {
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
bust_spinlocks(0);
debug_locks_off();
diff --git a/arch/powerpc/kernel/watchdog.c b/arch/powerpc/kernel/watchdog.c
index c9a8f4781a10..dc17d8903d4f 100644
--- a/arch/powerpc/kernel/watchdog.c
+++ b/arch/powerpc/kernel/watchdog.c
@@ -183,11 +183,6 @@ static void watchdog_smp_panic(int cpu, u64 tb)
 
wd_smp_unlock(&flags);
 
-   printk_safe_flush();
-   /*
-* printk_safe_flush() seems to require another print
-* before anything actually goes out to console.
-*/
if (sysctl_hardlockup_all_cpu_backtrace)
trigger_allbutself_cpu_backtrace();
 
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 1790a5521fd9..664612f75dac 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -207,8 +207,6 @@ __printf(1, 2) void dump_stack_set_arch_desc(const char 
*fmt, ...);
 void dump_stack_print_info(const char *log_lvl);
 void show_regs_print_info(const char *log_lvl);
 extern asmlinkage void dump_stack(void) __cold;
-extern void printk_safe_flush(void);
-extern void printk_safe_flush_on_panic(void);
 #else
 static inline __printf(1, 0)
 int vprintk(const char *s, va_list args)
@@ -272,14 +270,6 @@ static inline void show_regs_print_info(const char 
*log_lvl)
 static inline void dump_stack(void)
 {
 }
-
-static inline void printk_safe_flush(void)
-{
-}
-
-static inline void printk_safe_flush_on_panic(void)
-{
-}
 #endif
 
 #ifdef CONFIG_SMP
diff --git a/kernel/kexec_core.c b/kernel/kexec_core.c
index f099baee3578..69c6e9b7761c 100644
--- a/kernel/kexec_core.c
+++ b/kernel/kexec_core.c
@@ -978,7 +978,6 @@ void crash_kexec(struct pt_regs *regs)
old_cpu = atomic_cmpxchg(&panic_cpu, PANIC_CPU_INVALID, this_cpu);
if (old_cpu == PANIC_CPU_INVALID) {
/* This is the 1st CPU which comes here, so go ahead. */
-   printk_safe_flush_on_panic();
__crash_kexec(regs);
 
/*
diff --git a/kernel/panic.c b/kernel/panic.c
index 332736a72a58..1f0df42f8d0c 100644
--- a/kernel/panic.c
+++ b/kernel/panic.c
@@ -247,7 +247,6 @@ void panic(const char *fmt, ...)
 * Bypass the panic_cpu check and call __crash_kexec directly.
 */
if (!_crash_kexec_post_notifiers) {
-   printk_safe_flush_on_panic();
__crash_kexec(NULL);
 
/*
@@ -271,8 +270,6 @@ void panic(const char *fmt, ...)
 */
atomic_notifier_call_chain(&panic_notifier_list, 0, buf);
 
-   /* Call flush even twice. It tries harder with a single online CPU */
-   printk_safe_flush_on_panic();
kmsg_dump(KMSG_DUMP_PANIC);
 
/*
diff --git a/kernel/printk/internal.h b/kernel/printk/internal.h
index 51615c909b2f..6cc35c5de890 100644
--- a/kernel/printk/internal.h
+++ b/kernel/printk/internal.h
@@ -22,7 +22,6 @@ __printf(1, 0) int vprintk_deferred(const char *fmt, va_list 
args);
 void __printk_safe_enter(void);
 void __printk_safe_exit(void);
 
-void printk_safe_init(void);
 bool printk_percpu_data_ready(void);
 
 #define printk_safe_enter_irqsave(flags)   \
@@ -37,18 +36,6 @@ bool printk_percpu_data_ready(void);
local_irq_restore(flags);   \
} while (0)
 
-#define printk_safe_enter_irq()\
-   do {\
-   local_irq_disable();\
-   __printk_safe_enter();  \
-   } while (0)

[PATCH printk v4 4/6] printk: remove NMI tracking

2021-07-15 Thread John Ogness
All NMI contexts are handled the same as the safe context: store the
message and defer printing. There is no need to have special NMI
context tracking for this. Using in_nmi() is enough.

There are several parts of the kernel that are manually calling into
the printk NMI context tracking in order to cause general printk
deferred printing:

arch/arm/kernel/smp.c
arch/powerpc/kexec/crash.c
kernel/trace/trace.c

For these users, provide a new function pair
printk_deferred_enter/exit that explicitly achieves the same
objective.
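
The resulting usage pattern is simply (a sketch; as the comment added below
notes, interrupts must be disabled for the deferred duration):

	printk_deferred_enter();
	/* printk() calls here are stored, console printing is deferred */
	nmi_cpu_backtrace(get_irq_regs());
	printk_deferred_exit();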

Signed-off-by: John Ogness 
---
 arch/arm/kernel/smp.c   |  4 ++--
 arch/powerpc/kexec/crash.c  |  2 +-
 include/linux/hardirq.h |  2 --
 include/linux/printk.h  | 31 +++
 init/Kconfig|  5 -
 kernel/printk/internal.h|  8 
 kernel/printk/printk_safe.c | 37 +
 kernel/trace/trace.c|  4 ++--
 8 files changed, 25 insertions(+), 68 deletions(-)

diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index c7bb168b0d97..842427ff2b3c 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -667,9 +667,9 @@ static void do_handle_IPI(int ipinr)
break;
 
case IPI_CPU_BACKTRACE:
-   printk_nmi_enter();
+   printk_deferred_enter();
nmi_cpu_backtrace(get_irq_regs());
-   printk_nmi_exit();
+   printk_deferred_exit();
break;
 
default:
diff --git a/arch/powerpc/kexec/crash.c b/arch/powerpc/kexec/crash.c
index 0196d0c211ac..1070378c8e35 100644
--- a/arch/powerpc/kexec/crash.c
+++ b/arch/powerpc/kexec/crash.c
@@ -313,7 +313,7 @@ void default_machine_crash_shutdown(struct pt_regs *regs)
int (*old_handler)(struct pt_regs *regs);
 
/* Avoid hardlocking with irresponsive CPU holding logbuf_lock */
-   printk_nmi_enter();
+   printk_deferred_enter();
 
/*
 * This function is only called after the system
diff --git a/include/linux/hardirq.h b/include/linux/hardirq.h
index 69bc86ea382c..76878b357ffa 100644
--- a/include/linux/hardirq.h
+++ b/include/linux/hardirq.h
@@ -116,7 +116,6 @@ extern void rcu_nmi_exit(void);
do {\
lockdep_off();  \
arch_nmi_enter();   \
-   printk_nmi_enter(); \
BUG_ON(in_nmi() == NMI_MASK);   \
__preempt_count_add(NMI_OFFSET + HARDIRQ_OFFSET);   \
} while (0)
@@ -135,7 +134,6 @@ extern void rcu_nmi_exit(void);
do {\
BUG_ON(!in_nmi());  \
__preempt_count_sub(NMI_OFFSET + HARDIRQ_OFFSET);   \
-   printk_nmi_exit();  \
arch_nmi_exit();\
lockdep_on();   \
} while (0)
diff --git a/include/linux/printk.h b/include/linux/printk.h
index 664612f75dac..03f7ccedaf18 100644
--- a/include/linux/printk.h
+++ b/include/linux/printk.h
@@ -149,18 +149,6 @@ static inline __printf(1, 2) __cold
 void early_printk(const char *s, ...) { }
 #endif
 
-#ifdef CONFIG_PRINTK_NMI
-extern void printk_nmi_enter(void);
-extern void printk_nmi_exit(void);
-extern void printk_nmi_direct_enter(void);
-extern void printk_nmi_direct_exit(void);
-#else
-static inline void printk_nmi_enter(void) { }
-static inline void printk_nmi_exit(void) { }
-static inline void printk_nmi_direct_enter(void) { }
-static inline void printk_nmi_direct_exit(void) { }
-#endif /* PRINTK_NMI */
-
 struct dev_printk_info;
 
 #ifdef CONFIG_PRINTK
@@ -180,6 +168,16 @@ int printk(const char *fmt, ...);
  */
 __printf(1, 2) __cold int printk_deferred(const char *fmt, ...);
 
+extern void __printk_safe_enter(void);
+extern void __printk_safe_exit(void);
+/*
+ * The printk_deferred_enter/exit macros are available only as a hack for
+ * some code paths that need to defer all printk console printing. Interrupts
+ * must be disabled for the deferred duration.
+ */
+#define printk_deferred_enter __printk_safe_enter
+#define printk_deferred_exit __printk_safe_exit
+
 /*
  * Please don't use printk_ratelimit(), because it shares ratelimiting state
  * with all other unrelated printk_ratelimit() callsites.  Instead use
@@ -223,6 +221,15 @@ int printk_deferred(const char *s, ...)
 {
return 0;
 }
+
+static inline void printk_deferred_enter(void)
+{
+}
+
+static inline void printk_deferred_exit(void)
+{
+}
+
 static inline int printk_ratelimit(void)
 {
return 0;
diff --git a/init/Kconfig b/init/Kconfig
index a61c92066c2e..9c0510693543 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1506,11 +1506,6 @@ config PRINTK
  very difficult to 

[PATCH printk v4 0/6] printk: remove safe buffers

2021-07-15 Thread John Ogness
Hi,

Here is v4 of a series to remove the safe buffers. v3 can be
found here [0]. The safe buffers are no longer needed because
messages can be stored directly into the log buffer from any
context.

However, the safe buffers also provided a form of recursion
protection. For that reason, explicit recursion protection is
implemented for this series.

The safe buffers also implicitly provided serialization
between multiple CPUs executing in NMI context. This was
particularly necessary for the nmi_backtrace() output. This
serializiation is now preserved by using the printk cpulock.

With the removal of the safe buffers, there is no need for
extra NMI enter/exit tracking. So this is also removed
(which includes removing the config option CONFIG_PRINTK_NMI).

And finally, there are a few places in the kernel that need to
specify code blocks where all printk calls are to be deferred
printing. Previously the NMI tracking API was being (mis)used
for this purpose. This series introduces an official and
explicit interface for such cases. (Note that all deferred
printing will be removed anyway, once printing kthreads are
introduced.)

Changes since v3:

- Remove safe context tracking in vprintk().

- Add safe context tracking for @console_owner usage since that
  is also a component of the printing code.

- Refactor syslog_print() so that it is easier to understand
  and follow the locking logic.

- Introduce printk_deferred_enter/exit functions to be used by
  code that needs to specify code block where all printk calls
  are to be deferred printing.

John Ogness

[0] https://lore.kernel.org/lkml/2021062448.5190-1-john.ogn...@linutronix.de

John Ogness (6):
  lib/nmi_backtrace: explicitly serialize banner and regs
  printk: track/limit recursion
  printk: remove safe buffers
  printk: remove NMI tracking
  printk: convert @syslog_lock to mutex
  printk: syslog: close window between wait and read

 arch/arm/kernel/smp.c  |   4 +-
 arch/powerpc/kernel/traps.c|   1 -
 arch/powerpc/kernel/watchdog.c |   5 -
 arch/powerpc/kexec/crash.c |   2 +-
 include/linux/hardirq.h|   2 -
 include/linux/printk.h |  41 ++--
 init/Kconfig   |   5 -
 kernel/kexec_core.c|   1 -
 kernel/panic.c |   3 -
 kernel/printk/internal.h   |  25 ---
 kernel/printk/printk.c | 268 ++--
 kernel/printk/printk_safe.c| 364 +
 kernel/trace/trace.c   |   4 +-
 lib/nmi_backtrace.c|  13 +-
 14 files changed, 194 insertions(+), 544 deletions(-)


base-commit: 70333dec446292cd896cd051d2ebd6808b328949
-- 
2.20.1



[PATCH v1 09/16] powerpc/iommu: return error code from .map_sg() ops

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

Propagate the error up if vio_dma_iommu_map_sg() fails.

ppc_iommu_map_sg() may fail either because of iommu_range_alloc() or
because of tbl->it_ops->set(). The former only supports returning an
error with DMA_MAPPING_ERROR and an examination of the latter indicates
that it may return arch-specific errors (for example,
tce_buildmulti_pSeriesLP()). Hence, coalesce all of those errors into
-EINVAL.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Geoff Levand 
---
 arch/powerpc/kernel/iommu.c | 4 ++--
 arch/powerpc/platforms/ps3/system-bus.c | 2 +-
 arch/powerpc/platforms/pseries/vio.c| 5 +++--
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 2af89a5e379f..bd0ed618bfa5 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -473,7 +473,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table 
*tbl,
BUG_ON(direction == DMA_NONE);
 
if ((nelems == 0) || !tbl)
-   return 0;
+   return -EINVAL;
 
outs = s = segstart = &sglist[0];
outcount = 1;
@@ -599,7 +599,7 @@ int ppc_iommu_map_sg(struct device *dev, struct iommu_table 
*tbl,
if (s == outs)
break;
}
-   return 0;
+   return -EINVAL;
 }
 
 
diff --git a/arch/powerpc/platforms/ps3/system-bus.c 
b/arch/powerpc/platforms/ps3/system-bus.c
index 1a5665875165..c54eb46f0cfb 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -663,7 +663,7 @@ static int ps3_ioc0_map_sg(struct device *_dev, struct 
scatterlist *sg,
   unsigned long attrs)
 {
BUG();
-   return 0;
+   return -EINVAL;
 }
 
 static void ps3_sb_unmap_sg(struct device *_dev, struct scatterlist *sg,
diff --git a/arch/powerpc/platforms/pseries/vio.c 
b/arch/powerpc/platforms/pseries/vio.c
index e00f3725ec96..e31e59c54f30 100644
--- a/arch/powerpc/platforms/pseries/vio.c
+++ b/arch/powerpc/platforms/pseries/vio.c
@@ -560,7 +560,8 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct 
scatterlist *sglist,
for_each_sg(sglist, sgl, nelems, count)
alloc_size += roundup(sgl->length, IOMMU_PAGE_SIZE(tbl));
 
-   if (vio_cmo_alloc(viodev, alloc_size))
+   ret = vio_cmo_alloc(viodev, alloc_size);
+   if (ret)
goto out_fail;
ret = ppc_iommu_map_sg(dev, tbl, sglist, nelems, dma_get_mask(dev),
direction, attrs);
@@ -577,7 +578,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct 
scatterlist *sglist,
vio_cmo_dealloc(viodev, alloc_size);
 out_fail:
atomic_inc(&viodev->cmo.allocs_failed);
-   return 0;
+   return ret;
 }
 
 static void vio_dma_iommu_unmap_sg(struct device *dev,
-- 
2.20.1



[PATCH v1 06/16] ARM/dma-mapping: return error code from .map_sg() ops

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure,
so propagate any errors that may happen all the way up.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Russell King 
Cc: Thomas Bogendoerfer 
---
 arch/arm/mm/dma-mapping.c | 22 +-
 1 file changed, 13 insertions(+), 9 deletions(-)

diff --git a/arch/arm/mm/dma-mapping.c b/arch/arm/mm/dma-mapping.c
index c4b8df2ad328..8c286e690756 100644
--- a/arch/arm/mm/dma-mapping.c
+++ b/arch/arm/mm/dma-mapping.c
@@ -980,7 +980,7 @@ int arm_dma_map_sg(struct device *dev, struct scatterlist 
*sg, int nents,
 {
const struct dma_map_ops *ops = get_dma_ops(dev);
struct scatterlist *s;
-   int i, j;
+   int i, j, ret;
 
for_each_sg(sg, s, nents, i) {
 #ifdef CONFIG_NEED_SG_DMA_LENGTH
@@ -988,7 +988,8 @@ int arm_dma_map_sg(struct device *dev, struct scatterlist 
*sg, int nents,
 #endif
s->dma_address = ops->map_page(dev, sg_page(s), s->offset,
s->length, dir, attrs);
-   if (dma_mapping_error(dev, s->dma_address))
+   ret = dma_mapping_error(dev, s->dma_address);
+   if (ret)
goto bad_mapping;
}
return nents;
@@ -996,7 +997,7 @@ int arm_dma_map_sg(struct device *dev, struct scatterlist 
*sg, int nents,
  bad_mapping:
for_each_sg(sg, s, i, j)
ops->unmap_page(dev, sg_dma_address(s), sg_dma_len(s), dir, 
attrs);
-   return 0;
+   return ret;
 }
 
 /**
@@ -1622,7 +1623,7 @@ static int __iommu_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
 bool is_coherent)
 {
struct scatterlist *s = sg, *dma = sg, *start = sg;
-   int i, count = 0;
+   int i, count = 0, ret;
unsigned int offset = s->offset;
unsigned int size = s->offset + s->length;
unsigned int max = dma_get_max_seg_size(dev);
@@ -1634,8 +1635,10 @@ static int __iommu_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
s->dma_length = 0;
 
if (s->offset || (size & ~PAGE_MASK) || size + s->length > max) 
{
-   if (__map_sg_chunk(dev, start, size, &dma->dma_address,
-   dir, attrs, is_coherent) < 0)
+   ret = __map_sg_chunk(dev, start, size,
+   &dma->dma_address, dir, attrs,
+is_coherent);
+   if (ret < 0)
goto bad_mapping;
 
dma->dma_address += offset;
@@ -1648,8 +1651,9 @@ static int __iommu_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
}
size += s->length;
}
-   if (__map_sg_chunk(dev, start, size, &dma->dma_address, dir, attrs,
-   is_coherent) < 0)
+   ret = __map_sg_chunk(dev, start, size, &dma->dma_address, dir, attrs,
+is_coherent);
+   if (ret < 0)
goto bad_mapping;
 
dma->dma_address += offset;
@@ -1660,7 +1664,7 @@ static int __iommu_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
 bad_mapping:
for_each_sg(sg, s, count, i)
__iommu_remove_mapping(dev, sg_dma_address(s), sg_dma_len(s));
-   return 0;
+   return ret;
 }
 
 /**
-- 
2.20.1



[PATCH v1 03/16] iommu: Return full error code from iommu_map_sg[_atomic]()

2021-07-15 Thread Logan Gunthorpe
Convert to ssize_t return code so the return code from __iommu_map()
can be returned all the way down through dma_iommu_map_sg().

Signed-off-by: Logan Gunthorpe 
Cc: Joerg Roedel 
Cc: Will Deacon 
---
 drivers/iommu/iommu.c | 15 +++
 include/linux/iommu.h | 22 +++---
 2 files changed, 18 insertions(+), 19 deletions(-)

diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 5419c4b9f27a..bf971b4e34aa 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -2567,9 +2567,9 @@ size_t iommu_unmap_fast(struct iommu_domain *domain,
 }
 EXPORT_SYMBOL_GPL(iommu_unmap_fast);
 
-static size_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-struct scatterlist *sg, unsigned int nents, int 
prot,
-gfp_t gfp)
+static ssize_t __iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+   struct scatterlist *sg, unsigned int nents, int prot,
+   gfp_t gfp)
 {
const struct iommu_ops *ops = domain->ops;
size_t len = 0, mapped = 0;
@@ -2610,19 +2610,18 @@ static size_t __iommu_map_sg(struct iommu_domain 
*domain, unsigned long iova,
/* undo mappings already done */
iommu_unmap(domain, iova, mapped);
 
-   return 0;
-
+   return ret;
 }
 
-size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-   struct scatterlist *sg, unsigned int nents, int prot)
+ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+struct scatterlist *sg, unsigned int nents, int prot)
 {
might_sleep();
return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_KERNEL);
 }
 EXPORT_SYMBOL_GPL(iommu_map_sg);
 
-size_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
+ssize_t iommu_map_sg_atomic(struct iommu_domain *domain, unsigned long iova,
struct scatterlist *sg, unsigned int nents, int prot)
 {
return __iommu_map_sg(domain, iova, sg, nents, prot, GFP_ATOMIC);
diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 32d448050bf7..9369458ba1bd 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -414,11 +414,11 @@ extern size_t iommu_unmap(struct iommu_domain *domain, 
unsigned long iova,
 extern size_t iommu_unmap_fast(struct iommu_domain *domain,
   unsigned long iova, size_t size,
   struct iommu_iotlb_gather *iotlb_gather);
-extern size_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
-  struct scatterlist *sg,unsigned int nents, int prot);
-extern size_t iommu_map_sg_atomic(struct iommu_domain *domain,
- unsigned long iova, struct scatterlist *sg,
- unsigned int nents, int prot);
+extern ssize_t iommu_map_sg(struct iommu_domain *domain, unsigned long iova,
+   struct scatterlist *sg, unsigned int nents, int prot);
+extern ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
+  unsigned long iova, struct scatterlist *sg,
+  unsigned int nents, int prot);
 extern phys_addr_t iommu_iova_to_phys(struct iommu_domain *domain, dma_addr_t 
iova);
 extern void iommu_set_fault_handler(struct iommu_domain *domain,
iommu_fault_handler_t handler, void *token);
@@ -679,18 +679,18 @@ static inline size_t iommu_unmap_fast(struct iommu_domain 
*domain,
return 0;
 }
 
-static inline size_t iommu_map_sg(struct iommu_domain *domain,
- unsigned long iova, struct scatterlist *sg,
- unsigned int nents, int prot)
+static inline ssize_t iommu_map_sg(struct iommu_domain *domain,
+  unsigned long iova, struct scatterlist *sg,
+  unsigned int nents, int prot)
 {
-   return 0;
+   return -ENODEV;
 }
 
-static inline size_t iommu_map_sg_atomic(struct iommu_domain *domain,
+static inline ssize_t iommu_map_sg_atomic(struct iommu_domain *domain,
  unsigned long iova, struct scatterlist *sg,
  unsigned int nents, int prot)
 {
-   return 0;
+   return -ENODEV;
 }
 
 static inline void iommu_flush_iotlb_all(struct iommu_domain *domain)
-- 
2.20.1



[PATCH v1 07/16] ia64/sba_iommu: return error code from sba_map_sg_attrs()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

Propagate the return of dma_mapping_error() up, if it is an errno.

sba_coalesce_chunks() may only presently fail if sba_alloc_range()
fails, which in turn only fails if the iommu is out of mapping
resources, hence a -ENOMEM is used in that case.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Michael Ellerman 
Cc: Niklas Schnelle 
Cc: Thomas Bogendoerfer 
---
 arch/ia64/hp/common/sba_iommu.c | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/ia64/hp/common/sba_iommu.c b/arch/ia64/hp/common/sba_iommu.c
index 9148ddbf02e5..09dbe07a18c1 100644
--- a/arch/ia64/hp/common/sba_iommu.c
+++ b/arch/ia64/hp/common/sba_iommu.c
@@ -1431,7 +1431,7 @@ static int sba_map_sg_attrs(struct device *dev, struct 
scatterlist *sglist,
unsigned long attrs)
 {
struct ioc *ioc;
-   int coalesced, filled = 0;
+   int coalesced, filled = 0, ret;
 #ifdef ASSERT_PDIR_SANITY
unsigned long flags;
 #endif
@@ -1458,8 +1458,9 @@ static int sba_map_sg_attrs(struct device *dev, struct 
scatterlist *sglist,
sglist->dma_length = sglist->length;
sglist->dma_address = sba_map_page(dev, sg_page(sglist),
sglist->offset, sglist->length, dir, attrs);
-   if (dma_mapping_error(dev, sglist->dma_address))
-   return 0;
+   ret = dma_mapping_error(dev, sglist->dma_address);
+   if (ret)
+   return ret;
return 1;
}
 
@@ -1486,7 +1487,7 @@ static int sba_map_sg_attrs(struct device *dev, struct 
scatterlist *sglist,
coalesced = sba_coalesce_chunks(ioc, dev, sglist, nents);
if (coalesced < 0) {
sba_unmap_sg_attrs(dev, sglist, nents, dir, attrs);
-   return 0;
+   return -ENOMEM;
}
 
/*
-- 
2.20.1



[PATCH v1 13/16] xen: swiotlb: return error code from xen_swiotlb_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

xen_swiotlb_map_sg() may only fail if xen_swiotlb_map_page() fails, but
xen_swiotlb_map_page() only supports returning errors as
DMA_MAPPING_ERROR. So coalesce all errors into -EINVAL.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Konrad Rzeszutek Wilk 
Cc: Boris Ostrovsky 
Cc: Juergen Gross 
Cc: Stefano Stabellini 
---
 drivers/xen/swiotlb-xen.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c
index 24d11861ac7d..b5707127c9d7 100644
--- a/drivers/xen/swiotlb-xen.c
+++ b/drivers/xen/swiotlb-xen.c
@@ -509,7 +509,7 @@ xen_swiotlb_map_sg(struct device *dev, struct scatterlist 
*sgl, int nelems,
 out_unmap:
xen_swiotlb_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
sg_dma_len(sgl) = 0;
-   return 0;
+   return -EINVAL;
 }
 
 static void
-- 
2.20.1



[PATCH v1 00/16] .map_sg() error cleanup

2021-07-15 Thread Logan Gunthorpe
Hi,

This series is spun out and expanded from my work to add P2PDMA support
to DMA map operations[1].

The P2PDMA work requires distinguishing different error conditions in
a map_sg operation. dma_map_sgtable() already allows for returning an
error code (whereas dma_map_sg() is only allowed to return zero);
however, it currently only returns -EINVAL when a .map_sg() call returns
zero.

This series cleans up all .map_sg() implementations to return appropriate
error codes. After the cleanup, dma_map_sg() will still return zero,
however dma_map_sgtable() will pass the error code from the .map_sg()
call. Thanks go to Martin Oliveira for doing a lot of the cleanup of the
obscure implementations.
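
To illustrate what callers gain, checking the result of dma_map_sgtable()
can now look like this (a sketch, not part of the series; assume sgt was
built with sg_alloc_table() and dev is the mapping device):

	int ret = dma_map_sgtable(dev, sgt, DMA_TO_DEVICE, 0);

	if (ret)
		return ret;	/* a specific errno from the .map_sg() op, e.g. -ENOMEM */

	/* dma_map_sg() callers are unaffected: it still returns 0 on failure. */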

The patch set is based off of v5.14-rc1 and a git repo can be found
here:

  https://github.com/sbates130272/linux-p2pmem map_sg_err_cleanup_v1

Thanks,

Logan

[1] 
https://lore.kernel.org/linux-block/20210513223203.5542-1-log...@deltatee.com/

--

Logan Gunthorpe (5):
  dma-mapping: Allow map_sg() ops to return negative error codes
  dma-direct: Return appropriate error code from dma_direct_map_sg()
  iommu: Return full error code from iommu_map_sg[_atomic]()
  dma-iommu: Return error code from iommu_dma_map_sg()
  dma-mapping: Disallow .map_sg operations from returning zero on error

Martin Oliveira (11):
  alpha: return error code from alpha_pci_map_sg()
  ARM/dma-mapping: return error code from .map_sg() ops
  ia64/sba_iommu: return error code from sba_map_sg_attrs()
  MIPS/jazzdma: return error code from jazz_dma_map_sg()
  powerpc/iommu: return error code from .map_sg() ops
  s390/pci: return error code from s390_dma_map_sg()
  sparc/iommu: return error codes from .map_sg() ops
  parisc: return error code from .map_sg() ops
  xen: swiotlb: return error code from xen_swiotlb_map_sg()
  x86/amd_gart: return error code from gart_map_sg()
  dma-mapping: return error code from dma_dummy_map_sg()

 arch/alpha/kernel/pci_iommu.c   | 10 +++-
 arch/arm/mm/dma-mapping.c   | 22 +---
 arch/ia64/hp/common/sba_iommu.c |  9 +--
 arch/mips/jazz/jazzdma.c|  2 +-
 arch/powerpc/kernel/iommu.c |  4 +-
 arch/powerpc/platforms/ps3/system-bus.c |  2 +-
 arch/powerpc/platforms/pseries/vio.c|  5 +-
 arch/s390/pci/pci_dma.c | 12 ++--
 arch/sparc/kernel/iommu.c   |  4 +-
 arch/sparc/kernel/pci_sun4v.c   |  4 +-
 arch/sparc/mm/iommu.c   |  2 +-
 arch/x86/kernel/amd_gart_64.c   | 16 +++---
 drivers/iommu/dma-iommu.c   | 20 ---
 drivers/iommu/iommu.c   | 15 +++--
 drivers/parisc/ccio-dma.c   |  2 +-
 drivers/parisc/sba_iommu.c  |  2 +-
 drivers/xen/swiotlb-xen.c   |  2 +-
 include/linux/dma-map-ops.h |  6 +-
 include/linux/dma-mapping.h | 35 +++-
 include/linux/iommu.h   | 22 
 kernel/dma/direct.c |  2 +-
 kernel/dma/dummy.c  |  2 +-
 kernel/dma/mapping.c| 73 ++---
 23 files changed, 165 insertions(+), 108 deletions(-)


base-commit: e73f0f0ee7541171d89f2e2491130c7771ba58d3
--
2.20.1


[PATCH v1 14/16] x86/amd_gart: return error code from gart_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

So make __dma_map_cont() return a valid errno (which is then propagated
to gart_map_sg() via dma_map_cont()) and return it in case of failure.

Also, return -EINVAL in case of invalid nents.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Borislav Petkov 
Cc: "H. Peter Anvin" 
Cc: Niklas Schnelle 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
---
 arch/x86/kernel/amd_gart_64.c | 16 +---
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/amd_gart_64.c b/arch/x86/kernel/amd_gart_64.c
index 9ac696487b13..46aea9a4f26b 100644
--- a/arch/x86/kernel/amd_gart_64.c
+++ b/arch/x86/kernel/amd_gart_64.c
@@ -331,7 +331,7 @@ static int __dma_map_cont(struct device *dev, struct 
scatterlist *start,
int i;
 
if (iommu_start == -1)
-   return -1;
+   return -ENOMEM;
 
for_each_sg(start, s, nelems, i) {
unsigned long pages, addr;
@@ -380,13 +380,13 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
   enum dma_data_direction dir, unsigned long attrs)
 {
struct scatterlist *s, *ps, *start_sg, *sgmap;
-   int need = 0, nextneed, i, out, start;
+   int need = 0, nextneed, i, out, start, ret;
unsigned long pages = 0;
unsigned int seg_size;
unsigned int max_seg_size;
 
if (nents == 0)
-   return 0;
+   return -EINVAL;
 
out = 0;
start   = 0;
@@ -414,8 +414,9 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
if (!iommu_merge || !nextneed || !need || s->offset ||
(s->length + seg_size > max_seg_size) ||
(ps->offset + ps->length) % PAGE_SIZE) {
-   if (dma_map_cont(dev, start_sg, i - start,
-sgmap, pages, need) < 0)
+   ret = dma_map_cont(dev, start_sg, i - start,
+  sgmap, pages, need);
+   if (ret < 0)
goto error;
out++;
 
@@ -432,7 +433,8 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
pages += iommu_num_pages(s->offset, s->length, PAGE_SIZE);
ps = s;
}
-   if (dma_map_cont(dev, start_sg, i - start, sgmap, pages, need) < 0)
+   ret = dma_map_cont(dev, start_sg, i - start, sgmap, pages, need);
+   if (ret < 0)
goto error;
out++;
flush_gart();
@@ -458,7 +460,7 @@ static int gart_map_sg(struct device *dev, struct 
scatterlist *sg, int nents,
iommu_full(dev, pages << PAGE_SHIFT, dir);
for_each_sg(sg, s, nents, i)
s->dma_address = DMA_MAPPING_ERROR;
-   return 0;
+   return ret;
 }
 
 /* allocate and map a coherent mapping */
-- 
2.20.1



[PATCH v1 12/16] parisc: return error code from .map_sg() ops

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: "James E.J. Bottomley" 
Cc: Helge Deller 
---
 drivers/parisc/ccio-dma.c  | 2 +-
 drivers/parisc/sba_iommu.c | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/parisc/ccio-dma.c b/drivers/parisc/ccio-dma.c
index b5f9ee81a46c..a3a5cfda3d93 100644
--- a/drivers/parisc/ccio-dma.c
+++ b/drivers/parisc/ccio-dma.c
@@ -918,7 +918,7 @@ ccio_map_sg(struct device *dev, struct scatterlist *sglist, 
int nents,
BUG_ON(!dev);
ioc = GET_IOC(dev);
if (!ioc)
-   return 0;
+   return -ENODEV;

DBG_RUN_SG("%s() START %d entries\n", __func__, nents);
 
diff --git a/drivers/parisc/sba_iommu.c b/drivers/parisc/sba_iommu.c
index dce4cdf786cd..9a6671a230ee 100644
--- a/drivers/parisc/sba_iommu.c
+++ b/drivers/parisc/sba_iommu.c
@@ -947,7 +947,7 @@ sba_map_sg(struct device *dev, struct scatterlist *sglist, 
int nents,
 
ioc = GET_IOC(dev);
if (!ioc)
-   return 0;
+   return -ENODEV;
 
/* Fast path single entry scatterlists. */
if (nents == 1) {
-- 
2.20.1



[PATCH v1 02/16] dma-direct: Return appropriate error code from dma_direct_map_sg()

2021-07-15 Thread Logan Gunthorpe
Now that the map_sg() op expects error codes instead of return zero on
error, convert dma_direct_map_sg() to return an error code. The
only error to return presently is EINVAL if a page could not
be mapped.

Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/direct.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/direct.c b/kernel/dma/direct.c
index f737e3347059..803ee9321170 100644
--- a/kernel/dma/direct.c
+++ b/kernel/dma/direct.c
@@ -411,7 +411,7 @@ int dma_direct_map_sg(struct device *dev, struct 
scatterlist *sgl, int nents,
 
 out_unmap:
dma_direct_unmap_sg(dev, sgl, i, dir, attrs | DMA_ATTR_SKIP_CPU_SYNC);
-   return 0;
+   return -EINVAL;
 }
 
 dma_addr_t dma_direct_map_resource(struct device *dev, phys_addr_t paddr,
-- 
2.20.1



[PATCH v1 11/16] sparc/iommu: return error codes from .map_sg() ops

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

Returning an errno from __sbus_iommu_map_sg() results in
sbus_iommu_map_sg_gflush() and sbus_iommu_map_sg_pflush() returning an
errno, as those functions are wrappers around __sbus_iommu_map_sg().

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: "David S. Miller" 
Cc: Niklas Schnelle 
Cc: Michael Ellerman 
---
 arch/sparc/kernel/iommu.c | 4 ++--
 arch/sparc/kernel/pci_sun4v.c | 4 ++--
 arch/sparc/mm/iommu.c | 2 +-
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index a034f571d869..0589acd34201 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -448,7 +448,7 @@ static int dma_4u_map_sg(struct device *dev, struct 
scatterlist *sglist,
iommu = dev->archdata.iommu;
strbuf = dev->archdata.stc;
if (nelems == 0 || !iommu)
-   return 0;
+   return -EINVAL;
 
spin_lock_irqsave(&iommu->lock, flags);
 
@@ -580,7 +580,7 @@ static int dma_4u_map_sg(struct device *dev, struct 
scatterlist *sglist,
}
spin_unlock_irqrestore(&iommu->lock, flags);
 
-   return 0;
+   return -EINVAL;
 }
 
 /* If contexts are being used, they are the same in all of the mappings
diff --git a/arch/sparc/kernel/pci_sun4v.c b/arch/sparc/kernel/pci_sun4v.c
index 9de57e88f7a1..d90e80fa5705 100644
--- a/arch/sparc/kernel/pci_sun4v.c
+++ b/arch/sparc/kernel/pci_sun4v.c
@@ -486,7 +486,7 @@ static int dma_4v_map_sg(struct device *dev, struct 
scatterlist *sglist,
 
iommu = dev->archdata.iommu;
if (nelems == 0 || !iommu)
-   return 0;
+   return -EINVAL;
atu = iommu->atu;
 
prot = HV_PCI_MAP_ATTR_READ;
@@ -619,7 +619,7 @@ static int dma_4v_map_sg(struct device *dev, struct 
scatterlist *sglist,
}
local_irq_restore(flags);
 
-   return 0;
+   return -EINVAL;
 }
 
 static void dma_4v_unmap_sg(struct device *dev, struct scatterlist *sglist,
diff --git a/arch/sparc/mm/iommu.c b/arch/sparc/mm/iommu.c
index 0c0342e5b10d..01ffcedd159c 100644
--- a/arch/sparc/mm/iommu.c
+++ b/arch/sparc/mm/iommu.c
@@ -256,7 +256,7 @@ static int __sbus_iommu_map_sg(struct device *dev, struct 
scatterlist *sgl,
sg->dma_address =__sbus_iommu_map_page(dev, sg_page(sg),
sg->offset, sg->length, per_page_flush);
if (sg->dma_address == DMA_MAPPING_ERROR)
-   return 0;
+   return -EINVAL;
sg->dma_length = sg->length;
}
 
-- 
2.20.1



[PATCH v1 01/16] dma-mapping: Allow map_sg() ops to return negative error codes

2021-07-15 Thread Logan Gunthorpe
Allow dma_map_sgtable() to pass errors from the map_sg() ops. This
will be required for returning appropriate error codes when mapping
P2PDMA memory.

Introduce __dma_map_sg_attrs() which will return the raw error code
from the map_sg operation (whether it be negative or zero). Then add a
dma_map_sg_attrs() wrapper to convert any negative errors to zero to
satisfy the existing calling convention.

dma_map_sgtable() will convert a zero error return for old map_sg() ops
into a -EINVAL return and return any negative errors as reported.

This allows map_sg implementations to start returning multiple
negative error codes. Legacy map_sg implementations can continue
to return zero until they are all converted.
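
Condensed, the new split looks roughly like this (a sketch of the
kernel/dma/mapping.c changes; the real helpers also handle direct mapping,
ops-less devices and debug hooks):

	static int __dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
			int nents, enum dma_data_direction dir, unsigned long attrs)
	{
		/* raw result from the .map_sg() op: > 0, 0 (legacy error) or -errno */
		return get_dma_ops(dev)->map_sg(dev, sg, nents, dir, attrs);
	}

	int dma_map_sg_attrs(struct device *dev, struct scatterlist *sg,
			int nents, enum dma_data_direction dir, unsigned long attrs)
	{
		int ret = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);

		return ret < 0 ? 0 : ret;	/* keep the old "0 on error" convention */
	}

	int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
			enum dma_data_direction dir, unsigned long attrs)
	{
		int nents = __dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents, dir, attrs);

		if (nents < 0)
			return nents;		/* pass specific errors through */
		if (nents == 0)
			return -EINVAL;		/* legacy .map_sg() implementations */
		sgt->nents = nents;
		return 0;
	}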

Signed-off-by: Logan Gunthorpe 
---
 include/linux/dma-map-ops.h |  8 +++-
 include/linux/dma-mapping.h | 35 --
 kernel/dma/mapping.c| 73 +
 3 files changed, 78 insertions(+), 38 deletions(-)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index 0d53a96a3d64..eaa969be8284 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -41,8 +41,12 @@ struct dma_map_ops {
size_t size, enum dma_data_direction dir,
unsigned long attrs);
/*
-* map_sg returns 0 on error and a value > 0 on success.
-* It should never return a value < 0.
+* map_sg should return a negative error code on error.
+* dma_map_sgtable() will return the error code returned and convert
+* a zero return (for legacy implementations) into -EINVAL.
+*
+* dma_map_sg() will always return zero on any negative or zero
+* return to satisfy its own calling convention.
 */
int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 183e7103a66d..daa1e360f0ee 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -110,6 +110,8 @@ int dma_map_sg_attrs(struct device *dev, struct scatterlist 
*sg, int nents,
 void dma_unmap_sg_attrs(struct device *dev, struct scatterlist *sg,
  int nents, enum dma_data_direction dir,
  unsigned long attrs);
+int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
+   enum dma_data_direction dir, unsigned long attrs);
 dma_addr_t dma_map_resource(struct device *dev, phys_addr_t phys_addr,
size_t size, enum dma_data_direction dir, unsigned long attrs);
 void dma_unmap_resource(struct device *dev, dma_addr_t addr, size_t size,
@@ -174,6 +176,11 @@ static inline void dma_unmap_sg_attrs(struct device *dev,
unsigned long attrs)
 {
 }
+static inline int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
+   enum dma_data_direction dir, unsigned long attrs)
+{
+   return -EOPNOTSUPP;
+}
 static inline dma_addr_t dma_map_resource(struct device *dev,
phys_addr_t phys_addr, size_t size, enum dma_data_direction dir,
unsigned long attrs)
@@ -343,34 +350,6 @@ static inline void dma_sync_single_range_for_device(struct 
device *dev,
return dma_sync_single_for_device(dev, addr + offset, size, dir);
 }
 
-/**
- * dma_map_sgtable - Map the given buffer for DMA
- * @dev:   The device for which to perform the DMA operation
- * @sgt:   The sg_table object describing the buffer
- * @dir:   DMA direction
- * @attrs: Optional DMA attributes for the map operation
- *
- * Maps a buffer described by a scatterlist stored in the given sg_table
- * object for the @dir DMA operation by the @dev device. After success the
- * ownership for the buffer is transferred to the DMA domain.  One has to
- * call dma_sync_sgtable_for_cpu() or dma_unmap_sgtable() to move the
- * ownership of the buffer back to the CPU domain before touching the
- * buffer by the CPU.
- *
- * Returns 0 on success or -EINVAL on error during mapping the buffer.
- */
-static inline int dma_map_sgtable(struct device *dev, struct sg_table *sgt,
-   enum dma_data_direction dir, unsigned long attrs)
-{
-   int nents;
-
-   nents = dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents, dir, attrs);
-   if (nents <= 0)
-   return -EINVAL;
-   sgt->nents = nents;
-   return 0;
-}
-
 /**
  * dma_unmap_sgtable - Unmap the given buffer for DMA
  * @dev:   The device for which to perform the DMA operation
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 2b06a809d0b9..30f89d244566 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -177,12 +177,8 @@ void dma_unmap_page_attrs(struct device *dev, dma_addr_t 
addr, size_t size,
 }
 EXPORT_SYMBOL(dma_unmap_page_attrs);
 
-/*
- * dma_maps_sg_attrs returns 0 on error and > 0 

[PATCH v1 05/16] alpha: return error code from alpha_pci_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

pci_map_single_1() can fail for different reasons, but since the only
supported type of error return is DMA_MAPPING_ERROR, we coalesce those
errors into -EINVAL.

ENOMEM is returned when no page tables can be allocated.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
---
 arch/alpha/kernel/pci_iommu.c | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/alpha/kernel/pci_iommu.c b/arch/alpha/kernel/pci_iommu.c
index 35d7b3096d6e..72fc2465d13c 100644
--- a/arch/alpha/kernel/pci_iommu.c
+++ b/arch/alpha/kernel/pci_iommu.c
@@ -649,7 +649,9 @@ static int alpha_pci_map_sg(struct device *dev, struct 
scatterlist *sg,
sg->dma_address
  = pci_map_single_1(pdev, SG_ENT_VIRT_ADDRESS(sg),
 sg->length, dac_allowed);
-   return sg->dma_address != DMA_MAPPING_ERROR;
+   if (sg->dma_address == DMA_MAPPING_ERROR)
+   return -EINVAL;
+   return 1;
}
 
start = sg;
@@ -685,8 +687,10 @@ static int alpha_pci_map_sg(struct device *dev, struct 
scatterlist *sg,
if (out < end)
out->dma_length = 0;
 
-   if (out - start == 0)
+   if (out - start == 0) {
printk(KERN_WARNING "pci_map_sg failed: no entries?\n");
+   return -ENOMEM;
+   }
DBGA("pci_map_sg: %ld entries\n", out - start);
 
return out - start;
@@ -699,7 +703,7 @@ static int alpha_pci_map_sg(struct device *dev, struct 
scatterlist *sg,
   entries.  Unmap them now.  */
if (out > start)
pci_unmap_sg(pdev, start, out - start, dir);
-   return 0;
+   return -ENOMEM;
 }
 
 /* Unmap a set of streaming mode DMA translations.  Again, cpu read
-- 
2.20.1



[PATCH v1 10/16] s390/pci: return error code from s390_dma_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

So propagate the error from __s390_dma_map_sg() up.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Niklas Schnelle 
Cc: Gerald Schaefer 
Cc: Heiko Carstens 
Cc: Vasily Gorbik 
Cc: Christian Borntraeger 
---
 arch/s390/pci/pci_dma.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/arch/s390/pci/pci_dma.c b/arch/s390/pci/pci_dma.c
index ebc9a49523aa..c78b02012764 100644
--- a/arch/s390/pci/pci_dma.c
+++ b/arch/s390/pci/pci_dma.c
@@ -487,7 +487,7 @@ static int s390_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
unsigned int max = dma_get_max_seg_size(dev);
unsigned int size = s->offset + s->length;
unsigned int offset = s->offset;
-   int count = 0, i;
+   int count = 0, i, ret;
 
for (i = 1; i < nr_elements; i++) {
s = sg_next(s);
@@ -497,8 +497,9 @@ static int s390_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
 
if (s->offset || (size & ~PAGE_MASK) ||
size + s->length > max) {
-   if (__s390_dma_map_sg(dev, start, size,
- &dma->dma_address, dir))
+   ret = __s390_dma_map_sg(dev, start, size,
+   &dma->dma_address, dir);
+   if (ret)
goto unmap;
 
dma->dma_address += offset;
@@ -511,7 +512,8 @@ static int s390_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
}
size += s->length;
}
-   if (__s390_dma_map_sg(dev, start, size, &dma->dma_address, dir))
+   ret = __s390_dma_map_sg(dev, start, size, &dma->dma_address, dir);
+   if (ret)
goto unmap;
 
dma->dma_address += offset;
@@ -523,7 +525,7 @@ static int s390_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
s390_dma_unmap_pages(dev, sg_dma_address(s), sg_dma_len(s),
 dir, attrs);
 
-   return 0;
+   return ret;
 }
 
 static void s390_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
-- 
2.20.1



Re: [PATCH v1 00/16] .map_sg() error cleanup

2021-07-15 Thread Logan Gunthorpe




On 2021-07-15 10:53 a.m., Russell King (Oracle) wrote:
> On Thu, Jul 15, 2021 at 10:45:28AM -0600, Logan Gunthorpe wrote:
>> Hi,
>>
>> This series is spun out and expanded from my work to add P2PDMA support
>> to DMA map operations[1].
>>
>> The P2PDMA work requires distinguishing different error conditions in
>> a map_sg operation. dma_map_sgtable() already allows for returning an
>> error code (whereas dma_map_sg() is only allowed to return zero)
>> however, it currently only returns -EINVAL when a .map_sg() call returns
>> zero.
>>
>> This series cleans up all .map_sg() implementations to return appropriate
>> error codes. After the cleanup, dma_map_sg() will still return zero,
>> however dma_map_sgtable() will pass the error code from the .map_sg()
>> call. Thanks go to Martin Oliveira for doing a lot of the cleanup of the
>> obscure implementations.
>>
>> The patch set is based off of v5.14-rc1 and a git repo can be found
>> here:
> 
> Have all the callers for dma_map_sg() been updated to check for error
> codes? If not, isn't that a prerequisite to this patch set?

No. Perhaps I wasn't clear enough: This series is changing only
implementations of .map_sg(). It does *not* change the return code of
dma_map_sg(). dma_map_sg() will continue to return zero on error for the
foreseeable future. The dma_map_sgtable() call already allows returning
error codes and it will pass the new error code through. This is what
will be used in the P2PDMA work.
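
To make the caller-facing difference concrete, here is a minimal sketch
(the surrounding driver code is hypothetical; only the two calls and
their return conventions come from this series):

	struct sg_table sgt;	/* assume sgt was set up with sg_alloc_table() */
	int ret, nents;

	/* dma_map_sgtable() reports failure as a negative errno */
	ret = dma_map_sgtable(dev, &sgt, DMA_TO_DEVICE, 0);
	if (ret)		/* e.g. -EINVAL or -ENOMEM from .map_sg() */
		return ret;

	/* dma_map_sg() keeps the legacy convention: zero means failure */
	nents = dma_map_sg(dev, sgl, n, DMA_TO_DEVICE);
	if (nents == 0)
		return -EIO;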

Logan


[PATCH v1 08/16] MIPS/jazzdma: return error code from jazz_dma_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

vdma_alloc() may fail for different reasons, but since it only supports
indicating an error via a return of DMA_MAPPING_ERROR, we coalesce the
different reasons into -EINVAL.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
Cc: Thomas Bogendoerfer 
---
 arch/mips/jazz/jazzdma.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/mips/jazz/jazzdma.c b/arch/mips/jazz/jazzdma.c
index 461457b28982..3b99743435db 100644
--- a/arch/mips/jazz/jazzdma.c
+++ b/arch/mips/jazz/jazzdma.c
@@ -552,7 +552,7 @@ static int jazz_dma_map_sg(struct device *dev, struct 
scatterlist *sglist,
dir);
sg->dma_address = vdma_alloc(sg_phys(sg), sg->length);
if (sg->dma_address == DMA_MAPPING_ERROR)
-   return 0;
+   return -EINVAL;
sg_dma_len(sg) = sg->length;
}

--
2.20.1


[PATCH v1 04/16] dma-iommu: Return error code from iommu_dma_map_sg()

2021-07-15 Thread Logan Gunthorpe
Pass through appropriate error codes from iommu_dma_map_sg() now that
the error code will be passed through dma_map_sgtable().

Signed-off-by: Logan Gunthorpe 
Cc: Joerg Roedel 
Cc: Will Deacon 
---
 drivers/iommu/dma-iommu.c | 20 +---
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c
index 98ba927aee1a..9d35e9994306 100644
--- a/drivers/iommu/dma-iommu.c
+++ b/drivers/iommu/dma-iommu.c
@@ -972,7 +972,7 @@ static int iommu_dma_map_sg_swiotlb(struct device *dev, 
struct scatterlist *sg,
 
 out_unmap:
iommu_dma_unmap_sg_swiotlb(dev, sg, i, dir, attrs | 
DMA_ATTR_SKIP_CPU_SYNC);
-   return 0;
+   return -EINVAL;
 }
 
 /*
@@ -993,11 +993,14 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
dma_addr_t iova;
size_t iova_len = 0;
unsigned long mask = dma_get_seg_boundary(dev);
+   ssize_t ret;
int i;
 
-   if (static_branch_unlikely(&iommu_deferred_attach_enabled) &&
-   iommu_deferred_attach(dev, domain))
-   return 0;
+   if (static_branch_unlikely(&iommu_deferred_attach_enabled)) {
+   ret = iommu_deferred_attach(dev, domain);
+   if (ret)
+   return ret;
+   }
 
if (!(attrs & DMA_ATTR_SKIP_CPU_SYNC))
iommu_dma_sync_sg_for_device(dev, sg, nents, dir);
@@ -1045,14 +1048,17 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
}
 
iova = iommu_dma_alloc_iova(domain, iova_len, dma_get_mask(dev), dev);
-   if (!iova)
+   if (!iova) {
+   ret = -ENOMEM;
goto out_restore_sg;
+   }
 
/*
 * We'll leave any physical concatenation to the IOMMU driver's
 * implementation - it knows better than we do.
 */
-   if (iommu_map_sg_atomic(domain, iova, sg, nents, prot) < iova_len)
+   ret = iommu_map_sg_atomic(domain, iova, sg, nents, prot);
+   if (ret < iova_len)
goto out_free_iova;
 
return __finalise_sg(dev, sg, nents, iova);
@@ -1061,7 +1067,7 @@ static int iommu_dma_map_sg(struct device *dev, struct 
scatterlist *sg,
iommu_dma_free_iova(cookie, iova, iova_len, NULL);
 out_restore_sg:
__invalidate_sg(sg, nents);
-   return 0;
+   return ret;
 }
 
 static void iommu_dma_unmap_sg(struct device *dev, struct scatterlist *sg,
-- 
2.20.1



[PATCH v1 16/16] dma-mapping: Disallow .map_sg operations from returning zero on error

2021-07-15 Thread Logan Gunthorpe
Now that all the .map_sg operations have been converted to returning
proper error codes, drop the code to handle a zero return value,
add a warning if a zero is returned and update the comment for the
map_sg operation.
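
For reference, dma_map_sg() itself keeps its legacy convention; in
sketch form (a paraphrase of the behaviour described in the cover
letter, not a literal copy of the code):

	ents = __dma_map_sg_attrs(dev, sg, nents, dir, attrs);
	if (ents < 0)
		ents = 0;	/* dma_map_sg() callers still see 0 on error */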

Signed-off-by: Logan Gunthorpe 
---
 include/linux/dma-map-ops.h | 8 +++-
 kernel/dma/mapping.c| 6 +++---
 2 files changed, 6 insertions(+), 8 deletions(-)

diff --git a/include/linux/dma-map-ops.h b/include/linux/dma-map-ops.h
index eaa969be8284..f299bc1e317b 100644
--- a/include/linux/dma-map-ops.h
+++ b/include/linux/dma-map-ops.h
@@ -42,11 +42,9 @@ struct dma_map_ops {
unsigned long attrs);
/*
 * map_sg should return a negative error code on error.
-* dma_map_sgtable() will return the error code returned and convert
-* a zero return (for legacy implementations) into -EINVAL.
-*
-* dma_map_sg() will always return zero on any negative or zero
-* return to satisfy its own calling convention.
+* dma_map_sgtable() will return the error code returned by the
+* operation and dma_map_sg() will always convert any error to zero
+* to satisfy its own calling convention.
 */
int (*map_sg)(struct device *dev, struct scatterlist *sg, int nents,
enum dma_data_direction dir, unsigned long attrs);
diff --git a/kernel/dma/mapping.c b/kernel/dma/mapping.c
index 30f89d244566..978a6a16aaf7 100644
--- a/kernel/dma/mapping.c
+++ b/kernel/dma/mapping.c
@@ -194,6 +194,8 @@ static int __dma_map_sg_attrs(struct device *dev, struct 
scatterlist *sg,
else
ents = ops->map_sg(dev, sg, nents, dir, attrs);
 
+   WARN_ON_ONCE(ents == 0);
+
if (ents > 0)
debug_dma_map_sg(dev, sg, nents, ents, dir);
 
@@ -251,9 +253,7 @@ int dma_map_sgtable(struct device *dev, struct sg_table 
*sgt,
int nents;
 
nents = __dma_map_sg_attrs(dev, sgt->sgl, sgt->orig_nents, dir, attrs);
-   if (nents == 0)
-   return -EINVAL;
-   else if (nents < 0)
+   if (nents < 0)
return nents;
 
sgt->nents = nents;
-- 
2.20.1



[PATCH v1 15/16] dma-mapping: return error code from dma_dummy_map_sg()

2021-07-15 Thread Logan Gunthorpe
From: Martin Oliveira 

The .map_sg() op now expects an error code instead of zero on failure.

The only errno to return is -ENODEV in the case when DMA is not
supported.

Signed-off-by: Martin Oliveira 
Signed-off-by: Logan Gunthorpe 
---
 kernel/dma/dummy.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/dma/dummy.c b/kernel/dma/dummy.c
index eacd4c5b10bf..ae9abebed0c4 100644
--- a/kernel/dma/dummy.c
+++ b/kernel/dma/dummy.c
@@ -22,7 +22,7 @@ static int dma_dummy_map_sg(struct device *dev, struct 
scatterlist *sgl,
int nelems, enum dma_data_direction dir,
unsigned long attrs)
 {
-   return 0;
+   return -ENODEV;
 }
 
 static int dma_dummy_supported(struct device *hwdev, u64 mask)
-- 
2.20.1



Re: [PATCH v1 00/16] .map_sg() error cleanup

2021-07-15 Thread Russell King (Oracle)
On Thu, Jul 15, 2021 at 10:45:28AM -0600, Logan Gunthorpe wrote:
> Hi,
> 
> This series is spun out and expanded from my work to add P2PDMA support
> to DMA map operations[1].
> 
> The P2PDMA work requires distinguishing different error conditions in
> a map_sg operation. dma_map_sgtable() already allows for returning an
> error code (whereas dma_map_sg() is only allowed to return zero)
> however, it currently only returns -EINVAL when a .map_sg() call returns
> zero.
> 
> This series cleans up all .map_sg() implementations to return appropriate
> error codes. After the cleanup, dma_map_sg() will still return zero,
> however dma_map_sgtable() will pass the error code from the .map_sg()
> call. Thanks go to Martin Oliveira for doing a lot of the cleanup of the
> obscure implementations.
> 
> The patch set is based off of v5.14-rc1 and a git repo can be found
> here:

Have all the callers for dma_map_sg() been updated to check for error
codes? If not, isn't that a prerequisite to this patch set?

From what I see in Linus' current tree, we still have cases today
where the return value of dma_map_sg() is compared with zero to
detect failure, so I think that needs fixing before we start changing
the dma_map_sg() implementation to return negative numbers.

I also notice that there are various places that don't check the
return value - and returning a negative number instead of zero may
well cause random other bits to be set in fields.
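
A sketch of the kind of driver pattern in question (hypothetical code,
shown only to illustrate how a negative return could be misread):

	count = dma_map_sg(dev, sgl, nents, dir);
	if (!count)
		goto err;		/* only zero is treated as failure today */
	desc->sg_count = count;		/* a negative value would end up here */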

So, I think there's a fair amount of work to do in all the drivers
before this change can be considered.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!


[RFC PATCH 6/6] KVM: PPC: Book3S HV P9: Remove subcore HMI handling

2021-07-15 Thread Nicholas Piggin
On POWER9 and newer, rather than the complex HMI synchronisation and
subcore state, have each thread un-apply the guest TB offset before
calling into the early HMI handler.

This allows the subcore state to be avoided, including subcore enter
/ exit guest, which includes an expensive divide that shows up
slightly in profiles.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 12 +---
 arch/powerpc/kvm/book3s_hv_hmi.c  |  7 ++-
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 21 -
 arch/powerpc/kvm/book3s_hv_ras.c  |  4 
 4 files changed, 35 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 957efde59014..7d5c31537a79 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3985,8 +3985,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu->arch.ceded = 0;
 
-   kvmppc_subcore_enter_guest();
-
vcpu_vpa_increment_dispatch(vcpu);
 
if (kvmhv_on_pseries()) {
@@ -4039,8 +4037,6 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
 
vcpu_vpa_increment_dispatch(vcpu);
 
-   kvmppc_subcore_exit_guest();
-
return trap;
 }
 
@@ -6022,9 +6018,11 @@ static int kvmppc_book3s_init_hv(void)
if (r)
return r;
 
-   r = kvm_init_subcore_bitmap();
-   if (r)
-   return r;
+   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   r = kvm_init_subcore_bitmap();
+   if (r)
+   return r;
+   }
 
/*
 * We need a way of accessing the XICS interrupt controller,
diff --git a/arch/powerpc/kvm/book3s_hv_hmi.c b/arch/powerpc/kvm/book3s_hv_hmi.c
index 9af660476314..1ec50c69678b 100644
--- a/arch/powerpc/kvm/book3s_hv_hmi.c
+++ b/arch/powerpc/kvm/book3s_hv_hmi.c
@@ -20,10 +20,15 @@ void wait_for_subcore_guest_exit(void)
 
/*
 * NULL bitmap pointer indicates that KVM module hasn't
-* been loaded yet and hence no guests are running.
+* been loaded yet and hence no guests are running, or running
+* on POWER9 or newer CPU.
+*
 * If no KVM is in use, no need to co-ordinate among threads
 * as all of them will always be in host and no one is going
 * to modify TB other than the opal hmi handler.
+*
+* POWER9 and newer don't need this synchronisation.
+*
 * Hence, just return from here.
 */
if (!local_paca->sibling_subcore_state)
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 44150a29f6e0..3c9e7a500264 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -3,6 +3,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -900,7 +901,25 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
kvmppc_realmode_machine_check(vcpu);
 
} else if (unlikely(trap == BOOK3S_INTERRUPT_HMI)) {
-   kvmppc_realmode_hmi_handler();
+   /*
+* Unapply and clear the offset first. That way, if the TB
+* was fine then no harm done, if it is corrupted then the
+* HMI resync will bring it back to host mode. This way, we
+* don't need to actually know whether or not OPAL resynced the
+* timebase. Although it would be cleaner if we could rely
+* on that, early POWER9 OPAL did not support the
+* OPAL_HANDLE_HMI2 call.
+*/
+   if (vc->tb_offset_applied) {
+   u64 new_tb = mftb() - vc->tb_offset_applied;
+   mtspr(SPRN_TBU40, new_tb);
+   if ((mftb() & 0xff) < (new_tb & 0xff)) {
+   new_tb += 0x100;
+   mtspr(SPRN_TBU40, new_tb);
+   }
+   vc->tb_offset_applied = 0;
+   }
+   hmi_exception_realmode(NULL);
 
} else if (trap == BOOK3S_INTERRUPT_H_EMUL_ASSIST) {
vcpu->arch.emul_inst = mfspr(SPRN_HEIR);
diff --git a/arch/powerpc/kvm/book3s_hv_ras.c b/arch/powerpc/kvm/book3s_hv_ras.c
index d4bca93b79f6..a49ee9bdab67 100644
--- a/arch/powerpc/kvm/book3s_hv_ras.c
+++ b/arch/powerpc/kvm/book3s_hv_ras.c
@@ -136,6 +136,10 @@ void kvmppc_realmode_machine_check(struct kvm_vcpu *vcpu)
vcpu->arch.mce_evt = mce_evt;
 }
 
+/*
+ * This subcore HMI handling is all only for pre-POWER9 CPUs.
+ */
+
 /* Check if dynamic split is in force and return subcore size accordingly. */
 static inline int kvmppc_cur_subcore_size(void)
 {
-- 
2.23.0



[RFC PATCH 5/6] KVM: PPC: Book3S HV P9: Stop using vc->dpdes

2021-07-15 Thread Nicholas Piggin
The P9 path uses vc->dpdes only for msgsndp / SMT emulation. This adds
an ordering requirement between vcpu->doorbell_request and vc->dpdes for
no real benefit. Use vcpu->doorbell_request directly.

XXX: verify msgsndp / DPDES emulation works properly.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c  | 18 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  2 ++
 arch/powerpc/kvm/book3s_hv_p9_entry.c | 14 ++
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 93ecbc040529..957efde59014 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -766,6 +766,8 @@ static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 
if (vcpu->arch.doorbell_request)
return true;
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return false;
/*
 * Ensure that the read of vcore->dpdes comes after the read
 * of vcpu->doorbell_request.  This barrier matches the
@@ -2165,8 +2167,10 @@ static int kvmppc_get_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
 * either vcore->dpdes or doorbell_request.
 * On POWER8, doorbell_request is 0.
 */
-   *val = get_reg_val(id, vcpu->arch.vcore->dpdes |
-  vcpu->arch.doorbell_request);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   *val = get_reg_val(id, vcpu->arch.doorbell_request);
+   else
+   *val = get_reg_val(id, vcpu->arch.vcore->dpdes);
break;
case KVM_REG_PPC_VTB:
*val = get_reg_val(id, vcpu->arch.vcore->vtb);
@@ -2403,7 +2407,10 @@ static int kvmppc_set_one_reg_hv(struct kvm_vcpu *vcpu, 
u64 id,
vcpu->arch.pspb = set_reg_val(id, *val);
break;
case KVM_REG_PPC_DPDES:
-   vcpu->arch.vcore->dpdes = set_reg_val(id, *val);
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   vcpu->arch.doorbell_request = set_reg_val(id, *val) & 1;
+   else
+   vcpu->arch.vcore->dpdes = set_reg_val(id, *val);
break;
case KVM_REG_PPC_VTB:
vcpu->arch.vcore->vtb = set_reg_val(id, *val);
@@ -4431,11 +4438,6 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
 
if (!nested) {
kvmppc_core_prepare_to_enter(vcpu);
-   if (vcpu->arch.doorbell_request) {
-   vc->dpdes = 1;
-   smp_wmb();
-   vcpu->arch.doorbell_request = 0;
-   }
if (test_bit(BOOK3S_IRQPRIO_EXTERNAL,
 &vcpu->arch.pending_exceptions))
lpcr |= LPCR_MER;
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c 
b/arch/powerpc/kvm/book3s_hv_builtin.c
index be8ef1c5b1bf..9d24f4472b42 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -649,6 +649,8 @@ void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu)
int ext;
unsigned long lpcr;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
/* Insert EXTERNAL bit into LPCR at the MER bit position */
ext = (vcpu->arch.pending_exceptions >> BOOK3S_IRQPRIO_EXTERNAL) & 1;
lpcr = mfspr(SPRN_LPCR);
diff --git a/arch/powerpc/kvm/book3s_hv_p9_entry.c 
b/arch/powerpc/kvm/book3s_hv_p9_entry.c
index 82dac23cc678..44150a29f6e0 100644
--- a/arch/powerpc/kvm/book3s_hv_p9_entry.c
+++ b/arch/powerpc/kvm/book3s_hv_p9_entry.c
@@ -677,6 +677,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
unsigned long host_pidr;
unsigned long host_dawr1;
unsigned long host_dawrx1;
+   unsigned long dpdes;
 
hdec = time_limit - *tb;
if (hdec < 0)
@@ -736,8 +737,10 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
 
if (vc->pcr)
mtspr(SPRN_PCR, vc->pcr | PCR_MASK);
-   if (vc->dpdes)
-   mtspr(SPRN_DPDES, vc->dpdes);
+   if (vcpu->arch.doorbell_request) {
+   vcpu->arch.doorbell_request = 0;
+   mtspr(SPRN_DPDES, 1);
+   }
 
if (dawr_enabled()) {
if (vcpu->arch.dawr0 != host_dawr0)
@@ -968,7 +971,10 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
vcpu->arch.shregs.sprg2 = mfspr(SPRN_SPRG2);
vcpu->arch.shregs.sprg3 = mfspr(SPRN_SPRG3);
 
-   vc->dpdes = mfspr(SPRN_DPDES);
+   dpdes = mfspr(SPRN_DPDES);
+   if (dpdes)
+   vcpu->arch.doorbell_request = 1;
+
vc->vtb = mfspr(SPRN_VTB);
 
dec = mfspr(SPRN_DEC);
@@ -1030,7 +1036,7 @@ int kvmhv_vcpu_entry_p9(struct kvm_vcpu *vcpu, u64 
time_limit, unsigned long lpc
}

[RFC PATCH 4/6] KVM: PPC: Book3S HV P9: Tidy kvmppc_create_dtl_entry

2021-07-15 Thread Nicholas Piggin
This goes further toward removing vcores from the P9 path. Also avoid the
memset in favour of explicitly initialising all fields.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 61 +---
 1 file changed, 35 insertions(+), 26 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 2b02cfe3e456..93ecbc040529 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -703,41 +703,30 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 
now)
return p;
 }
 
-static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
-   struct kvmppc_vcore *vc, u64 tb)
+static void __kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
+   unsigned int pcpu, u64 now,
+   unsigned long stolen)
 {
struct dtl_entry *dt;
struct lppaca *vpa;
-   unsigned long stolen;
-   unsigned long core_stolen;
-   u64 now;
-   unsigned long flags;
 
dt = vcpu->arch.dtl_ptr;
vpa = vcpu->arch.vpa.pinned_addr;
-   now = tb;
-
-   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
-   stolen = 0;
-   } else {
-   core_stolen = vcore_stolen_time(vc, now);
-   stolen = core_stolen - vcpu->arch.stolen_logged;
-   vcpu->arch.stolen_logged = core_stolen;
-   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
-   stolen += vcpu->arch.busy_stolen;
-   vcpu->arch.busy_stolen = 0;
-   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
-   }
 
if (!dt || !vpa)
return;
-   memset(dt, 0, sizeof(struct dtl_entry));
+
dt->dispatch_reason = 7;
-   dt->processor_id = cpu_to_be16(vc->pcpu + vcpu->arch.ptid);
-   dt->timebase = cpu_to_be64(now + vc->tb_offset);
+   dt->preempt_reason = 0;
+   dt->processor_id = cpu_to_be16(pcpu + vcpu->arch.ptid);
dt->enqueue_to_dispatch_time = cpu_to_be32(stolen);
+   dt->ready_to_enqueue_time = 0;
+   dt->waiting_to_ready_time = 0;
+   dt->timebase = cpu_to_be64(now);
+   dt->fault_addr = 0;
dt->srr0 = cpu_to_be64(kvmppc_get_pc(vcpu));
dt->srr1 = cpu_to_be64(vcpu->arch.shregs.msr);
+
++dt;
if (dt == vcpu->arch.dtl.pinned_end)
dt = vcpu->arch.dtl.pinned_addr;
@@ -748,6 +737,27 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
vcpu->arch.dtl.dirty = true;
 }
 
+static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
+   struct kvmppc_vcore *vc)
+{
+   unsigned long stolen;
+   unsigned long core_stolen;
+   u64 now;
+   unsigned long flags;
+
+   now = mftb();
+
+   core_stolen = vcore_stolen_time(vc, now);
+   stolen = core_stolen - vcpu->arch.stolen_logged;
+   vcpu->arch.stolen_logged = core_stolen;
+   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
+   stolen += vcpu->arch.busy_stolen;
+   vcpu->arch.busy_stolen = 0;
+   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+
+   __kvmppc_create_dtl_entry(vcpu, vc->pcpu, now + vc->tb_offset, stolen);
+}
+
 /* See if there is a doorbell interrupt pending for a vcpu */
 static bool kvmppc_doorbell_pending(struct kvm_vcpu *vcpu)
 {
@@ -3722,7 +3732,7 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
pvc->pcpu = pcpu + thr;
for_each_runnable_thread(i, vcpu, pvc) {
kvmppc_start_thread(vcpu, pvc);
-   kvmppc_create_dtl_entry(vcpu, pvc, mftb());
+   kvmppc_create_dtl_entry(vcpu, pvc);
trace_kvm_guest_enter(vcpu);
if (!vcpu->arch.ptid)
thr0_done = true;
@@ -4272,7 +4282,7 @@ static int kvmppc_run_vcpu(struct kvm_vcpu *vcpu)
if ((vc->vcore_state == VCORE_PIGGYBACK ||
 vc->vcore_state == VCORE_RUNNING) &&
   !VCORE_IS_EXITING(vc)) {
-   kvmppc_create_dtl_entry(vcpu, vc, mftb());
+   kvmppc_create_dtl_entry(vcpu, vc);
kvmppc_start_thread(vcpu, vc);
trace_kvm_guest_enter(vcpu);
} else if (vc->vcore_state == VCORE_SLEEPING) {
@@ -4449,8 +4459,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
local_paca->kvm_hstate.ptid = 0;
local_paca->kvm_hstate.fake_suspend = 0;
 
-   vc->pcpu = pcpu; // for kvmppc_create_dtl_entry
-   kvmppc_create_dtl_entry(vcpu, vc, tb);
+   __kvmppc_create_dtl_entry(vcpu, pcpu, tb + vc->tb_offset, 0);
 
trace_kvm_guest_enter(vcpu);
 
-- 
2.23.0



[RFC PATCH 3/6] KVM: PPC: Book3S HV P9: Remove most of the vcore logic

2021-07-15 Thread Nicholas Piggin
The P9 path always uses one vcpu per vcore, so none of the vcore
locks, stolen time, blocking logic, shared waitq, etc., is required.

Remove most of it.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 147 ---
 1 file changed, 85 insertions(+), 62 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index d809566918de..2b02cfe3e456 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -281,6 +281,8 @@ static void kvmppc_core_start_stolen(struct kvmppc_vcore 
*vc, u64 tb)
 {
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
vc->preempt_tb = tb;
spin_unlock_irqrestore(&vc->stoltb_lock, flags);
@@ -290,6 +292,8 @@ static void kvmppc_core_end_stolen(struct kvmppc_vcore *vc, 
u64 tb)
 {
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
if (vc->preempt_tb != TB_NIL) {
vc->stolen_tb += tb - vc->preempt_tb;
@@ -302,7 +306,12 @@ static void kvmppc_core_vcpu_load_hv(struct kvm_vcpu 
*vcpu, int cpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
-   u64 now = mftb();
+   u64 now;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return;
+
+   now = mftb();
 
/*
 * We can test vc->runner without taking the vcore lock,
@@ -326,7 +335,12 @@ static void kvmppc_core_vcpu_put_hv(struct kvm_vcpu *vcpu)
 {
struct kvmppc_vcore *vc = vcpu->arch.vcore;
unsigned long flags;
-   u64 now = mftb();
+   u64 now;
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300))
+   return;
+
+   now = mftb();
 
if (vc->runner == vcpu && vc->vcore_state >= VCORE_SLEEPING)
kvmppc_core_start_stolen(vc, now);
@@ -678,6 +692,8 @@ static u64 vcore_stolen_time(struct kvmppc_vcore *vc, u64 
now)
u64 p;
unsigned long flags;
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
spin_lock_irqsave(&vc->stoltb_lock, flags);
p = vc->stolen_tb;
if (vc->vcore_state != VCORE_INACTIVE &&
@@ -700,13 +716,19 @@ static void kvmppc_create_dtl_entry(struct kvm_vcpu *vcpu,
dt = vcpu->arch.dtl_ptr;
vpa = vcpu->arch.vpa.pinned_addr;
now = tb;
-   core_stolen = vcore_stolen_time(vc, now);
-   stolen = core_stolen - vcpu->arch.stolen_logged;
-   vcpu->arch.stolen_logged = core_stolen;
-   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
-   stolen += vcpu->arch.busy_stolen;
-   vcpu->arch.busy_stolen = 0;
-   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+
+   if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+   stolen = 0;
+   } else {
+   core_stolen = vcore_stolen_time(vc, now);
+   stolen = core_stolen - vcpu->arch.stolen_logged;
+   vcpu->arch.stolen_logged = core_stolen;
+   spin_lock_irqsave(&vcpu->arch.tbacct_lock, flags);
+   stolen += vcpu->arch.busy_stolen;
+   vcpu->arch.busy_stolen = 0;
+   spin_unlock_irqrestore(&vcpu->arch.tbacct_lock, flags);
+   }
+
if (!dt || !vpa)
return;
memset(dt, 0, sizeof(struct dtl_entry));
@@ -903,13 +925,14 @@ static int kvm_arch_vcpu_yield_to(struct kvm_vcpu *target)
 * mode handler is not called but no other threads are in the
 * source vcore.
 */
-
-   spin_lock(&vcore->lock);
-   if (target->arch.state == KVMPPC_VCPU_RUNNABLE &&
-   vcore->vcore_state != VCORE_INACTIVE &&
-   vcore->runner)
-   target = vcore->runner;
-   spin_unlock(&vcore->lock);
+   if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
+   spin_lock(&vcore->lock);
+   if (target->arch.state == KVMPPC_VCPU_RUNNABLE &&
+   vcore->vcore_state != VCORE_INACTIVE &&
+   vcore->runner)
+   target = vcore->runner;
+   spin_unlock(&vcore->lock);
+   }
 
return kvm_vcpu_yield_to(target);
 }
@@ -3097,13 +3120,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
kvmppc_ipi_thread(cpu);
 }
 
-/* Old path does this in asm */
-static void kvmppc_stop_thread(struct kvm_vcpu *vcpu)
-{
-   vcpu->cpu = -1;
-   vcpu->arch.thread_cpu = -1;
-}
-
 static void kvmppc_wait_for_nap(int n_threads)
 {
int cpu = smp_processor_id();
@@ -3192,6 +3208,8 @@ static void kvmppc_vcore_preempt(struct kvmppc_vcore *vc)
 {
struct preempted_vcore_list *lp = this_cpu_ptr(&preempted_vcores);
 
+   WARN_ON_ONCE(cpu_has_feature(CPU_FTR_ARCH_300));
+
vc->vcore_state = VCORE_PREEMPT;
vc->pcpu = smp_processor_id();
if (vc->num_threads < threads_per_vcore(vc->kvm)) {
@@ -3208,6 +3226,8 @@ static void 

[RFC PATCH 2/6] KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exit

2021-07-15 Thread Nicholas Piggin
cpu_in_guest is set to determine if a CPU needs to be IPI'ed to exit
the guest and notice the need_tlb_flush bit.

This can be implemented as a global per-CPU pointer to the currently
running guest instead of per-guest cpumasks, saving 2 atomics per
entry/exit. P7/8 doesn't require cpu_in_guest, nor does a nested HV
(only the L0 does), so move it to the P9 HV path.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/include/asm/kvm_book3s_64.h |  1 -
 arch/powerpc/include/asm/kvm_host.h  |  1 -
 arch/powerpc/kvm/book3s_hv.c | 38 +---
 3 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_64.h 
b/arch/powerpc/include/asm/kvm_book3s_64.h
index 20ca9b1a2d41..2b442e00fb5d 100644
--- a/arch/powerpc/include/asm/kvm_book3s_64.h
+++ b/arch/powerpc/include/asm/kvm_book3s_64.h
@@ -43,7 +43,6 @@ struct kvm_nested_guest {
struct mutex tlb_lock;  /* serialize page faults and tlbies */
struct kvm_nested_guest *next;
cpumask_t need_tlb_flush;
-   cpumask_t cpu_in_guest;
short prev_cpu[NR_CPUS];
u8 radix;   /* is this nested guest radix */
 };
diff --git a/arch/powerpc/include/asm/kvm_host.h 
b/arch/powerpc/include/asm/kvm_host.h
index cd7939eb47ca..45dc92812020 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -288,7 +288,6 @@ struct kvm_arch {
u32 online_vcores;
atomic_t hpte_mod_interest;
cpumask_t need_tlb_flush;
-   cpumask_t cpu_in_guest;
u8 radix;
u8 fwnmi_enabled;
u8 secure_guest;
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 27a7a856eed1..d809566918de 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -2981,30 +2981,33 @@ static void kvmppc_release_hwthread(int cpu)
tpaca->kvm_hstate.kvm_split_mode = NULL;
 }
 
+static DEFINE_PER_CPU(struct kvm *, cpu_in_guest);
+
 static void radix_flush_cpu(struct kvm *kvm, int cpu, struct kvm_vcpu *vcpu)
 {
struct kvm_nested_guest *nested = vcpu->arch.nested;
-   cpumask_t *cpu_in_guest;
int i;
 
cpu = cpu_first_tlb_thread_sibling(cpu);
-   if (nested) {
+   if (nested)
cpumask_set_cpu(cpu, &nested->need_tlb_flush);
-   cpu_in_guest = &nested->cpu_in_guest;
-   } else {
+   else
cpumask_set_cpu(cpu, &kvm->arch.need_tlb_flush);
-   cpu_in_guest = &kvm->arch.cpu_in_guest;
-   }
/*
-* Make sure setting of bit in need_tlb_flush precedes
-* testing of cpu_in_guest bits.  The matching barrier on
-* the other side is the first smp_mb() in kvmppc_run_core().
+* Make sure setting of bit in need_tlb_flush precedes testing of
+* cpu_in_guest. The matching barrier on the other side is hwsync
+* when switching to guest MMU mode, which happens between
+* cpu_in_guest being set to the guest kvm, and need_tlb_flush bit
+* being tested.
 */
smp_mb();
for (i = cpu; i <= cpu_last_tlb_thread_sibling(cpu);
-   i += cpu_tlb_thread_sibling_step())
-   if (cpumask_test_cpu(i, cpu_in_guest))
+   i += cpu_tlb_thread_sibling_step()) {
+   struct kvm *running = *per_cpu_ptr(&cpu_in_guest, i);
+
+   if (running == kvm)
smp_call_function_single(i, do_nothing, NULL, 1);
+   }
 }
 
 static void do_migrate_away_vcpu(void *arg)
@@ -3072,7 +3075,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
 {
int cpu;
struct paca_struct *tpaca;
-   struct kvm *kvm = vc->kvm;
 
cpu = vc->pcpu;
if (vcpu) {
@@ -3083,7 +3085,6 @@ static void kvmppc_start_thread(struct kvm_vcpu *vcpu, 
struct kvmppc_vcore *vc)
cpu += vcpu->arch.ptid;
vcpu->cpu = vc->pcpu;
vcpu->arch.thread_cpu = cpu;
-   cpumask_set_cpu(cpu, &kvm->arch.cpu_in_guest);
}
tpaca = paca_ptrs[cpu];
tpaca->kvm_hstate.kvm_vcpu = vcpu;
@@ -3801,7 +3802,6 @@ static noinline void kvmppc_run_core(struct kvmppc_vcore 
*vc)
kvmppc_release_hwthread(pcpu + i);
if (sip && sip->napped[i])
kvmppc_ipi_thread(pcpu + i);
-   cpumask_clear_cpu(pcpu + i, &vc->kvm->arch.cpu_in_guest);
}
 
spin_unlock(&vc->lock);
@@ -3968,8 +3968,14 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
}
 
} else {
+   struct kvm *kvm = vcpu->kvm;
+
kvmppc_xive_push_vcpu(vcpu);
+
+   __this_cpu_write(cpu_in_guest, kvm);
trap = kvmhv_vcpu_entry_p9(vcpu, time_limit, lpcr, tb);
+   __this_cpu_write(cpu_in_guest, NULL);
+
if (trap == 

[RFC PATCH 1/6] KVM: PPC: Book3S HV P9: Add unlikely annotation for !mmu_ready

2021-07-15 Thread Nicholas Piggin
The mmu will almost always be ready.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/kvm/book3s_hv.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 9d15bbafe333..27a7a856eed1 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -4367,7 +4367,7 @@ int kvmhv_run_single_vcpu(struct kvm_vcpu *vcpu, u64 
time_limit,
vc->runner = vcpu;
 
/* See if the MMU is ready to go */
-   if (!kvm->arch.mmu_ready) {
+   if (unlikely(!kvm->arch.mmu_ready)) {
r = kvmhv_setup_mmu(vcpu);
if (r) {
run->exit_reason = KVM_EXIT_FAIL_ENTRY;
-- 
2.23.0



[RFC PATCH 0/6] KVM: PPC: Book3S HV P9: Reduce guest entry/exit

2021-07-15 Thread Nicholas Piggin
This goes on top of the previous speedup series. The previous series is
mostly involved with reducing the cost of SPR accesses. This one starts
to look beyond those, to atomics, barriers, and other logic that can be
reduced. After this series the P9 path uses very few things from the
vcore structure. This saves several hundred cycles for guest entry/exit
on a POWER9.

Thanks,
Nick

Nicholas Piggin (6):
  KVM: PPC: Book3S HV P9: Add unlikely annotation for !mmu_ready
  KVM: PPC: Book3S HV P9: Avoid cpu_in_guest atomics on entry and exit
  KVM: PPC: Book3S HV P9: Remove most of the vcore logic
  KVM: PPC: Book3S HV P9: Tidy kvmppc_create_dtl_entry
  KVM: PPC: Book3S HV P9: Stop using vc->dpdes
  KVM: PPC: Book3S HV P9: Remove subcore HMI handling

 arch/powerpc/include/asm/kvm_book3s_64.h |   1 -
 arch/powerpc/include/asm/kvm_host.h  |   1 -
 arch/powerpc/kvm/book3s_hv.c | 250 +--
 arch/powerpc/kvm/book3s_hv_builtin.c |   2 +
 arch/powerpc/kvm/book3s_hv_hmi.c |   7 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c|  35 +++-
 arch/powerpc/kvm/book3s_hv_ras.c |   4 +
 7 files changed, 185 insertions(+), 115 deletions(-)

-- 
2.23.0



Re: [PATCH] KVM: PPC: Book3S HV P9: Fix guest TM support

2021-07-15 Thread Michael Ellerman
On Mon, 12 Jul 2021 11:36:50 +1000, Nicholas Piggin wrote:
> The conversion to C introduced several bugs in TM handling that can
> cause host crashes with TM bad thing interrupts. Mostly just simple
> typos or missed logic in the conversion that got through due to my
> not testing TM in the guest sufficiently.
> 
> - Early TM emulation for the softpatch interrupt should be done if fake
>   suspend mode is _not_ active.
> 
> [...]

Applied to powerpc/fixes.

[1/1] KVM: PPC: Book3S HV P9: Fix guest TM support
  https://git.kernel.org/powerpc/c/e44fbdb68049539de9923ce4bad2d277aef54892

cheers


Re: [PATCH v3 1/1] powerpc/pseries: Interface to represent PAPR firmware attributes

2021-07-15 Thread Pratik Sampat

Hello,

On 12/07/21 9:13 pm, Fabiano Rosas wrote:

"Pratik R. Sampat"  writes:

Hi, have you seen Documentation/core-api/kobject.rst, particularly the
part that says:

"When you see a sysfs directory full of other directories, generally each
of those directories corresponds to a kobject in the same kset."

Taking a look at samples/kobject/kset-example.c, it seems to provide an
overall structure that is closer to what other modules do when creating
sysfs entries. It uses less dynamic allocations and deals a bit better
with cleaning up the state afterwards.


Thank you for pointing me towards this example, the kset approach is
interesting and the example indeed does handle cleanups better.

Currently, we use "machine_device_initcall()" to register this
functionality, do you suggest I convert this into a tristate module
instead where I can include a "module_exit" for cleanups?
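
A minimal sketch of what such a module could look like (entirely
illustrative; the names, placement under /sys/firmware and the error
handling are assumptions, not the current patch):

#include <linux/init.h>
#include <linux/kobject.h>
#include <linux/module.h>

static struct kobject *papr_kobj;	/* /sys/firmware/papr */
static struct kset *esi_kset;		/* .../energy_scale_info */

static int __init papr_attr_init(void)
{
	papr_kobj = kobject_create_and_add("papr", firmware_kobj);
	if (!papr_kobj)
		return -ENOMEM;

	esi_kset = kset_create_and_add("energy_scale_info", NULL, papr_kobj);
	if (!esi_kset) {
		kobject_put(papr_kobj);
		return -ENOMEM;
	}
	/* one kobject per attribute would then be added to esi_kset */
	return 0;
}

static void __exit papr_attr_exit(void)
{
	kset_unregister(esi_kset);
	kobject_put(papr_kobj);
}

module_init(papr_attr_init);
module_exit(papr_attr_exit);
MODULE_LICENSE("GPL");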


Adds a generic interface to represent the energy and frequency related
PAPR attributes on the system using the new H_CALL
"H_GET_ENERGY_SCALE_INFO".

The H_GET_EM_PARMS H_CALL was previously responsible for exporting this
information through lparcfg; however, the H_GET_EM_PARMS H_CALL will be
deprecated from P10 onwards.

The H_GET_ENERGY_SCALE_INFO H_CALL is of the following call format:
hcall(
   uint64 H_GET_ENERGY_SCALE_INFO,  // Get energy scale info
   uint64 flags,   // Per the flag request
   uint64 firstAttributeId,// The attribute id
   uint64 bufferAddress,   // Guest physical address of the output buffer
   uint64 bufferSize   // The size in bytes of the output buffer
);

This H_CALL can query either all the attributes at once with
firstAttributeId = 0, flags = 0 as well as query only one attribute
at a time with firstAttributeId = id, flags = 1.
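
In sketch form, querying all attributes could look like this (buffer
allocation and error handling elided; H_GET_ENERGY_SCALE_INFO and
plpar_hcall_norets() are the real interfaces, the rest is illustrative):

	rc = plpar_hcall_norets(H_GET_ENERGY_SCALE_INFO, 0 /* flags */,
				0 /* firstAttributeId */,
				virt_to_phys(buf), buf_size);
	if (rc != H_SUCCESS)
		return -EIO;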

The output buffer consists of the following
1. number of attributes  - 8 bytes
2. array offset to the data location - 8 bytes
3. version info  - 1 byte
4. A data array of size num attributes, which contains the following:
   a. attribute ID  - 8 bytes
   b. attribute value in number - 8 bytes
   c. attribute name in string  - 64 bytes
   d. attribute value in string - 64 bytes
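
Expressed as a C layout for clarity (field names here are illustrative,
not taken from the patch):

	struct energy_scale_attr {		/* one entry of the data array */
		__be64 id;			/* attribute ID */
		__be64 val;			/* attribute value in number */
		u8 desc[64];			/* attribute name in string */
		u8 value_desc[64];		/* attribute value in string */
	} __packed;

	struct energy_scale_buf_hdr {		/* start of the output buffer */
		__be64 num_attrs;		/* number of attributes */
		__be64 array_offset;		/* offset to the data array */
		u8 data_header_version;		/* version info */
		/* num_attrs struct energy_scale_attr entries at array_offset */
	} __packed;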

The new H_CALL exports information in direct string value format, hence
a new interface has been introduced in
/sys/firmware/papr/energy_scale_info to export this information to
userspace in an extensible pass-through format.

The H_CALL returns the name, numeric value and string value (if one exists).

The format of exposing the sysfs information is as follows:
/sys/firmware/papr/energy_scale_info/
|-- <id>/
  |-- desc
  |-- value
  |-- value_desc (if exists)
|-- <id>/
  |-- desc
  |-- value
  |-- value_desc (if exists)
...

The energy information that is exported is useful for userspace tools
such as powerpc-utils. Currently these tools infer the
"power_mode_data" value in the lparcfg, which in turn is obtained from
the to-be-deprecated H_GET_EM_PARMS H_CALL.
On future platforms, such userspace utilities will have to look at the
data returned from the new H_CALL being populated in this new sysfs
interface and report this information directly without the need of
interpretation.

Signed-off-by: Pratik R. Sampat 
Reviewed-by: Gautham R. Shenoy 
---
  .../sysfs-firmware-papr-energy-scale-info |  26 ++
  arch/powerpc/include/asm/hvcall.h |  24 +-
  arch/powerpc/kvm/trace_hv.h   |   1 +
  arch/powerpc/platforms/pseries/Makefile   |   3 +-
  .../pseries/papr_platform_attributes.c| 320 ++
  5 files changed, 372 insertions(+), 2 deletions(-)
  create mode 100644 
Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
  create mode 100644 arch/powerpc/platforms/pseries/papr_platform_attributes.c

diff --git a/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info 
b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
new file mode 100644
index ..fd82f2bfafe5
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-firmware-papr-energy-scale-info
@@ -0,0 +1,26 @@
+What:  /sys/firmware/papr/energy_scale_info
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Directory hosting a set of platform attributes like
+   energy/frequency on Linux running as a PAPR guest.
+
+   Each file in a directory contains a platform
+   attribute hierarchy pertaining to performance/
+   energy-savings mode and processor frequency.
+
+What:  /sys/firmware/papr/energy_scale_info/<id>
+   /sys/firmware/papr/energy_scale_info/<id>/desc
+   /sys/firmware/papr/energy_scale_info/<id>/value
+   /sys/firmware/papr/energy_scale_info/<id>/value_desc
+Date:  June 2021
+Contact:   Linux for PowerPC mailing list 
+Description:   Energy, frequency attributes directory for POWERVM servers
+
+   This directory provides energy, frequency,