[PATCH kernel v2] KVM: PPC: Fix TCE handling for VFIO

2022-04-19 Thread Alexey Kardashevskiy
The LoPAPR spec defines a guest visible IOMMU with a variable page size.
Currently QEMU advertises 4K, 64K, 2MB and 16MB pages, and a Linux VM picks
the biggest (16MB). In the case of a passed-through PCI device, there is
a hardware IOMMU which does not support all of the page sizes above -
P8 cannot do 2MB and P9 cannot do 16MB. So for each emulated
16MB IOMMU page we may create several smaller mappings ("TCEs") in
the hardware IOMMU.

The code wrongly uses the emulated TCE index instead of the hardware TCE
index in error handling. The problem is easier to see on POWER8 with
multi-level TCE tables (when only the first level is preallocated),
as hash mode uses real mode TCE hypercall handlers.
The kernel starts using indirect tables when VMs get bigger than 128GB
(depending on the max page order).
The very first real mode hcall is going to fail with H_TOO_HARD, as in
real mode we cannot allocate memory for TCEs (we can in virtual mode).
But on the way out the code attempts to clear the hardware TCEs using
emulated TCE indexes, which corrupts random kernel memory because
it_offset==1<<59 is subtracted from those indexes and the resulting index
is out of the TCE table bounds.

This fixes kvmppc_clear_tce() to use the correct TCE indexes.

While at it, this fixes TCE cache invalidation, which also used emulated TCE
indexes instead of the hardware ones. This went unnoticed because 64-bit DMA
is the norm these days: VMs map all RAM in one go, only then start DMA,
and that is when the TCE cache gets populated.

Potentially this could slow down mapping; however, 16MB emulated pages are
normally backed by 64K hardware pages, so it is one write to the
"TCE Kill" register per 256 updates, which is not that bad considering
the size of the cache (1024 TCEs or so).
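
To make the arithmetic concrete, here is a tiny standalone illustration of the
emulated-to-hardware index conversion the fix relies on (user-space C with
illustrative values; not part of the patch):

#include <stdio.h>

int main(void)
{
	unsigned int guest_page_shift = 24;	/* 16MB emulated IOMMU page */
	unsigned int hw_page_shift = 16;	/* 64K hardware IOMMU page */
	unsigned long entry = 5;		/* emulated TCE index */

	/* each emulated TCE is backed by this many hardware TCEs */
	unsigned long subpages = 1UL << (guest_page_shift - hw_page_shift);
	/* first hardware TCE index backing the emulated entry */
	unsigned long io_entry = entry << (guest_page_shift - hw_page_shift);

	/* prints: subpages=256 hw range=1280..1535 */
	printf("subpages=%lu hw range=%lu..%lu\n",
	       subpages, io_entry, io_entry + subpages - 1);
	return 0;
}

Error handling (and TCE cache invalidation) has to operate on the io_entry
range, not on the emulated entry.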

Fixes: ca1fc489cfa0 ("KVM: PPC: Book3S: Allow backing bigger guest IOMMU pages with smaller physical pages")

Reviewed-by: Frederic Barrat 
Reviewed-by: David Gibson 
Tested-by: David Gibson 
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v2:
* reworded the first paragraph of the commit log
---
 arch/powerpc/kvm/book3s_64_vio.c| 45 +++--
 arch/powerpc/kvm/book3s_64_vio_hv.c | 44 ++--
 2 files changed, 45 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
index d42b4b6d4a79..85cfa6328222 100644
--- a/arch/powerpc/kvm/book3s_64_vio.c
+++ b/arch/powerpc/kvm/book3s_64_vio.c
@@ -420,13 +420,19 @@ static void kvmppc_tce_put(struct kvmppc_spapr_tce_table *stt,
tbl[idx % TCES_PER_PAGE] = tce;
 }
 
-static void kvmppc_clear_tce(struct mm_struct *mm, struct iommu_table *tbl,
-   unsigned long entry)
+static void kvmppc_clear_tce(struct mm_struct *mm, struct kvmppc_spapr_tce_table *stt,
+   struct iommu_table *tbl, unsigned long entry)
 {
-   unsigned long hpa = 0;
-   enum dma_data_direction dir = DMA_NONE;
+   unsigned long i;
+   unsigned long subpages = 1ULL << (stt->page_shift - tbl->it_page_shift);
+   unsigned long io_entry = entry << (stt->page_shift - tbl->it_page_shift);
 
-   iommu_tce_xchg_no_kill(mm, tbl, entry, &hpa, &dir);
+   for (i = 0; i < subpages; ++i) {
+   unsigned long hpa = 0;
+   enum dma_data_direction dir = DMA_NONE;
+
+   iommu_tce_xchg_no_kill(mm, tbl, io_entry + i, &hpa, &dir);
+   }
 }
 
 static long kvmppc_tce_iommu_mapped_dec(struct kvm *kvm,
@@ -485,6 +491,8 @@ static long kvmppc_tce_iommu_unmap(struct kvm *kvm,
break;
}
 
+   iommu_tce_kill(tbl, io_entry, subpages);
+
return ret;
 }
 
@@ -544,6 +552,8 @@ static long kvmppc_tce_iommu_map(struct kvm *kvm,
break;
}
 
+   iommu_tce_kill(tbl, io_entry, subpages);
+
return ret;
 }
 
@@ -590,10 +600,9 @@ long kvmppc_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
ret = kvmppc_tce_iommu_map(vcpu->kvm, stt, stit->tbl,
entry, ua, dir);
 
-   iommu_tce_kill(stit->tbl, entry, 1);
 
if (ret != H_SUCCESS) {
-   kvmppc_clear_tce(vcpu->kvm->mm, stit->tbl, entry);
+   kvmppc_clear_tce(vcpu->kvm->mm, stt, stit->tbl, entry);
goto unlock_exit;
}
}
@@ -669,13 +678,13 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
 */
if (get_user(tce, tces + i)) {
ret = H_TOO_HARD;
-   goto invalidate_exit;
+   goto unlock_exit;
}
tce = be64_to_cpu(tce);
 
if (kvmppc_tce_to_ua(vcpu->kvm, tce, &ua)) {
ret = H_PARAMETER;
-   goto invalidate_exit;
+   goto unlock_exit;
}
 
list_for_each_entry_lockless(stit, &stt->iommu_tables, next) {
@@ -684,19 

Re: [PATCH 3/3] KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused

2022-04-19 Thread Maxim Levitsky
On Tue, 2022-04-19 at 15:45 +, Sean Christopherson wrote:
> On Tue, Apr 19, 2022, Maxim Levitsky wrote:
> > On Fri, 2022-04-15 at 00:43 +, Sean Christopherson wrote:
> > > Add wrappers to acquire/release KVM's SRCU lock when stashing the index
> > > in vcpu->src_idx, along with rudimentary detection of illegal usage,
> > > e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx.  Because the
> > > SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
> > > unnoticed for quite some time and only cause problems when the nested
> > > lock happens to get a different index.
> > > 
> > > Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
> > > likely yell so loudly that it will bring the kernel to its knees.
> > > 
> > > Signed-off-by: Sean Christopherson 
> > > ---
> 
> ...
> 
> > Looks good to me overall.
> > 
> > Note that there are still places that acquire the lock and store the idx 
> > into
> > a local variable, for example kvm_xen_vcpu_set_attr and such.
> > I didn't check yet if these should be converted as well.
> 
> Using a local variable is ok, even desirable.  Nested/multiple readers is not 
> an
> issue, the bug fixed by patch 1 is purely that kvm_vcpu.srcu_idx gets 
> corrupted.

Makes sense. I still recall *that* bug in AVIC inhibition where that srcu lock
was a major PITA, but now I remember that it was not due to nesting of the lock,
but rather the fact that we attempted to call synchronize_srcu or something like
that with it held.


> 
> In an ideal world, KVM would _only_ track the SRCU index in local variables, 
> but
> that would require plumbing the local variable down into vcpu_enter_guest() 
> and
> kvm_vcpu_block() so that SRCU can be unlocked prior to entering the guest or
> scheduling out the vCPU.
> 
It all makes sense now - thanks.
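
For reference, a minimal sketch of the kind of nesting-detecting wrapper being
discussed; the srcu_depth field here is purely illustrative, see the actual
patch for the real helpers:

static inline void kvm_vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_PROVE_RCU
	/* yell (once) if a reader is already active on this vCPU */
	WARN_ONCE(vcpu->srcu_depth++, "KVM: illegal vCPU SRCU nesting\n");
#endif
	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
}

static inline void kvm_vcpu_srcu_read_unlock(struct kvm_vcpu *vcpu)
{
#ifdef CONFIG_PROVE_RCU
	/* yell (once) on an unbalanced unlock */
	WARN_ONCE(--vcpu->srcu_depth, "KVM: illegal vCPU SRCU unlock\n");
#endif
	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
}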

Best regards,
Maxim Levitsky



[PATCH] ASoC: fsl_asrc: using pm_runtime_resume_and_get to simplify the code

2022-04-19 Thread cgel . zte
From: Minghao Chi 

Use pm_runtime_resume_and_get() to replace the pm_runtime_get_sync() and
pm_runtime_put_noidle() pair. This change just simplifies the code; there are
no functional changes.
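
The conversion follows the usual pattern: pm_runtime_resume_and_get() already
drops the usage count itself when the resume fails, so the explicit
pm_runtime_put_noidle() in the error path goes away. A generic sketch of the
before/after shape (illustrative, not the driver code itself):

	/* before */
	ret = pm_runtime_get_sync(dev);
	if (ret < 0) {
		pm_runtime_put_noidle(dev);
		return ret;
	}

	/* after */
	ret = pm_runtime_resume_and_get(dev);
	if (ret < 0)
		return ret;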

Reported-by: Zeal Robot 
Signed-off-by: Minghao Chi 
---
 sound/soc/fsl/fsl_asrc.c | 6 ++
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/sound/soc/fsl/fsl_asrc.c b/sound/soc/fsl/fsl_asrc.c
index d7d1536a4f37..31a7f9aac6e3 100644
--- a/sound/soc/fsl/fsl_asrc.c
+++ b/sound/soc/fsl/fsl_asrc.c
@@ -1211,11 +1211,9 @@ static int fsl_asrc_probe(struct platform_device *pdev)
goto err_pm_disable;
}
 
-   ret = pm_runtime_get_sync(&pdev->dev);
-   if (ret < 0) {
-   pm_runtime_put_noidle(&pdev->dev);
+   ret = pm_runtime_resume_and_get(&pdev->dev);
+   if (ret < 0)
goto err_pm_get_sync;
-   }
 
ret = fsl_asrc_init(asrc);
if (ret) {
-- 
2.25.1




Re: [PATCH v1 1/1] powerpc/83xx/mpc8349emitx: Get rid of of_node assignment

2022-04-19 Thread Linus Walleij
On Wed, Apr 6, 2022 at 3:02 PM Andy Shevchenko
 wrote:
> On Mon, Mar 28, 2022 at 03:16:08PM +0200, Linus Walleij wrote:
> > On Wed, Mar 23, 2022 at 6:43 PM Andy Shevchenko
> >  wrote:
> >
> > > Let the GPIO library assign of_node from the parent device.
> > > This allows moving the GPIO library and drivers to fwnode
> > > APIs instead of being stuck with OF-only interfaces.
> > >
> > > Signed-off-by: Andy Shevchenko 
> >
> > That's a nice patch.
> > Reviewed-by: Linus Walleij 
>
> Thanks!
>
> Can we have this applied now?

I think Michael Ellerman could help with this?

Michael?

Yours,
Linus Walleij


Apply d799769188529abc6cbf035a10087a51f7832b6b to 5.17 and 5.15?

2022-04-19 Thread Nathan Chancellor
Hi Greg, Sasha, and Michael,

Commit d79976918852 ("powerpc/64: Add UADDR64 relocation support") fixes
a boot failure with CONFIG_RELOCATABLE=y kernels linked with recent
versions of ld.lld [1]. Additionally, it resolves a separate boot
failure that Paul Menzel reported [2] with ld.lld 13.0.0. Is this a
reasonable backport for 5.17 and 5.15? It applies cleanly, resolves both
problems, and does not appear to cause any other issues in my testing
for both trees but I was curious what Michael's opinion was, as I am far
from a PowerPC expert.

This change does apply cleanly to 5.10 (I did not try earlier branches),
but there are other changes needed for ld.lld to link CONFIG_RELOCATABLE
kernels in that branch, so to avoid any regressions I think it is safe
to just focus on 5.15 and 5.17.

Paul, it would not hurt to confirm the results of my testing with your
setup, just to make sure I did not miss anything :)

[1]: https://github.com/ClangBuiltLinux/linux/issues/1581
[2]: https://lore.kernel.org/Yg2h2Q2vXFkkLGTh@dev-arch.archlinux-ax161/

Cheers,
Nathan


Re: [PATCH 3/3] KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused

2022-04-19 Thread Fabiano Rosas
Sean Christopherson  writes:

> Add wrappers to acquire/release KVM's SRCU lock when stashing the index
> in vcpu->src_idx, along with rudimentary detection of illegal usage,
> e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx.  Because the
> SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
> unnoticed for quite some time and only cause problems when the nested
> lock happens to get a different index.
>
> Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
> likely yell so loudly that it will bring the kernel to its knees.
>
> Signed-off-by: Sean Christopherson 

For the powerpc part:

Tested-by: Fabiano Rosas 

> ---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |  9 +
>  arch/powerpc/kvm/book3s_hv_nested.c| 16 +++
>  arch/powerpc/kvm/book3s_rtas.c |  4 ++--
>  arch/powerpc/kvm/powerpc.c |  4 ++--
>  arch/riscv/kvm/vcpu.c  | 16 +++
>  arch/riscv/kvm/vcpu_exit.c |  4 ++--
>  arch/s390/kvm/interrupt.c  |  4 ++--
>  arch/s390/kvm/kvm-s390.c   |  8 
>  arch/s390/kvm/vsie.c   |  4 ++--
>  arch/x86/kvm/x86.c | 28 --
>  include/linux/kvm_host.h   | 24 +-
>  11 files changed, 71 insertions(+), 50 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index e4ce2a35483f..42851c32ff3b 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -168,9 +168,10 @@ int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, 
> gva_t eaddr,
>   return -EINVAL;
>   /* Read the entry from guest memory */
>   addr = base + (index * sizeof(rpte));
> - vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> +
> + kvm_vcpu_srcu_read_lock(vcpu);
>   ret = kvm_read_guest(kvm, addr, &rpte, sizeof(rpte));
> - srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (ret) {
>   if (pte_ret_p)
>   *pte_ret_p = addr;
> @@ -246,9 +247,9 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu 
> *vcpu, gva_t eaddr,
>  
>   /* Read the table to find the root of the radix tree */
>   ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> - vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> - srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (ret)
>   return ret;
>  
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 9d373f8963ee..c943a051c6e7 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -306,10 +306,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   /* copy parameters in */
>   hv_ptr = kvmppc_get_gpr(vcpu, 4);
>   regs_ptr = kvmppc_get_gpr(vcpu, 5);
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   err = kvmhv_read_guest_state_and_regs(vcpu, &l2_hv, &l2_regs,
> hv_ptr, regs_ptr);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (err)
>   return H_PARAMETER;
>  
> @@ -410,10 +410,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   byteswap_hv_regs(&l2_hv);
>   byteswap_pt_regs(&l2_regs);
>   }
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   err = kvmhv_write_guest_state_and_regs(vcpu, &l2_hv, &l2_regs,
>  hv_ptr, regs_ptr);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (err)
>   return H_AUTHORITY;
>  
> @@ -600,16 +600,16 @@ long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu 
> *vcpu)
>   goto not_found;
>  
>   /* Write what was loaded into our buffer back to the L1 guest */
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   rc = kvm_vcpu_write_guest(vcpu, gp_to, buf, n);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (rc)
>   goto not_found;
>   } else {
>   /* Load the data to be stored from the L1 guest into our buf */
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   rc = kvm_vcpu_read_guest(vcpu, gp_from, buf, n);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> +

Re: [PATCH RFC 2/8] arm64: stacktrace: Add arch_within_stack_frames

2022-04-19 Thread Mark Rutland
Hi,

On Mon, Apr 18, 2022 at 09:22:11PM +0800, He Zhe wrote:
> This function checks if the given address range crosses frame boundary.

I don't think that's quite true, because arm64's procedure call standard
(AAPCS64) doesn't give us enough information to determine this without
additional metadata from the compiler, which we simply don't have today.

Since there's a lot of confusion in this area, I've made a bit of an info dump
below, before review on the patch itself, but TBH I'm struggling to see that
this is all that useful.

On arm64, we use a calling convention called AAPCS64, (in full: "Procedure Call
Standard for the Arm® 64-bit Architecture (AArch64)"). That's maintained at:

  https://github.com/ARM-software/abi-aa

... with the latest release (as of today) at:

  
https://github.com/ARM-software/abi-aa/blob/60a8eb8c55e999d74dac5e368fc9d7e36e38dda4/aapcs64/aapcs64.rst
  https://github.com/ARM-software/abi-aa/releases/download/2022Q1/aapcs64.pdf

In AAPCS64, there are two related but distinct things to be aware of:

* The "stack frame" of a function, which is the entire contiguous region of
  stack memory used by a function.

* The "frame record", which is the saved FP and LR placed *somewhere* within
  the function's stack frame. The FP points at the most recent frame record on
  the stack, and at function call boundaries points at the caller's frame
  record.

AAPCS64 doesn't say *where* a frame record is placed within a stack frame, and
there are reasons for compilers to place data both above and below it. So in
general, a function's stack frame looks like:
  
+=+
|  above  |
|-|
| FP | LR |
|-|
|  below  |
+=+

... where the "above" or "below" portions might be any size (even 0 bytes).

Typical code generation today means for most functions that the "below" portion
is 0 bytes in size, but this is not guaranteed, and even today there are cases
where this is not true.

When one function calls another without a stack transition, that looks like:

+=+ ___
|  above  |\
|-||
 ,->| FP | LR |+-- Caller's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+ ___ 
 |  |  above  |\
 |  |-||
 '--| FP | LR |+-- Callee's stack frame
|-||
|  below  | ___/
+=+

Where there's a stack transition, and the new stack is at a *lower* VA than the
old stack, that looks like:

+=+ ___
|  above  |\
|-||
 ,->| FP | LR |+-- Caller's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+
 | 
 |  ~~~
 |  Arbitrarily 
 |  large gap,
 |  potentially
 |  including
 |  other data
 |  ~~~
 |
 |  +=+ ___ 
 |  |  above  |\
 |  |-||
 '--| FP | LR |+-- Callee's stack frame
|-||
|  below  | ___/
+=+

Where there's a stack transition, and the new stack is at a *higher* VA than
the old stack, that looks like:

+=+ ___ 
|  above  |\
|-||
 ,--| FP | LR |+-- Callee's stack frame
 |  |-||
 |  |  below  | ___/
 |  +=+
 |
 |  ~~~
 |  Arbitrarily 
 |  large gap,
 |  potentially
 |  including
 |  other data
 |  ~~~
 | 
 |  +=+ ___
 |  |  above  |\
 |  |-||
 '->| FP | LR |+-- Caller's stack frame
|-||
|  below  | ___/
+=+
 
In all of these cases, we *cannot* identify the boundary between the two stack
frames; we can *only* identify where something overlaps a frame record. That
might itself be a good thing, but it's not the same thing as what you describe
in the commit message.
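
To make that concrete, a rough sketch of what an FP-chain walk can actually
see (illustrative C, not kernel code): each step exposes only the frame record
itself, never the extent of the "above"/"below" portions of the frame.

struct frame_record {
	unsigned long fp;	/* caller's frame record (0 terminates the chain) */
	unsigned long lr;	/* return address into the caller */
};

static void walk_frame_records(unsigned long fp)
{
	while (fp) {
		const struct frame_record *rec = (const struct frame_record *)fp;

		/* rec->lr says where we return to; nothing here says how big
		 * the surrounding stack frame is. */
		fp = rec->fp;
	}
}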

> It is based on the existing x86 algorithm, but implemented via stacktrace.
> This can be tested by USERCOPY_STACK_FRAME_FROM and
> USERCOPY_STACK_FRAME_TO in lkdtm.

Can you please explain *why* we'd want this?

Who do we expect to use this?

What's the overhead in practice?

Has this passed a more realistic stress test (e.g. running some userspace
applications which make intensive use of copies to/from the kernel)?

> 
> Signed-off-by: He Zhe 
> ---
>  arch/arm64/Kconfig   |  1 +
>  arch/arm64/include/asm/thread_info.h | 12 +
>  arch/arm64/kernel/stacktrace.c   | 76 ++--
>  3 files changed, 85 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index 57c4c995965f..0f52a83d7771 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -165,6 +165,7 @@ config ARM64
>   select HAVE_ARCH_TRACEHOOK
>   select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>   select HAVE_ARCH_VMAP_STACK
> + select 

Re: [PATCH RFC 1/8] stacktrace: Change callback prototype to pass more information

2022-04-19 Thread He Zhe



On 4/19/22 21:09, Mark Rutland wrote:
> On Mon, Apr 18, 2022 at 09:22:10PM +0800, He Zhe wrote:
>> Currently stack_trace_consume_fn can only have pc of each frame of the
>> stack. Copying-beyond-the-frame-detection also needs fp of current and
>> previous frame. Other detection algorithm in the future may need more
>> information of the frame.
>>
>> We define a frame_info to include them all.
>>
>>
>> Signed-off-by: He Zhe 
>> ---
>>  include/linux/stacktrace.h |  9 -
>>  kernel/stacktrace.c| 10 +-
>>  2 files changed, 13 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
>> index 97455880ac41..5a61bfafe6f0 100644
>> --- a/include/linux/stacktrace.h
>> +++ b/include/linux/stacktrace.h
>> @@ -10,15 +10,22 @@ struct pt_regs;
>>  
>>  #ifdef CONFIG_ARCH_STACKWALK
>>  
>> +struct frame_info {
>> +unsigned long pc;
>> +unsigned long fp;
>> +unsigned long prev_fp;
>> +};
> I don't think this should be exposed through a generic interface; the `fp` and
> `prev_fp` values are only meaningful with arch-specific knowledge, and they're
> *very* easy to misuse (e.g. when transitioning from one stack to another).
> There's also a bunch of other information one may or may not want, depending 
> on
> what you're trying to achieve.
>
> I am happy to have an arch-specific internal unwinder where we can access this
> information, and *maybe* it makes sense to have a generic API that passes some
> opaque token, but I don't think we should make the structure generic.

Thanks for the thoughts. I saw that unwind_frame() etc. were made private
earlier and took that as a hint that all further stack-walk work should be
based on those functions, and came up with this. But OK, good to know that an
arch-specific unwinder would be fine; I'll redo this series in that way.

Thanks,
Zhe

>
> Thanks,
> Mark.
>
>> +
>>  /**
>>   * stack_trace_consume_fn - Callback for arch_stack_walk()
>>   * @cookie: Caller supplied pointer handed back by arch_stack_walk()
>>   * @addr:   The stack entry address to consume
>> + * @fi: The frame information to consume
>>   *
>>   * Return:  True, if the entry was consumed or skipped
>>   *  False, if there is no space left to store
>>   */
>> -typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
>> +typedef bool (*stack_trace_consume_fn)(void *cookie, struct frame_info *fi);
>>  /**
>>   * arch_stack_walk - Architecture specific function to walk the stack
>>   * @consume_entry:  Callback which is invoked by the architecture code for
>> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
>> index 9ed5ce989415..2d0a2812e92b 100644
>> --- a/kernel/stacktrace.c
>> +++ b/kernel/stacktrace.c
>> @@ -79,7 +79,7 @@ struct stacktrace_cookie {
>>  unsigned intlen;
>>  };
>>  
>> -static bool stack_trace_consume_entry(void *cookie, unsigned long addr)
>> +static bool stack_trace_consume_entry(void *cookie, struct frame_info *fi)
>>  {
>>  struct stacktrace_cookie *c = cookie;
>>  
>> @@ -90,15 +90,15 @@ static bool stack_trace_consume_entry(void *cookie, 
>> unsigned long addr)
>>  c->skip--;
>>  return true;
>>  }
>> -c->store[c->len++] = addr;
>> +c->store[c->len++] = fi->pc;
>>  return c->len < c->size;
>>  }
>>  
>> -static bool stack_trace_consume_entry_nosched(void *cookie, unsigned long 
>> addr)
>> +static bool stack_trace_consume_entry_nosched(void *cookie, struct 
>> frame_info *fi)
>>  {
>> -if (in_sched_functions(addr))
>> +if (in_sched_functions(fi->pc))
>>  return true;
>> -return stack_trace_consume_entry(cookie, addr);
>> +return stack_trace_consume_entry(cookie, fi);
>>  }
>>  
>>  /**
>> -- 
>> 2.25.1
>>



Re: [PATCH RFC 2/8] arm64: stacktrace: Add arch_within_stack_frames

2022-04-19 Thread He Zhe



On 4/19/22 05:59, Kees Cook wrote:
> On Mon, Apr 18, 2022 at 09:22:11PM +0800, He Zhe wrote:
>> This function checks if the given address range crosses frame boundary.
>> It is based on the existing x86 algorithm, but implemented via stacktrace.
>> This can be tested by USERCOPY_STACK_FRAME_FROM and
>> USERCOPY_STACK_FRAME_TO in lkdtm.
> Hi,
>
> Thanks for doing this implementation! One reason usercopy hardening
> didn't persue doing a "full" stacktrace was because it seemed relatively
> expensive. Did you do any usercopy-heavily workload testing to see if
> there was a noticeable performance impact?
>
> It would be nice to block the exposure of canaries and PAC bits, though,
> so I'm not opposed, but I'd like to get a better sense of how "heavy"
> this might be.

I just did some rough tests:
hackbench -s 512 -l 200 -g 15 -f 25 -P
Such a command line would hit arch_within_stack_frames at least 5000 times in
my environment.
With hardened_usercopy=on, the execution time would be around 2.121 seconds
(average over 30 runs).
With hardened_usercopy=off, the execution time would be around 2.011 seconds
(average over 30 runs).

I'll test the original x86 way for arm64 tomorrow.

Any other workload needed to be run?

Thanks,
Zhe


>
> Thanks!
>
> -Kees
>



Re: [PATCH v2 1/2] of: Create platform devices for OF framebuffers

2022-04-19 Thread Thomas Zimmermann

Hi

On 19.04.22 at 15:30, Rob Herring wrote:
...

-#ifndef CONFIG_PPC
  static const struct of_device_id reserved_mem_matches[] = {
{ .compatible = "qcom,rmtfs-mem" },
{ .compatible = "qcom,cmd-db" },
@@ -520,33 +519,81 @@ static const struct of_device_id reserved_mem_matches[] = 
{
  
  static int __init of_platform_default_populate_init(void)

  {
-   struct device_node *node;
-


As both if/else clauses need 'node', I'd keep this declared here.


Ok.




device_links_supplier_sync_state_pause();
  
  	if (!of_have_populated_dt())

return -ENODEV;
  
-	/*

-* Handle certain compatibles explicitly, since we don't want to create
-* platform_devices for every node in /reserved-memory with a
-* "compatible",
-*/
-   for_each_matching_node(node, reserved_mem_matches)
-   of_platform_device_create(node, NULL, NULL);
+   if (IS_ENABLED(CONFIG_PPC)) {
+   struct device_node *boot_display = NULL;
+   struct device_node *node;
+   struct platform_device *dev;
+   int ret;
+
+   /* Check if we have a MacOS display without a node spec */
+   if (of_get_property(of_chosen, "linux,bootx-noscreen", NULL)) {
+   /*
+* The old code tried to work out which node was the 
MacOS
+* display based on the address. I'm dropping that 
since the
+* lack of a node spec only happens with old BootX 
versions
+* (users can update) and with this code, they'll still 
get
+* a display (just not the palette hacks).
+*/
+   dev = platform_device_alloc("bootx-noscreen", 0);
+   if (WARN_ON(!dev))
+   return -ENOMEM;
+   ret = platform_device_add(dev);
+   if (WARN_ON(ret)) {
+   platform_device_put(dev);
+   return ret;
+   }
+   }
  
-	node = of_find_node_by_path("/firmware");

-   if (node) {
-   of_platform_populate(node, NULL, NULL, NULL);
-   of_node_put(node);
-   }
+   /*
+* For OF framebuffers, first create the device for the boot 
display,
+* then for the other framebuffers. Only fail for the boot 
display;
+* ignore errors for the rest.
+*/
+   for_each_node_by_type(node, "display") {
+   if (!of_get_property(node, "linux,opened", NULL) ||
+   !of_get_property(node, "linux,boot-display", NULL))
+   continue;
+   dev = of_platform_device_create(node, "of-display", 
NULL);
+   if (WARN_ON(!dev))
+   return -ENOMEM;
+   boot_display = node;
+   break;
+   }
+   for_each_node_by_type(node, "display") {
+   if (!of_get_property(node, "linux,opened", NULL) || 
node == boot_display)
+   continue;
+   of_platform_device_create(node, "of-display", NULL);
+   }
  
-	node = of_get_compatible_child(of_chosen, "simple-framebuffer");

-   of_platform_device_create(node, NULL, NULL);
-   of_node_put(node);
+   } else {
+   struct device_node *node;
+
+   /*
+* Handle certain compatibles explicitly, since we don't want 
to create
+* platform_devices for every node in /reserved-memory with a
+* "compatible",
+*/
+   for_each_matching_node(node, reserved_mem_matches)
+   of_platform_device_create(node, NULL, NULL);
  
-	/* Populate everything else. */

-   of_platform_default_populate(NULL, NULL, NULL);
+   node = of_find_node_by_path("/firmware");
+   if (node) {
+   of_platform_populate(node, NULL, NULL, NULL);
+   of_node_put(node);
+   }
+
+   node = of_get_compatible_child(of_chosen, "simple-framebuffer");
+   of_platform_device_create(node, NULL, NULL);
+   of_node_put(node);


In v1, you supported "simple-framebuffer" on PPC. Don't we want to allow
that? Maybe no one cares ATM, but that could change. Either way:


Support for these framebuffers has always been mutually exclusive. The 
offb driver, which originally contained the code, depends on CONFIG_PPC. 
And PPC never supported simple-framebuffer anywhere.




Reviewed-by: Rob Herring 


Thank you.

Best regards
Thomas





+
+   /* Populate everything else. */
+   of_platform_default_populate(NULL, NULL, 

Re: [PATCH v2 1/2] of: Create platform devices for OF framebuffers

2022-04-19 Thread Rob Herring
On Tue, Apr 19, 2022 at 12:04:04PM +0200, Thomas Zimmermann wrote:
> Create a platform device for each OF-declared framebuffer and have
> offb bind to these devices. Allows for real hot-unplugging and other
> drivers besides offb.
> 
> Originally, offb created framebuffer devices while initializing its
> module by parsing the OF device tree. No actual Linux device was set
> up. This tied OF framebuffers to offb and makes writing other drivers
> for the OF framebuffers complicated. The absence of a Linux device
> further prevented real hot-unplugging. Adding a distinct platform
> device for each OF framebuffer solves both problems. Specifically, a
> DRM driver can now provide graphics output for modern userspace.
> 
> Some of the offb init code is now located in the OF initialization.
> There's now also an implementation of of_platform_default_populate_init(),
> which was missing before. The OF side creates different devices for
> either OF display nodes or BootX displays as they require different
> handling by the driver. The offb driver picks up each type of device
> and runs the appropriate fbdev initialization.
> 
> Tested with OF display nodes on qemu's ppc64le target.
> 
> v2:
>   * run PPC code as part of existing initialization (Rob)
>   * add a few more error warnings (Javier)
> 
> Signed-off-by: Thomas Zimmermann 
> Reviewed-by: Javier Martinez Canillas 
> ---
>  drivers/of/platform.c  | 88 ++
>  drivers/video/fbdev/offb.c | 98 +-
>  2 files changed, 132 insertions(+), 54 deletions(-)
> 
> diff --git a/drivers/of/platform.c b/drivers/of/platform.c
> index a16b74f32aa9..738ba2e2838c 100644
> --- a/drivers/of/platform.c
> +++ b/drivers/of/platform.c
> @@ -507,7 +507,6 @@ int of_platform_default_populate(struct device_node *root,
>  }
>  EXPORT_SYMBOL_GPL(of_platform_default_populate);
>  
> -#ifndef CONFIG_PPC
>  static const struct of_device_id reserved_mem_matches[] = {
>   { .compatible = "qcom,rmtfs-mem" },
>   { .compatible = "qcom,cmd-db" },
> @@ -520,33 +519,81 @@ static const struct of_device_id reserved_mem_matches[] 
> = {
>  
>  static int __init of_platform_default_populate_init(void)
>  {
> - struct device_node *node;
> -

As both if/else clauses need 'node', I'd keep this declared here.

>   device_links_supplier_sync_state_pause();
>  
>   if (!of_have_populated_dt())
>   return -ENODEV;
>  
> - /*
> -  * Handle certain compatibles explicitly, since we don't want to create
> -  * platform_devices for every node in /reserved-memory with a
> -  * "compatible",
> -  */
> - for_each_matching_node(node, reserved_mem_matches)
> - of_platform_device_create(node, NULL, NULL);
> + if (IS_ENABLED(CONFIG_PPC)) {
> + struct device_node *boot_display = NULL;
> + struct device_node *node;
> + struct platform_device *dev;
> + int ret;
> +
> + /* Check if we have a MacOS display without a node spec */
> + if (of_get_property(of_chosen, "linux,bootx-noscreen", NULL)) {
> + /*
> +  * The old code tried to work out which node was the 
> MacOS
> +  * display based on the address. I'm dropping that 
> since the
> +  * lack of a node spec only happens with old BootX 
> versions
> +  * (users can update) and with this code, they'll still 
> get
> +  * a display (just not the palette hacks).
> +  */
> + dev = platform_device_alloc("bootx-noscreen", 0);
> + if (WARN_ON(!dev))
> + return -ENOMEM;
> + ret = platform_device_add(dev);
> + if (WARN_ON(ret)) {
> + platform_device_put(dev);
> + return ret;
> + }
> + }
>  
> - node = of_find_node_by_path("/firmware");
> - if (node) {
> - of_platform_populate(node, NULL, NULL, NULL);
> - of_node_put(node);
> - }
> + /*
> +  * For OF framebuffers, first create the device for the boot 
> display,
> +  * then for the other framebuffers. Only fail for the boot 
> display;
> +  * ignore errors for the rest.
> +  */
> + for_each_node_by_type(node, "display") {
> + if (!of_get_property(node, "linux,opened", NULL) ||
> + !of_get_property(node, "linux,boot-display", NULL))
> + continue;
> + dev = of_platform_device_create(node, "of-display", 
> NULL);
> + if (WARN_ON(!dev))
> + return -ENOMEM;
> + boot_display = node;
> + break;
> +

Re: [PATCH RFC 1/8] stacktrace: Change callback prototype to pass more information

2022-04-19 Thread Mark Rutland
On Mon, Apr 18, 2022 at 09:22:10PM +0800, He Zhe wrote:
> Currently stack_trace_consume_fn can only have pc of each frame of the
> stack. Copying-beyond-the-frame-detection also needs fp of current and
> previous frame. Other detection algorithm in the future may need more
> information of the frame.
> 
> We define a frame_info to include them all.
> 
> 
> Signed-off-by: He Zhe 
> ---
>  include/linux/stacktrace.h |  9 -
>  kernel/stacktrace.c| 10 +-
>  2 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/stacktrace.h b/include/linux/stacktrace.h
> index 97455880ac41..5a61bfafe6f0 100644
> --- a/include/linux/stacktrace.h
> +++ b/include/linux/stacktrace.h
> @@ -10,15 +10,22 @@ struct pt_regs;
>  
>  #ifdef CONFIG_ARCH_STACKWALK
>  
> +struct frame_info {
> + unsigned long pc;
> + unsigned long fp;
> + unsigned long prev_fp;
> +};

I don't think this should be exposed through a generic interface; the `fp` and
`prev_fp` values are only meaningful with arch-specific knowledge, and they're
*very* easy to misuse (e.g. when transitioning from one stack to another).
There's also a bunch of other information one may or may not want, depending on
what you're trying to achieve.

I am happy to have an arch-specific internal unwinder where we can access this
information, and *maybe* it makes sense to have a generic API that passes some
opaque token, but I don't think we should make the structure generic.
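
(For illustration only, the "opaque token" idea could look roughly like the
below: the generic callback never sees fp/prev_fp directly, and arch code
provides accessors for whatever it can actually answer. Names are made up.)

struct unwind_state;	/* arch-private, opaque to generic code */

typedef bool (*stack_trace_consume_fn)(void *cookie,
				       const struct unwind_state *state);

/* e.g. the only generic accessor most users would need: */
unsigned long arch_stack_walk_pc(const struct unwind_state *state);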

Thanks,
Mark.

> +
>  /**
>   * stack_trace_consume_fn - Callback for arch_stack_walk()
>   * @cookie:  Caller supplied pointer handed back by arch_stack_walk()
>   * @addr:The stack entry address to consume
> + * @fi:  The frame information to consume
>   *
>   * Return:   True, if the entry was consumed or skipped
>   *   False, if there is no space left to store
>   */
> -typedef bool (*stack_trace_consume_fn)(void *cookie, unsigned long addr);
> +typedef bool (*stack_trace_consume_fn)(void *cookie, struct frame_info *fi);
>  /**
>   * arch_stack_walk - Architecture specific function to walk the stack
>   * @consume_entry:   Callback which is invoked by the architecture code for
> diff --git a/kernel/stacktrace.c b/kernel/stacktrace.c
> index 9ed5ce989415..2d0a2812e92b 100644
> --- a/kernel/stacktrace.c
> +++ b/kernel/stacktrace.c
> @@ -79,7 +79,7 @@ struct stacktrace_cookie {
>   unsigned intlen;
>  };
>  
> -static bool stack_trace_consume_entry(void *cookie, unsigned long addr)
> +static bool stack_trace_consume_entry(void *cookie, struct frame_info *fi)
>  {
>   struct stacktrace_cookie *c = cookie;
>  
> @@ -90,15 +90,15 @@ static bool stack_trace_consume_entry(void *cookie, 
> unsigned long addr)
>   c->skip--;
>   return true;
>   }
> - c->store[c->len++] = addr;
> + c->store[c->len++] = fi->pc;
>   return c->len < c->size;
>  }
>  
> -static bool stack_trace_consume_entry_nosched(void *cookie, unsigned long 
> addr)
> +static bool stack_trace_consume_entry_nosched(void *cookie, struct 
> frame_info *fi)
>  {
> - if (in_sched_functions(addr))
> + if (in_sched_functions(fi->pc))
>   return true;
> - return stack_trace_consume_entry(cookie, addr);
> + return stack_trace_consume_entry(cookie, fi);
>  }
>  
>  /**
> -- 
> 2.25.1
> 


Re: [PATCH v2 3/8] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE

2022-04-19 Thread David Hildenbrand
On 29.03.22 18:43, David Hildenbrand wrote:
> Let's use bit 3 to remember PG_anon_exclusive in swap ptes.
> 
> Signed-off-by: David Hildenbrand 
> ---

Looks like I ignored that 32bit uses a different (undocumented) swap layout
and bit 3 falls into the swp type. We'll restrict this to x86-64 for now, just
like for the other architectures.

The following seems to fly. @Andrew, let me know if you prefer a v3.


From bafb5ba914e89ad20c46f4e841a36909e610b81e Mon Sep 17 00:00:00 2001
From: David Hildenbrand 
Date: Wed, 9 Mar 2022 09:47:29 +0100
Subject: [PATCH] x86/pgtable: support __HAVE_ARCH_PTE_SWP_EXCLUSIVE on x86_64

Let's use bit 3 to remember PG_anon_exclusive in swap ptes.

Signed-off-by: David Hildenbrand 
---
 arch/x86/include/asm/pgtable.h  | 17 +
 arch/x86/include/asm/pgtable_64.h   |  4 +++-
 arch/x86/include/asm/pgtable_64_types.h |  5 +
 3 files changed, 25 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..a1c555abed26 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1291,6 +1291,23 @@ static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
unsigned long addr, pud_t *pud)
 {
 }
+#ifdef _PAGE_SWP_EXCLUSIVE
+#define __HAVE_ARCH_PTE_SWP_EXCLUSIVE
+static inline pte_t pte_swp_mkexclusive(pte_t pte)
+{
+   return pte_set_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+
+static inline int pte_swp_exclusive(pte_t pte)
+{
+   return pte_flags(pte) & _PAGE_SWP_EXCLUSIVE;
+}
+
+static inline pte_t pte_swp_clear_exclusive(pte_t pte)
+{
+   return pte_clear_flags(pte, _PAGE_SWP_EXCLUSIVE);
+}
+#endif /* _PAGE_SWP_EXCLUSIVE */
 
 #ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
 static inline pte_t pte_swp_mksoft_dirty(pte_t pte)
diff --git a/arch/x86/include/asm/pgtable_64.h 
b/arch/x86/include/asm/pgtable_64.h
index 56d0399a0cd1..e479491da8d5 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -186,7 +186,7 @@ static inline void native_pgd_clear(pgd_t *pgd)
  *
  * | ...| 11| 10|  9|8|7|6|5| 4| 3|2| 1|0| <- bit number
  * | ...|SW3|SW2|SW1|G|L|D|A|CD|WT|U| W|P| <- bit names
- * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| X|F|SD|0| <- swp entry
+ * | TYPE (59-63) | ~OFFSET (9-58)  |0|0|X|X| X| E|F|SD|0| <- swp entry
  *
  * G (8) is aliased and used as a PROT_NONE indicator for
  * !present ptes.  We need to start storing swap entries above
@@ -203,6 +203,8 @@ static inline void native_pgd_clear(pgd_t *pgd)
  * F (2) in swp entry is used to record when a pagetable is
  * writeprotected by userfaultfd WP support.
  *
+ * E (3) in swp entry is used to remember PG_anon_exclusive.
+ *
  * Bit 7 in swp entry should be 0 because pmd_present checks not only P,
  * but also L and G.
  *
diff --git a/arch/x86/include/asm/pgtable_64_types.h 
b/arch/x86/include/asm/pgtable_64_types.h
index 91ac10654570..70e360a2e5fb 100644
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -163,4 +163,9 @@ extern unsigned int ptrs_per_p4d;
 
 #define PGD_KERNEL_START   ((PAGE_SIZE / 2) / sizeof(pgd_t))
 
+/*
+ * We borrow bit 3 to remember PG_anon_exclusive.
+ */
+#define _PAGE_SWP_EXCLUSIVE	_PAGE_PWT
+
 #endif /* _ASM_X86_PGTABLE_64_DEFS_H */
-- 
2.35.1





-- 
Thanks,

David / dhildenb



[PATCH V3 2/2] powerpc/perf: Fix the power10 event alternatives array to have correct sort order

2022-04-19 Thread Athira Rajeev
When scheduling a group of events, there are constraint checks
done to make sure all events can go in a group. For example, one of
the criteria is that events in a group cannot use the same PMC.
But the platform-specific PMU supports alternative events for some
of the event codes. During perf_event_open, if any event
group doesn't match the constraint check criteria, a further lookup
is done to find an alternative event.

By current design, the array of alternative events in the PMU
code is expected to be sorted by column 0. This is because in
the find_alternative() function, the return criteria is based on
event code comparison, ie "event < ev_alt[i][0]". This
optimisation is there since find_alternative() can get called
multiple times. In the power10 PMU code, the alternative event array
is not a sorted list, and hence finding the alternative event breaks.

To work with the existing logic, fix the alternative event array
to be sorted by column 0 for power10-pmu.c.
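
For reference, the lookup works roughly like the sketch below (paraphrasing
the common find_alternative() helper); the early break on
"event < ev_alt[i][0]" is what requires column 0 to be sorted:

static int find_alternative(u64 event, const unsigned int ev_alt[][MAX_ALT],
			    int size)
{
	int i, j;

	for (i = 0; i < size; ++i) {
		/* Sorted column 0 lets the search stop early: if the first
		 * code of this row is already bigger, the event cannot be
		 * in any later row either. */
		if (event < ev_alt[i][0])
			break;

		for (j = 0; j < MAX_ALT && ev_alt[i][j]; ++j)
			if (event == ev_alt[i][j])
				return i;
	}

	return -1;
}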

Results:

In the case where an alternative event is not chosen when it could be,
events will be multiplexed, ie time sliced where they could
actually run concurrently.
For example, in power10 PM_INST_CMPL_ALT(0x2) has an alternative
event, PM_INST_CMPL(0x500fa). Without the fix, if a group of
events with PMC1 to PMC4 is used along with PM_INST_CMPL_ALT,
it will be time sliced since all programmable PMCs are
already consumed. But with the fix, when it picks the alternative
event on PMC5, all events will run concurrently.

<< Before Patch >>
 # perf stat -e r2,r100fc,r200fa,r300fc,r400fc
^C
 Performance counter stats for 'system wide':

 328668935  r2   (79.94%)
  56501024  r100fc   (79.95%)
  49564238  r200fa   (79.95%)
   376  r300fc   (80.19%)
   660  r400fc   (79.97%)

   4.039150522 seconds time elapsed

With the fix, since the alternative event is chosen to run
on PMC5, events will run concurrently.

<< After Patch >>
 # perf stat -e r2,r100fc,r200fa,r300fc,r400fc
^C
 Performance counter stats for 'system wide':

  23596607  r2
   4907738  r100fc
   2283608  r200fa
   135  r300fc
   248  r400fc

   1.664671390 seconds time elapsed

Fixes: a64e697cef23 ("powerpc/perf: power10 Performance Monitoring support")
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
Changelog:
 v1 -> v2:
 Added Fixes tag and reworded commit message
 Added Reviewed-by from Maddy
 v2 -> v3:
 Added info about what is the breakage with current 
 code.

 arch/powerpc/perf/power10-pmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/perf/power10-pmu.c b/arch/powerpc/perf/power10-pmu.c
index d3398100a60f..c6d51e7093cf 100644
--- a/arch/powerpc/perf/power10-pmu.c
+++ b/arch/powerpc/perf/power10-pmu.c
@@ -91,8 +91,8 @@ extern u64 PERF_REG_EXTENDED_MASK;
 
 /* Table of alternatives, sorted by column 0 */
 static const unsigned int power10_event_alternatives[][MAX_ALT] = {
-   { PM_CYC_ALT,   PM_CYC },
{ PM_INST_CMPL_ALT, PM_INST_CMPL },
+   { PM_CYC_ALT,   PM_CYC },
 };
 
 static int power10_get_alternatives(u64 event, unsigned int flags, u64 alt[])
-- 
2.35.1



[PATCH V3 1/2] powerpc/perf: Fix the power9 event alternatives array to have correct sort order

2022-04-19 Thread Athira Rajeev
When scheduling a group of events, there are constraint checks
done to make sure all events can go in a group. For example, one of
the criteria is that events in a group cannot use the same PMC.
But the platform-specific PMU supports alternative events for some
of the event codes. During perf_event_open, if any event
group doesn't match the constraint check criteria, a further lookup
is done to find an alternative event.

By current design, the array of alternative events in the PMU
code is expected to be sorted by column 0. This is because in
the find_alternative() function, the return criteria is based on
event code comparison, ie "event < ev_alt[i][0]". This
optimisation is there since find_alternative() can get called
multiple times. In the power9 PMU code, the alternative event array
is not a sorted list, and hence finding the alternative event breaks.

To work with the existing logic, fix the alternative event array
to be sorted by column 0 for power9-pmu.c.

Results:

With alternative events, multiplexing can be avoided. That is,
for example, in power9 PM_LD_MISS_L1 (0x3e054) has an alternative
event, PM_LD_MISS_L1_ALT (0x400f0). This is an identical event
which can be programmed on a different PMC.

<< Before patch >>

 # perf stat -e r3e054,r300fc
^C
 Performance counter stats for 'system wide':

   1057860  r3e054  (50.21%)
   379  r300fc  (49.79%)

   0.944329741 seconds time elapsed
   
Since both the events are using PMC3 in this case, they are
multiplexed here.

<< After patch >>

 # perf stat -e r3e054,r300fc
^C
 Performance counter stats for 'system wide':

   1006948  r3e054
   182  r300fc
<<>>

Fixes: 91e0bd1e6251 ("powerpc/perf: Add PM_LD_MISS_L1 and PM_BR_2PATH to power9 event list")
Signed-off-by: Athira Rajeev 
Reviewed-by: Madhavan Srinivasan 
---
Changelog:
 v1 -> v2:
 Added Fixes tag and reworded commit message.
 Added Reviewed-by from Maddy.
 v2 -> v3:
 Added info about what is the breakage with current
 code.

 arch/powerpc/perf/power9-pmu.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/perf/power9-pmu.c b/arch/powerpc/perf/power9-pmu.c
index c9eb5232e68b..c393e837648e 100644
--- a/arch/powerpc/perf/power9-pmu.c
+++ b/arch/powerpc/perf/power9-pmu.c
@@ -133,11 +133,11 @@ int p9_dd22_bl_ev[] = {
 
 /* Table of alternatives, sorted by column 0 */
 static const unsigned int power9_event_alternatives[][MAX_ALT] = {
-   { PM_INST_DISP, PM_INST_DISP_ALT },
-   { PM_RUN_CYC_ALT,   PM_RUN_CYC },
-   { PM_RUN_INST_CMPL_ALT, PM_RUN_INST_CMPL },
-   { PM_LD_MISS_L1,PM_LD_MISS_L1_ALT },
{ PM_BR_2PATH,  PM_BR_2PATH_ALT },
+   { PM_INST_DISP, PM_INST_DISP_ALT },
+   { PM_RUN_CYC_ALT,   PM_RUN_CYC },
+   { PM_LD_MISS_L1,PM_LD_MISS_L1_ALT },
+   { PM_RUN_INST_CMPL_ALT, PM_RUN_INST_CMPL },
 };
 
 static int power9_get_alternatives(u64 event, unsigned int flags, u64 alt[])
-- 
2.35.1



Re: [PATCH] misc: ocxl: fix possible double free in ocxl_file_register_afu

2022-04-19 Thread Hangyu Hua

On 2022/4/19 17:09, Frederic Barrat wrote:



On 18/04/2022 10:57, Hangyu Hua wrote:

info_release() will be called in device_unregister() when info->dev's
reference count drops to 0. So there is no need to call ocxl_afu_put() and
kfree() again.

Fix this by adding free_minor() and a return to the err_unregister error path.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend")

Signed-off-by: Hangyu Hua 
---



Thanks for fixing that error path!
I'm now thinking it would be cleaner to have the call to free_minor() in 
the info_release() callback but that would be another patch.

In any case:
Acked-by: Frederic Barrat 

   Fred



I think it is a good idea to use callbacks to handle all garbage
collection. And free_minor() is used only in ocxl_file_register_afu()
and ocxl_file_unregister_afu(), so this should only require minor changes.
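
For reference, a rough sketch of what moving the cleanup into the release
callback could look like (illustrative only; the real structure and helpers
live in drivers/misc/ocxl/file.c):

static void info_release(struct device *dev)
{
	struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev);

	free_minor(info);		/* moved here from the callers */
	ocxl_afu_put(info->afu);
	kfree(info);
}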


Thanks.




  drivers/misc/ocxl/file.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index d881f5e40ad9..6777c419a8da 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -556,7 +556,9 @@ int ocxl_file_register_afu(struct ocxl_afu *afu)
  err_unregister:
  ocxl_sysfs_unregister_afu(info); // safe to call even if register failed

+    free_minor(info);
  device_unregister(&info->dev);
+    return rc;
  err_put:
  ocxl_afu_put(afu);
  free_minor(info);


Re: [PATCH 3/3] KVM: Add helpers to wrap vcpu->srcu_idx and yell if it's abused

2022-04-19 Thread Maxim Levitsky
On Fri, 2022-04-15 at 00:43 +, Sean Christopherson wrote:
> Add wrappers to acquire/release KVM's SRCU lock when stashing the index
> in vcpu->src_idx, along with rudimentary detection of illegal usage,
> e.g. re-acquiring SRCU and thus overwriting vcpu->src_idx.  Because the
> SRCU index is (currently) either 0 or 1, illegal nesting bugs can go
> unnoticed for quite some time and only cause problems when the nested
> lock happens to get a different index.
> 
> Wrap the WARNs in PROVE_RCU=y, and make them ONCE, otherwise KVM will
> likely yell so loudly that it will bring the kernel to its knees.
> 
> Signed-off-by: Sean Christopherson 
> ---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c |  9 +
>  arch/powerpc/kvm/book3s_hv_nested.c| 16 +++
>  arch/powerpc/kvm/book3s_rtas.c |  4 ++--
>  arch/powerpc/kvm/powerpc.c |  4 ++--
>  arch/riscv/kvm/vcpu.c  | 16 +++
>  arch/riscv/kvm/vcpu_exit.c |  4 ++--
>  arch/s390/kvm/interrupt.c  |  4 ++--
>  arch/s390/kvm/kvm-s390.c   |  8 
>  arch/s390/kvm/vsie.c   |  4 ++--
>  arch/x86/kvm/x86.c | 28 --
>  include/linux/kvm_host.h   | 24 +-
>  11 files changed, 71 insertions(+), 50 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index e4ce2a35483f..42851c32ff3b 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -168,9 +168,10 @@ int kvmppc_mmu_walk_radix_tree(struct kvm_vcpu *vcpu, 
> gva_t eaddr,
>   return -EINVAL;
>   /* Read the entry from guest memory */
>   addr = base + (index * sizeof(rpte));
> - vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> +
> + kvm_vcpu_srcu_read_lock(vcpu);
>   ret = kvm_read_guest(kvm, addr, &rpte, sizeof(rpte));
> - srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (ret) {
>   if (pte_ret_p)
>   *pte_ret_p = addr;
> @@ -246,9 +247,9 @@ int kvmppc_mmu_radix_translate_table(struct kvm_vcpu 
> *vcpu, gva_t eaddr,
>  
>   /* Read the table to find the root of the radix tree */
>   ptbl = (table & PRTB_MASK) + (table_index * sizeof(entry));
> - vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   ret = kvm_read_guest(kvm, ptbl, &entry, sizeof(entry));
> - srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (ret)
>   return ret;
>  
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 9d373f8963ee..c943a051c6e7 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -306,10 +306,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   /* copy parameters in */
>   hv_ptr = kvmppc_get_gpr(vcpu, 4);
>   regs_ptr = kvmppc_get_gpr(vcpu, 5);
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   err = kvmhv_read_guest_state_and_regs(vcpu, &l2_hv, &l2_regs,
> hv_ptr, regs_ptr);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (err)
>   return H_PARAMETER;
>  
> @@ -410,10 +410,10 @@ long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu)
>   byteswap_hv_regs(&l2_hv);
>   byteswap_pt_regs(&l2_regs);
>   }
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   err = kvmhv_write_guest_state_and_regs(vcpu, &l2_hv, &l2_regs,
>  hv_ptr, regs_ptr);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (err)
>   return H_AUTHORITY;
>  
> @@ -600,16 +600,16 @@ long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu 
> *vcpu)
>   goto not_found;
>  
>   /* Write what was loaded into our buffer back to the L1 guest */
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   rc = kvm_vcpu_write_guest(vcpu, gp_to, buf, n);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + kvm_vcpu_srcu_read_unlock(vcpu);
>   if (rc)
>   goto not_found;
>   } else {
>   /* Load the data to be stored from the L1 guest into our buf */
> - vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> + kvm_vcpu_srcu_read_lock(vcpu);
>   rc = kvm_vcpu_read_guest(vcpu, gp_from, buf, n);
> - srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> + 

Re: [PATCH 1/3] KVM: x86: Don't re-acquire SRCU lock in complete_emulated_io()

2022-04-19 Thread Maxim Levitsky
On Fri, 2022-04-15 at 00:43 +, Sean Christopherson wrote:
> Don't re-acquire SRCU in complete_emulated_io() now that KVM acquires the
> lock in kvm_arch_vcpu_ioctl_run().  More importantly, don't overwrite
> vcpu->srcu_idx.  If the index acquired by complete_emulated_io() differs
> from the one acquired by kvm_arch_vcpu_ioctl_run(), KVM will effectively
> leak a lock and hang if/when synchronize_srcu() is invoked for the
> relevant grace period.
> 
> Fixes: 8d25b7beca7e ("KVM: x86: pull kvm->srcu read-side to 
> kvm_arch_vcpu_ioctl_run")
> Cc: sta...@vger.kernel.org
> Signed-off-by: Sean Christopherson 
> ---
>  arch/x86/kvm/x86.c | 7 +--
>  1 file changed, 1 insertion(+), 6 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab336f7c82e4..f35fe09de59d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10450,12 +10450,7 @@ static int vcpu_run(struct kvm_vcpu *vcpu)
>  
>  static inline int complete_emulated_io(struct kvm_vcpu *vcpu)
>  {
> - int r;
> -
> - vcpu->srcu_idx = srcu_read_lock(>kvm->srcu);
> - r = kvm_emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
> - srcu_read_unlock(>kvm->srcu, vcpu->srcu_idx);
> - return r;
> + return kvm_emulate_instruction(vcpu, EMULTYPE_NO_DECODE);
>  }
>  
>  static int complete_emulated_pio(struct kvm_vcpu *vcpu)

I wonder how this did work

Reviewed-by: Maxim Levitsky 

Best regards,
Maxim Levitsky



Re: [PATCH net-next v4 14/18] sfc: Remove usage of list iterator for list_add() after the loop body

2022-04-19 Thread Martin Habets
On Fri, Apr 15, 2022 at 02:29:43PM +0200, Jakob Koschel wrote:
> In preparation to limit the scope of a list iterator to the list
> traversal loop, use a dedicated pointer pointing to the location
> where the element should be inserted [1].
> 
> Before, the code implicitly used the head when no element was found
> when using >list. The new 'pos' variable is set to the list head
> by default and overwritten if the list exits early, marking the
> insertion point for list_add().
> 
> Link: 
> https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=ehreask5sqxpwr9y7k9sa6cwx...@mail.gmail.com/
>  [1]
> Signed-off-by: Jakob Koschel 
> ---
>  drivers/net/ethernet/sfc/rx_common.c | 7 +--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sfc/rx_common.c 
> b/drivers/net/ethernet/sfc/rx_common.c
> index 1b22c7be0088..716847ba7038 100644
> --- a/drivers/net/ethernet/sfc/rx_common.c
> +++ b/drivers/net/ethernet/sfc/rx_common.c
> @@ -556,6 +556,7 @@ efx_rx_packet_gro(struct efx_channel *channel, struct 
> efx_rx_buffer *rx_buf,
>  struct efx_rss_context *efx_alloc_rss_context_entry(struct efx_nic *efx)
>  {
>   struct list_head *head = >rss_context.list;
> + struct list_head *pos = head;

This violates the reverse Xmas tree policy. This definition should be
1 line further down.

Martin

>   struct efx_rss_context *ctx, *new;
>   u32 id = 1; /* Don't use zero, that refers to the master RSS context */
>  
> @@ -563,8 +564,10 @@ struct efx_rss_context 
> *efx_alloc_rss_context_entry(struct efx_nic *efx)
>  
>   /* Search for first gap in the numbering */
>   list_for_each_entry(ctx, head, list) {
> - if (ctx->user_id != id)
> + if (ctx->user_id != id) {
> + pos = &ctx->list;
>   break;
> + }
>   id++;
>   /* Check for wrap.  If this happens, we have nearly 2^32
>* allocated RSS contexts, which seems unlikely.
> @@ -582,7 +585,7 @@ struct efx_rss_context 
> *efx_alloc_rss_context_entry(struct efx_nic *efx)
>  
>   /* Insert the new entry into the gap */
>   new->user_id = id;
> - list_add_tail(&new->list, &ctx->list);
> + list_add_tail(&new->list, pos);
>   return new;
>  }
>  
> -- 
> 2.25.1


[PATCH v2 0/2] of: Register platform device for each framebuffer

2022-04-19 Thread Thomas Zimmermann
Move the detection of OF framebuffers from fbdev into of platform code
and register a Linux platform device for each framebuffer. Allows for
DRM-based OF drivers and real hot-unplugging of the framebuffer.

This patchset has been tested with qemu's ppc64le emulation, which
provides a framebuffer via OF display node. If someone has an older
32-bit system with BootX available, please test.

v2:
* integrate PPC code into generic platform setup (Rob)
* keep !device workaround with a warning (Javier, Daniel)

Thomas Zimmermann (2):
  of: Create platform devices for OF framebuffers
  fbdev: Warn in hot-unplug workaround for framebuffers without device

 drivers/of/platform.c| 88 +---
 drivers/video/fbdev/core/fbmem.c | 10 ++--
 drivers/video/fbdev/offb.c   | 98 +---
 3 files changed, 136 insertions(+), 60 deletions(-)


base-commit: d97978df553d768e457cb68c637b2b0a6188b87c
prerequisite-patch-id: c2b2f08f0eccc9f5df0c0da49fa1d36267deb11d
prerequisite-patch-id: c67e5d886a47b7d0266d81100837557fda34cb24
prerequisite-patch-id: 6e1032c6302461624f33194c8b8f37103a3fa6ef
prerequisite-patch-id: 3f204510fcbf9530d6540bd8e6128cce598988b6
prerequisite-patch-id: ab7611d28d07723ab1dd392dcf9a6345de3b1040
-- 
2.35.1



[PATCH v2 2/2] fbdev: Warn in hot-unplug workaround for framebuffers without device

2022-04-19 Thread Thomas Zimmermann
A workaround makes fbdev hot-unplugging work for framebuffers without
device. The only user for this feature was offb. As each OF framebuffer
now has an associated platform device, the workaround should no longer
be triggered. Update it with a warning and rewrite the comment. Fbdev
drivers that trigger the hot-unplug workaround really need to be fixed.

Signed-off-by: Thomas Zimmermann 
Suggested-by: Javier Martinez Canillas 
---
 drivers/video/fbdev/core/fbmem.c | 10 --
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/drivers/video/fbdev/core/fbmem.c b/drivers/video/fbdev/core/fbmem.c
index bc6ed750e915..84427470367b 100644
--- a/drivers/video/fbdev/core/fbmem.c
+++ b/drivers/video/fbdev/core/fbmem.c
@@ -1577,14 +1577,12 @@ static void do_remove_conflicting_framebuffers(struct apertures_struct *a,
 * allocate the memory range.
 *
 * If it's not a platform device, at least print a 
warning. A
-* fix would add code to remove the device from the 
system.
+* fix would add code to remove the device from the 
system. For
+* framebuffers without any Linux device, print a 
warning as
+* well.
 */
if (!device) {
-   /* TODO: Represent each OF framebuffer as its 
own
-* device in the device hierarchy. For now, offb
-* doesn't have such a device, so unregister the
-* framebuffer as before without warning.
-*/
+   pr_warn("fb%d: no device set\n", i);
do_unregister_framebuffer(registered_fb[i]);
} else if (dev_is_platform(device)) {
registered_fb[i]->forced_out = true;
-- 
2.35.1



[PATCH v2 1/2] of: Create platform devices for OF framebuffers

2022-04-19 Thread Thomas Zimmermann
Create a platform device for each OF-declared framebuffer and have
offb bind to these devices. Allows for real hot-unplugging and other
drivers besides offb.

Originally, offb created framebuffer devices while initializing its
module by parsing the OF device tree. No actual Linux device was set
up. This tied OF framebuffers to offb and made writing other drivers
for OF framebuffers complicated. The absence of a Linux device
further prevented real hot-unplugging. Adding a distinct platform
device for each OF framebuffer solves both problems. Specifically, a
DRM driver can now provide graphics output for modern userspace.

Some of the offb init code is now located in the OF initialization.
of_platform_default_populate_init() is now also implemented on PPC, where
it was previously missing. The OF side creates different devices for OF
display nodes and for BootX displays, as they require different handling
by the driver. The offb driver picks up each type of device and runs the
appropriate fbdev initialization.
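
For illustration, the driver-side binding then looks roughly like this
(hedged sketch, not the literal offb code; offb_probe_of_display() is a
placeholder and the "bootx-noscreen" driver is analogous):

#include <linux/module.h>
#include <linux/platform_device.h>

static int offb_probe_of_display(struct platform_device *pdev)
{
	/* pdev->dev.of_node is the OF display node created by of/platform.c;
	 * run the usual offb framebuffer setup against it here. */
	return 0;
}

static struct platform_driver offb_of_display_driver = {
	.driver = {
		/* matches the "of-display" devices created above */
		.name = "of-display",
	},
	.probe = offb_probe_of_display,
};
module_platform_driver(offb_of_display_driver);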

Tested with OF display nodes on qemu's ppc64le target.

v2:
* run PPC code as part of existing initialization (Rob)
* add a few more error warnings (Javier)

Signed-off-by: Thomas Zimmermann 
Reviewed-by: Javier Martinez Canillas 
---
 drivers/of/platform.c  | 88 ++
 drivers/video/fbdev/offb.c | 98 +-
 2 files changed, 132 insertions(+), 54 deletions(-)

diff --git a/drivers/of/platform.c b/drivers/of/platform.c
index a16b74f32aa9..738ba2e2838c 100644
--- a/drivers/of/platform.c
+++ b/drivers/of/platform.c
@@ -507,7 +507,6 @@ int of_platform_default_populate(struct device_node *root,
 }
 EXPORT_SYMBOL_GPL(of_platform_default_populate);
 
-#ifndef CONFIG_PPC
 static const struct of_device_id reserved_mem_matches[] = {
{ .compatible = "qcom,rmtfs-mem" },
{ .compatible = "qcom,cmd-db" },
@@ -520,33 +519,81 @@ static const struct of_device_id reserved_mem_matches[] = {
 
 static int __init of_platform_default_populate_init(void)
 {
-   struct device_node *node;
-
device_links_supplier_sync_state_pause();
 
if (!of_have_populated_dt())
return -ENODEV;
 
-   /*
-* Handle certain compatibles explicitly, since we don't want to create
-* platform_devices for every node in /reserved-memory with a
-* "compatible",
-*/
-   for_each_matching_node(node, reserved_mem_matches)
-   of_platform_device_create(node, NULL, NULL);
+   if (IS_ENABLED(CONFIG_PPC)) {
+   struct device_node *boot_display = NULL;
+   struct device_node *node;
+   struct platform_device *dev;
+   int ret;
+
+   /* Check if we have a MacOS display without a node spec */
+   if (of_get_property(of_chosen, "linux,bootx-noscreen", NULL)) {
+   /*
+* The old code tried to work out which node was the MacOS
+* display based on the address. I'm dropping that since the
+* lack of a node spec only happens with old BootX versions
+* (users can update) and with this code, they'll still get
+* a display (just not the palette hacks).
+*/
+   dev = platform_device_alloc("bootx-noscreen", 0);
+   if (WARN_ON(!dev))
+   return -ENOMEM;
+   ret = platform_device_add(dev);
+   if (WARN_ON(ret)) {
+   platform_device_put(dev);
+   return ret;
+   }
+   }
 
-   node = of_find_node_by_path("/firmware");
-   if (node) {
-   of_platform_populate(node, NULL, NULL, NULL);
-   of_node_put(node);
-   }
+   /*
+* For OF framebuffers, first create the device for the boot display,
+* then for the other framebuffers. Only fail for the boot display;
+* ignore errors for the rest.
+*/
+   for_each_node_by_type(node, "display") {
+   if (!of_get_property(node, "linux,opened", NULL) ||
+   !of_get_property(node, "linux,boot-display", NULL))
+   continue;
+   dev = of_platform_device_create(node, "of-display", NULL);
+   if (WARN_ON(!dev))
+   return -ENOMEM;
+   boot_display = node;
+   break;
+   }
+   for_each_node_by_type(node, "display") {
+   if (!of_get_property(node, "linux,opened", NULL) || node == boot_display)
+   continue;
+   

Re: [PATCH] misc: ocxl: fix possible double free in ocxl_file_register_afu

2022-04-19 Thread Frederic Barrat




On 18/04/2022 10:57, Hangyu Hua wrote:

info_release() will be called in device_unregister() when info->dev's
reference count is 0. So there is no need to call ocxl_afu_put() and
kfree() again.

Fix this by adding free_minor() and a return to the err_unregister error path.

Fixes: 75ca758adbaf ("ocxl: Create a clear delineation between ocxl backend & frontend")
Signed-off-by: Hangyu Hua 
---



Thanks for fixing that error path!
I'm now thinking it would be cleaner to have the call to free_minor() in 
the info_release() callback but that would be another patch.
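
Something along these lines, presumably (hedged sketch of that follow-up
idea, assuming the existing struct ocxl_file_info layout; not part of this
patch):

static void info_release(struct device *dev)
{
	struct ocxl_file_info *info = container_of(dev, struct ocxl_file_info, dev);

	free_minor(info);	/* would move here from the error/removal paths */
	ocxl_afu_put(info->afu);
	kfree(info);
}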

In any case:
Acked-by: Frederic Barrat 

  Fred



  drivers/misc/ocxl/file.c | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/drivers/misc/ocxl/file.c b/drivers/misc/ocxl/file.c
index d881f5e40ad9..6777c419a8da 100644
--- a/drivers/misc/ocxl/file.c
+++ b/drivers/misc/ocxl/file.c
@@ -556,7 +556,9 @@ int ocxl_file_register_afu(struct ocxl_afu *afu)
  
  err_unregister:

ocxl_sysfs_unregister_afu(info); // safe to call even if register failed
+   free_minor(info);
device_unregister(&info->dev);
+   return rc;
  err_put:
ocxl_afu_put(afu);
free_minor(info);


Re: [RFC v4 PATCH 4/5] powerpc/crash hp: add crash hotplug support for kexec_file_load

2022-04-19 Thread Sourabh Jain



On 14/04/22 22:02, Laurent Dufour wrote:

On 11/04/2022, 10:43:56, Sourabh Jain wrote:

Two major changes are done to enable the crash CPU hotplug handler.
Firstly, the kexec load path is updated to prepare the kimage for hotplug
changes; secondly, the crash hotplug handler is implemented.

On the kexec load path, the memsz allocation of the crash FDT segment is
updated to ensure that it has sufficient buffer space to accommodate future
hot-added CPUs, and the kimage members that track the kexec FDT segment are
initialized.

The crash hotplug handler updates the cpus node of the crash FDT. While the
crash FDT is being updated, kexec_crash_image is marked invalid and restored
after the update to avoid a race.

Since memory hotplug support for crash is not there yet, the crash hotplug
handler simply warns the user and returns.

Signed-off-by: Sourabh Jain 
---
  arch/powerpc/kexec/core_64.c | 46 ++
  arch/powerpc/kexec/elf_64.c  | 74 
  2 files changed, 120 insertions(+)

diff --git a/arch/powerpc/kexec/core_64.c b/arch/powerpc/kexec/core_64.c
index 249d2632526d..62f77cc86407 100644
--- a/arch/powerpc/kexec/core_64.c
+++ b/arch/powerpc/kexec/core_64.c
@@ -466,6 +466,52 @@ int update_cpus_node(void *fdt)
return ret;
  }
  
+#ifdef CONFIG_CRASH_HOTPLUG

+/**
+ * arch_crash_hotplug_handler() - Handle hotplug FDT changes
+ * @image: the active struct kimage
+ * @hp_action: the hot un/plug action being handled
+ * @a: first parameter dependent upon hp_action
+ * @b: second parameter dependent upon hp_action
+ *
+ * To accurately reflect CPU hot un/plug changes, the FDT
+ * must be updated with the new list of CPUs and memories.
+ */
+void arch_crash_hotplug_handler(struct kimage *image, unsigned int hp_action,
+   unsigned long a, unsigned long b)
+{
+   void *fdt;
+
+   /* No action needed for CPU hot-unplug */
+   if (hp_action == KEXEC_CRASH_HP_REMOVE_CPU)
+   return;

I should have reported this in the previous version too: why is nothing done
when CPUs are removed?


Since the CPU note addresses are already available for all possible CPUs
(regardless of whether they are online or not), the PHDR is allocated for all
possible CPUs during elfcorehdr creation, at least for the kexec_load system
call.

Now on the crash path, the crashing CPU initiates an IPI call to update the
CPU data of all online CPUs and jumps to the purgatory. Hence no action is
needed for the remove case.

With the above logic we shouldn't need to take any action for the CPU add
case either, right? But on the PowerPC early boot path there is a validation
that checks whether the boot CPU is part of the Flattened Device Tree (FDT).
If the boot CPU is not found in the FDT, the boot fails. Hence the FDT needs
to be updated for every new CPU added to the system, but no update is needed
when a CPU is removed.


Thanks
Sourabh Jain




+
+   /* crash update on memory hotplug is not supported yet */
+   if (hp_action == KEXEC_CRASH_HP_REMOVE_MEMORY || hp_action == KEXEC_CRASH_HP_ADD_MEMORY) {
+   pr_info_once("crash hp: crash update is not supported with memory hotplug\n");
+   return;
+   }
+
+   /* Must have valid FDT index */
+   if (!image->arch.fdt_index_valid) {
+   pr_err("crash hp: unable to locate FDT segment");
+   return;
+   }
+
+   fdt = __va((void *)image->segment[image->arch.fdt_index].mem);
+
+   /* Temporarily invalidate the crash image while it is replaced */
+   xchg(&kexec_crash_image, NULL);
+
+   /* update FDT to reflect changes to CPU resources */
+   if (update_cpus_node(fdt))
+   pr_err("crash hp: failed to update crash FDT");
+
+   /* The crash image is now valid once again */
+   xchg(&kexec_crash_image, image);
+}
+#endif /* CONFIG_CRASH_HOTPLUG */
+
  #ifdef CONFIG_PPC_64S_HASH_MMU
  /* Values we need to export to the second kernel via the device tree. */
  static unsigned long htab_base;
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index eeb258002d1e..9dc774548ce4 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -24,6 +24,67 @@
  #include 
  #include 
  
+#include 

+#include 
+
+#ifdef CONFIG_CRASH_HOTPLUG
+
+/**
+ * get_cpu_node_sz() - Calculate the space needed to store a CPU device type
+ *                     node in the FDT. The calculation is based on an existing
+ *                     CPU node in the unflattened device tree: loop through all
+ *                     the properties of the very first CPU type device node
+ *                     found and return the sum of the property length and
+ *                     property string size of all properties of that CPU node.
+ */

I don't think this is following the indent rules.


+static int get_cpu_node_sz(void) {
+   struct device_node *dn = NULL;
+   struct property *pp;
+   int cpu_node_size = 0;
+
+   dn 
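
The quoted hunk breaks off here; going by the kernel-doc above, the helper
presumably looks something like the following (illustrative sketch only,
assuming <linux/of.h>, not the submitted code):

static int get_cpu_node_sz(void)
{
	struct device_node *dn;
	struct property *pp;
	int cpu_node_size = 0;

	/* Use the very first CPU node as a template for the size estimate. */
	dn = of_find_node_by_type(NULL, "cpu");
	if (!dn)
		return 0;

	/* Sum property name and value sizes over all properties of the node. */
	for_each_property_of_node(dn, pp)
		cpu_node_size += strlen(pp->name) + pp->length;

	of_node_put(dn);
	return cpu_node_size;
}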

Re: [RFC v4 PATCH 3/5] powrepc/crash hp: update kimage_arch struct

2022-04-19 Thread Sourabh Jain



On 14/04/22 22:05, Laurent Dufour wrote:

On 11/04/2022, 10:43:55, Sourabh Jain wrote:

Two new members fdt_index and fdt_index_valid are added in kimage_arch
struct to track the FDT kexec segment. These new members of kimage_arch
struct will help the crash hotplug handler easily access the FDT
segment in the kexec segment array. Otherwise, we would have to loop through
all kexec segments to find the FDT segment.

Signed-off-by: Sourabh Jain 
---
  arch/powerpc/include/asm/kexec.h | 2 ++
  1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index e1288826e22e..19c2cab6a880 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -104,6 +104,8 @@ extern const struct kexec_file_ops kexec_elf64_ops;
  struct kimage_arch {
struct crash_mem *exclude_ranges;
  

#ifdef CONFIG_CRASH_HOTPLUG ?

+   int fdt_index;
+   bool fdt_index_valid;

#endif

It seems that this is only used when CONFIG_CRASH_HOTPLUG is defined, isn't it?


Yes it should be kept under CONFIG_CRASH_HOTPLUG config.

Thanks,
Sourabh Jain




unsigned long backup_start;
void *backup_buf;
void *fdt;


Re: [RFC v4 PATCH 2/5] powerpc/crash hp: introduce a new config option CRASH_HOTPLUG

2022-04-19 Thread Sourabh Jain



On 14/04/22 22:10, Laurent Dufour wrote:

On 11/04/2022, 10:43:54, Sourabh Jain wrote:

The option CRASH_HOTPLUG enables in-kernel updates to the kexec segments on
hotplug events.

All the updates needed on the capture kernel load path, for both the
kexec_load and kexec_file_load system calls, will be kept under this config.

Signed-off-by: Sourabh Jain 
Reviewed-by: Eric DeVolder 

Reviewed-by: Laurent Dufour 


Thanks for the review.

- Sourabh Jain




---
  arch/powerpc/Kconfig | 11 +++
  1 file changed, 11 insertions(+)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index b779603978e1..777db33f75b5 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -623,6 +623,17 @@ config FA_DUMP
  If unsure, say "y". Only special kernels like petitboot may
  need to say "N" here.
  
+config CRASH_HOTPLUG

+   bool "kernel updates of crash kexec segments"
+   depends on CRASH_DUMP && (HOTPLUG_CPU) && KEXEC_FILE
+   help
+ An efficient way to keep the capture kernel up-to-date with CPU
+ hotplug events. On a CPU hotplug event the kexec segments of the
+ capture kernel become stale and need to be updated with the latest
+ CPU data. With this option the kernel performs a minimal update of
+ only the relevant kexec segments on a CPU hotplug event, instead of
+ triggering a full capture kernel reload from userspace via a udev rule.
+
  config PRESERVE_FA_DUMP
bool "Preserve Firmware-assisted dump"
depends on PPC64 && PPC_POWERNV && !FA_DUMP


Re: [PATCH] macintosh: macio_asic: fix resource_size.cocci warnings

2022-04-19 Thread Uwe Kleine-König
On Thu, Apr 14, 2022 at 07:02:42AM -0700, Yihao Han wrote:
> drivers/macintosh/macio_asic.c:219:26-29: WARNING:
> Suspicious code. resource_size is maybe missing with res
> drivers/macintosh/macio_asic.c:221:26-29: WARNING:
> Suspicious code. resource_size is maybe missing with res

For log messages it's ok to overstep the line length limitation in commit
logs. IMHO adding newlines is worse; I'm not sure whether others have strong
opinions on this though.

> Use resource_size function on resource object instead of
> explicit computation.
> 
> Generated by: scripts/coccinelle/api/resource_size.cocci
> 
> Signed-off-by: Yihao Han 
> ---
>  drivers/macintosh/macio_asic.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/macintosh/macio_asic.c b/drivers/macintosh/macio_asic.c
> index 1943a007e2d5..260fccb3863e 100644
> --- a/drivers/macintosh/macio_asic.c
> +++ b/drivers/macintosh/macio_asic.c
> @@ -216,9 +216,9 @@ static int macio_resource_quirks(struct device_node *np, struct resource *res,
>   /* Some older IDE resources have bogus sizes */
>   if (of_node_name_eq(np, "IDE") || of_node_name_eq(np, "ATA") ||
>   of_node_is_type(np, "ide") || of_node_is_type(np, "ata")) {
> - if (index == 0 && (res->end - res->start) > 0xfff)
> + if (index == 0 && (resource_size(res)) > 0xfff)

You can drop the parenthesis around resource_size(res) here.
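
For reference, resource_size() is the helper from include/linux/ioport.h:

static inline resource_size_t resource_size(const struct resource *res)
{
	return res->end - res->start + 1;
}

i.e. it returns the full size of the resource, including the '+ 1'.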

Other than that looks fine,
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature


Re: [PATCH 2/2] fbdev: Remove hot-unplug workaround for framebuffers without device

2022-04-19 Thread Javier Martinez Canillas
Hello Thomas,

On 4/13/22 20:09, Thomas Zimmermann wrote:

[snip]

 index bc6ed750e915..bdd00d381bbc 100644
 --- a/drivers/video/fbdev/core/fbmem.c
 +++ b/drivers/video/fbdev/core/fbmem.c
 @@ -1579,14 +1579,7 @@ static void do_remove_conflicting_framebuffers(struct apertures_struct *a,
 * If it's not a platform device, at least print a warning. A
 * fix would add code to remove the device from the system.
 */
 -  if (!device) {
 -  /* TODO: Represent each OF framebuffer as its own
 -   * device in the device hierarchy. For now, offb
 -   * doesn't have such a device, so unregister the
 -   * framebuffer as before without warning.
 -   */
 -  do_unregister_framebuffer(registered_fb[i]);
>>>
>>> Maybe we could still keep this for a couple of releases but with a big
>>> warning that's not supported in case there are out-of-tree drivers out
>>> there that still do this ?
>>>
>>> Or at least a warning if the do_unregister_framebuffer() call is removed.
>>
>> Yeah dying while holding console_lock isn't fun, and not having a WARN_ON
>> + bail-out code pretty much forces bug reporters to do a bisect here to
>> give us something more than "machine dies at boot with no messages".
>>
>> I'd just outright keep the WARN_ON here for 1-2 years even to really make
>> sure we got all the bug reports, since often these older machines only
>> update onto LTS releases.
> 
> If that's what the consent is, I'll go with it.
> 
> I'm just not sure if we talk about the same problem. offb didn't have a 
> platform device, so we recently added this workaround with 'if 
> (!device)'.  All the other fbdev drivers have a platform device; and 
> anything else that could fail is out-of-tree. We don't really care about 
> those AFAIK.
>

Yes, agreed on the offb change, but I'm not really sure that we can just
ignore out-of-tree modules. I mean, you are right in theory, but I still
feel that we are changing a core behavior without giving people time to
sort things out if needed.

Since before commit 27599aacbaef ("fbdev: Hot-unplug firmware fb devices
on forced removal") registered FBs didn't need to have a device, but now
that will lead to a NULL pointer dereference in dev_is_platform(device).

And that change only landed in v5.18-rc1, so it is fairly recent.

I know that we follow 
https://www.kernel.org/doc/Documentation/process/stable-api-nonsense.rst
but still my opinion is that having a warning for a couple of releases
if registered_fb[i]->device is NULL, instead of just crashing would be
a better way to handle this.
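
In code that would roughly be (and is essentially what v2 of Thomas' series
above ends up doing):

	if (!device) {
		/* Keep the old unregister behaviour for now, but complain
		 * loudly so remaining users without a device get fixed. */
		pr_warn("fb%d: no device set\n", i);
		do_unregister_framebuffer(registered_fb[i]);
	} else if (dev_is_platform(device)) {
		/* ... existing platform-device handling ... */
	}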
 
> With offb converted, we could practically remove all of the checks here 
> and call platform_device_unregister() unconditionally.
>

Yes for mainline, but as mentioned I thought mostly about out-of-tree. If
folks agree that we shouldn't care about these, I'm Ok with that as well.

-- 
Best regards,

Javier Martinez Canillas
Linux Engineering
Red Hat



Re: [PATCH net-next v2] net: ethernet: Prepare cleanup of powerpc's asm/prom.h

2022-04-19 Thread Paolo Abeni
Hello,

Sorry for the late reply.

On Fri, 2022-04-15 at 10:39 +0200, Christophe Leroy wrote:
> powerpc's asm/prom.h brings some headers that it doesn't
> need itself.

It's probably my fault, but I really can't parse the above. Could you
please re-phrase?
> 
> In order to clean it up in a further step, first clean all
> files that include asm/prom.h
> 
> Some files don't need asm/prom.h at all. For those ones,
> just remove inclusion of asm/prom.h
> 
> Some files don't need any of the items provided by asm/prom.h,
> but need some of the headers included by asm/prom.h. For those
> ones, add the needed headers that are brought by asm/prom.h at
> the moment, then remove asm/prom.h

Do you mean a follow-up patch is needed to drop the asm/prom.h include
from such files, even if that include could be dropped now without any
further change?

If so, I suggest v3 should additionally drop the asm/prom.h include
where possible.
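
As a purely illustrative example of such a follow-up (not taken from this
series): a driver that only needs the OF helpers would drop asm/prom.h and
include what it actually uses, e.g.

-#include <asm/prom.h>
+#include <linux/of.h>		/* of_get_property() and friends */
+#include <linux/of_address.h>	/* of_iomap() */
+#include <linux/of_irq.h>	/* irq_of_parse_and_map() */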


Thanks!

Paolo