Re: [PATCH v4] lockdown,selinux: fix wrong subject in some SELinux lockdown checks

2021-09-15 Thread Paul Moore
On Mon, Sep 13, 2021 at 5:05 PM Paul Moore  wrote:
>
> On Mon, Sep 13, 2021 at 10:02 AM Ondrej Mosnacek  wrote:
> >
> > Commit 59438b46471a ("security,lockdown,selinux: implement SELinux
> > lockdown") added an implementation of the locked_down LSM hook to
> > SELinux, with the aim to restrict which domains are allowed to perform
> > operations that would breach lockdown.
> >
> > However, in several places the security_locked_down() hook is called in
> > situations where the current task isn't doing any action that would
> > directly breach lockdown, leading to SELinux checks that are basically
> > bogus.
> >
> > To fix this, add an explicit struct cred pointer argument to
> > security_lockdown() and define NULL as a special value to pass instead
> > of current_cred() in such situations. LSMs that take the subject
> > credentials into account can then fall back to some default or ignore
> > such calls altogether. In the SELinux lockdown hook implementation, use
> > SECINITSID_KERNEL in case the cred argument is NULL.
> >
> > Most of the callers are updated to pass current_cred() as the cred
> > pointer, thus maintaining the same behavior. The following callers are
> > modified to pass NULL as the cred pointer instead:
> > 1. arch/powerpc/xmon/xmon.c
> >  Seems to be some interactive debugging facility. It appears that
> >  the lockdown hook is called from interrupt context here, so it
> >  should be more appropriate to request a global lockdown decision.
> > 2. fs/tracefs/inode.c:tracefs_create_file()
> >  Here the call is used to prevent creating new tracefs entries when
> >  the kernel is locked down. Assumes that locking down is one-way -
> >  i.e. if the hook returns non-zero once, it will never return zero
> >  again, thus no point in creating these files. Also, the hook is
> >  often called by a module's init function when it is loaded by
> >  userspace, where it doesn't make much sense to do a check against
> >  the current task's creds, since the task itself doesn't actually
> >  use the tracing functionality (i.e. doesn't breach lockdown), just
> >  indirectly makes some new tracepoints available to whoever is
> >  authorized to use them.
> > 3. net/xfrm/xfrm_user.c:copy_to_user_*()
> >  Here a cryptographic secret is redacted based on the value returned
> >  from the hook. There are two possible actions that may lead here:
> >  a) A netlink message XFRM_MSG_GETSA with NLM_F_DUMP set - here the
> > task context is relevant, since the dumped data is sent back to
> > the current task.
> >  b) When adding/deleting/updating an SA via XFRM_MSG_xxxSA, the
> > dumped SA is broadcasted to tasks subscribed to XFRM events -
> > here the current task context is not relevant as it doesn't
> > represent the tasks that could potentially see the secret.
> >  It doesn't seem worth it to try to keep using the current task's
> >  context in the a) case, since the eventual data leak can be
> >  circumvented anyway via b), plus there is no way for the task to
> >  indicate that it doesn't care about the actual key value, so the
> >  check could generate a lot of "false alert" denials with SELinux.
> >  Thus, let's pass NULL instead of current_cred() here faute de
> >  mieux.
> >
> > Improvements-suggested-by: Casey Schaufler 
> > Improvements-suggested-by: Paul Moore 
> > Fixes: 59438b46471a ("security,lockdown,selinux: implement SELinux 
> > lockdown")
> > Acked-by: Dan Williams  [cxl]
> > Acked-by: Steffen Klassert  [xfrm]
> > Signed-off-by: Ondrej Mosnacek 
> > ---
> >
> > v4:
> > - rebase on top of TODO
> > - fix rebase conflicts:
> >   * drivers/cxl/pci.c
> > - trivial: the lockdown reason was corrected in mainline
> >   * kernel/bpf/helpers.c, kernel/trace/bpf_trace.c
> > - trivial: LOCKDOWN_BPF_READ was renamed to LOCKDOWN_BPF_READ_KERNEL
> >   in mainline
> >   * kernel/power/hibernate.c
> > - trivial: !secretmem_active() was added to the condition in
> >   hibernation_available()
> > - cover new security_locked_down() call in kernel/bpf/helpers.c
> >   (LOCKDOWN_BPF_WRITE_USER in BPF_FUNC_probe_write_user case)
> >
> > v3: 
> > https://lore.kernel.org/lkml/20210616085118.1141101-1-omosn...@redhat.com/
> > - add the cred argument to security_locked_down() and adapt all callers
> > - keep using current_cred() in BPF, as the hook calls have been shifted
> >   to program load time (commit ff40e51043af ("bpf, lockdown, audit: Fix
> >   buggy SELinux lockdown permission checks"))
> > - in SELinux, don't ignore hook calls where cred == NULL, but use
> >   SECINITSID_KERNEL as the subject instead
> > - update explanations in the commit message
> >
> > v2: 
> > https://lore.kernel.org/lkml/20210517092006.803332-1-omosn...@redhat.com/
> > - change to a single hook based on suggestions by Casey Schaufler
> >
> > v1: 
> > 

Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Jordan Niethe
On Thu, Sep 16, 2021 at 10:40 AM Christopher M. Riedl
 wrote:
>
> On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> >  wrote:
> > > ...
> > > +/*
> > > + * This can be called for kernel text or a module.
> > > + */
> > > +static int map_patch_mm(const void *addr, struct patch_mapping 
> > > *patch_mapping)
> > > +{
> > > +   struct page *page;
> > > +   struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > > +   unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > > +
> > > +   if (is_vmalloc_or_module_addr(addr))
> > > +   page = vmalloc_to_page(addr);
> > > +   else
> > > +   page = virt_to_page(addr);
> > > +
> > > +   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > > +_mapping->ptl);
> > > +   if (unlikely(!patch_mapping->ptep)) {
> > > +   pr_warn("map patch: failed to allocate pte for 
> > > patching\n");
> > > +   return -1;
> > > +   }
> > > +
> > > +   set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > > +  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
> >
> > I think because switch_mm_irqs_off() will not necessarily have a
> > barrier so a ptesync would be needed.
> > A spurious fault here from __patch_instruction() would not be handled
> > correctly.
>
> Sorry I don't quite follow - can you explain this to me in a bit more
> detail?

radix__set_pte_at() skips calling ptesync as an optimization.
If there is no ordering between changing the pte and then accessing
the page with __patch_instruction(), a spurious fault could be raised.
I think such a fault would end up being causing bad_kernel_fault() ->
true and would not be fixed up.

I thought there might be a barrier in switch_mm_irqs_off() that would
provide this ordering but afaics that is not always the case.

So I think that we need to have a ptesync after set_pte_at().


Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Jordan Niethe
On Thu, Sep 16, 2021 at 10:38 AM Christopher M. Riedl
 wrote:
>
> On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> > On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
> >  wrote:
> > >
> > > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > > address to be patched is temporarily mapped as writeable. Currently, a
> > > per-cpu vmalloc patch area is used for this purpose. While the patch
> > > area is per-cpu, the temporary page mapping is inserted into the kernel
> > > page tables for the duration of patching. The mapping is exposed to CPUs
> > > other than the patching CPU - this is undesirable from a hardening
> > > perspective. Use a temporary mm instead which keeps the mapping local to
> > > the CPU doing the patching.
> > >
> > > Use the `poking_init` init hook to prepare a temporary mm and patching
> > > address. Initialize the temporary mm by copying the init mm. Choose a
> > > randomized patching address inside the temporary mm userspace address
> > > space. The patching address is randomized between PAGE_SIZE and
> > > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> > >
> > > Bits of entropy with 64K page size on BOOK3S_64:
> > >
> > > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > >
> > > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > > bits of entropy = log2(128TB / 64K)
> > > bits of entropy = 31
> > >
> > > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > > available. Currently the Hash MMU does not use a temporary mm so
> > > technically this upper limit isn't necessary; however, a larger
> > > randomization range does not further "harden" this overall approach and
> > > future work may introduce patching with a temporary mm on Hash as well.
> > >
> > > Randomization occurs only once during initialization at boot for each
> > > possible CPU in the system.
> > >
> > > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > > respectively create and remove the temporary mapping with write
> > > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> > >
> > > Based on x86 implementation:
> > >
> > > commit 4fc19708b165
> > > ("x86/alternatives: Initialize temporary mm for patching")
> > >
> > > and:
> > >
> > > commit b3fd8e83ada0
> > > ("x86/alternatives: Use temporary mm for text poking")
> > >
> > > Signed-off-by: Christopher M. Riedl 
> > >
> > > ---
> > >
> > > v6:  * Small clean-ups (naming, formatting, style, etc).
> > >  * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> > >patching.
> > >  * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> > >
> > > v5:  * Only support Book3s64 Radix MMU for now.
> > >  * Use a per-cpu datastructure to hold the patching_addr and
> > >patching_mm to avoid the need for a synchronization lock/mutex.
> > >
> > > v4:  * In the previous series this was two separate patches: one to init
> > >the temporary mm in poking_init() (unused in powerpc at the time)
> > >and the other to use it for patching (which removed all the
> > >per-cpu vmalloc code). Now that we use poking_init() in the
> > >existing per-cpu vmalloc approach, that separation doesn't work
> > >as nicely anymore so I just merged the two patches into one.
> > >  * Preload the SLB entry and hash the page for the patching_addr
> > >when using Hash on book3s64 to avoid taking an SLB and Hash fault
> > >during patching. The previous implementation was a hack which
> > >changed current->mm to allow the SLB and Hash fault handlers to
> > >work with the temporary mm since both of those code-paths always
> > >assume mm == current->mm.
> > >  * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> > >have to manage the mm->context.active_cpus counter and mm cpumask
> > >since they determine (via mm_is_thread_local()) if the TLB flush
> > >in pte_clear() is local or not - it should always be local when
> > >we're using the temporary mm. On book3s64's Radix MMU we can
> > >just call local_flush_tlb_mm().
> > >  * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> > >KUAP.
> > > ---
> > >  arch/powerpc/lib/code-patching.c | 119 +--
> > >  1 file changed, 112 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/arch/powerpc/lib/code-patching.c 
> > > b/arch/powerpc/lib/code-patching.c
> > > index e802e42c2789..af8e2a02a9dd 100644
> > > --- a/arch/powerpc/lib/code-patching.c
> > > +++ b/arch/powerpc/lib/code-patching.c
> > > @@ -11,6 +11,7 @@
> > >  #include 
> > >  #include 
> > >  #include 
> > > +#include 
> > >
> > >  #include 
> > >  

Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Christopher M. Riedl
On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
>  wrote:
> > ... 
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping 
> > *patch_mapping)
> > +{
> > +   struct page *page;
> > +   struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +   unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +   if (is_vmalloc_or_module_addr(addr))
> > +   page = vmalloc_to_page(addr);
> > +   else
> > +   page = virt_to_page(addr);
> > +
> > +   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +_mapping->ptl);
> > +   if (unlikely(!patch_mapping->ptep)) {
> > +   pr_warn("map patch: failed to allocate pte for patching\n");
> > +   return -1;
> > +   }
> > +
> > +   set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
>
> I think because switch_mm_irqs_off() will not necessarily have a
> barrier so a ptesync would be needed.
> A spurious fault here from __patch_instruction() would not be handled
> correctly.

Sorry I don't quite follow - can you explain this to me in a bit more
detail?


Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Christopher M. Riedl
On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
>  wrote:
> >
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> >
> > Bits of entropy with 64K page size on BOOK3S_64:
> >
> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >
> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > bits of entropy = log2(128TB / 64K)
> > bits of entropy = 31
> >
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> >
> > Randomization occurs only once during initialization at boot for each
> > possible CPU in the system.
> >
> > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > and:
> >
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> >
> > Signed-off-by: Christopher M. Riedl 
> >
> > ---
> >
> > v6:  * Small clean-ups (naming, formatting, style, etc).
> >  * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> >patching.
> >  * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> >
> > v5:  * Only support Book3s64 Radix MMU for now.
> >  * Use a per-cpu datastructure to hold the patching_addr and
> >patching_mm to avoid the need for a synchronization lock/mutex.
> >
> > v4:  * In the previous series this was two separate patches: one to init
> >the temporary mm in poking_init() (unused in powerpc at the time)
> >and the other to use it for patching (which removed all the
> >per-cpu vmalloc code). Now that we use poking_init() in the
> >existing per-cpu vmalloc approach, that separation doesn't work
> >as nicely anymore so I just merged the two patches into one.
> >  * Preload the SLB entry and hash the page for the patching_addr
> >when using Hash on book3s64 to avoid taking an SLB and Hash fault
> >during patching. The previous implementation was a hack which
> >changed current->mm to allow the SLB and Hash fault handlers to
> >work with the temporary mm since both of those code-paths always
> >assume mm == current->mm.
> >  * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> >have to manage the mm->context.active_cpus counter and mm cpumask
> >since they determine (via mm_is_thread_local()) if the TLB flush
> >in pte_clear() is local or not - it should always be local when
> >we're using the temporary mm. On book3s64's Radix MMU we can
> >just call local_flush_tlb_mm().
> >  * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> >KUAP.
> > ---
> >  arch/powerpc/lib/code-patching.c | 119 +--
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index e802e42c2789..af8e2a02a9dd 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -11,6 +11,7 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> >
> >  #include 
> >  #include 
> > @@ -103,6 +104,7 @@ static inline void stop_using_temporary_mm(struct 
> > temp_mm *temp_mm)
> >
> >  static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> >  static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> > +static DEFINE_PER_CPU(struct 

Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU

2021-09-15 Thread Christopher M. Riedl
On Sat Sep 11, 2021 at 3:26 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
>  wrote:
> >
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> >
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> >
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use - this may include breakpoints set by perf.
>
> I had thought CPUs with a DAWR might not need to do this because the
> privilege level that breakpoints trigger on can be configured. But it
> turns out in ptrace, etc we use HW_BRK_TYPE_PRIV_ALL.

Thanks for double checking :)

>
> >
> > Based on x86 implementation:
> >
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> >
> > Signed-off-by: Christopher M. Riedl 
> >
> > ---
> >
> > v6:  * Use {start,stop}_using_temporary_mm() instead of
> >{use,unuse}_temporary_mm() as suggested by Christophe.
> >
> > v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
> >
> > v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
> >using/unusing the temp mm as suggested by Jann Horn to keep
> >the context.active counter in-sync on mm/nohash.
> >  * Disable SLB preload in the temporary mm when initializing the
> >temp_mm struct.
> >  * Include asm/debug.h header to fix build issue with
> >ppc44x_defconfig.
> > ---
> >  arch/powerpc/include/asm/debug.h |  1 +
> >  arch/powerpc/kernel/process.c|  5 +++
> >  arch/powerpc/lib/code-patching.c | 56 
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/debug.h 
> > b/arch/powerpc/include/asm/debug.h
> > index 86a14736c76c..dfd82635ea8b 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
> > *regs) { return 0; }
> >  #endif
> >
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >  bool ppc_breakpoint_available(void);
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 50436b52c213..6aa1f5c4d520 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct 
> > arch_hw_breakpoint *brk)
> > return 0;
> >  }
> >
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > +   memcpy(brk, this_cpu_ptr(_brk[nr]), sizeof(*brk));
> > +}
>
> The breakpoint code is already a little hard to follow. I'm worried
> doing this might spread breakpoint handling into more places in the
> future.
> What about something like having a breakpoint_pause() function which
> clears the hardware registers only and then a breakpoint_resume()
> function that copies from current_brk[] back to the hardware
> registers?
> Then we don't have to make another copy of the breakpoint state.

I think that sounds reasonable - I'll add those functions instead with
the next spin.

>
> > +
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> >  {
> > memcpy(this_cpu_ptr(_brk[nr]), brk, sizeof(*brk));
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index f9a3019e37b4..8d61a7d35b89 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
>
> Sorry I might have missed it, but what was the reason for not putting
> this stuff in mmu_context.h?

x86 ended up moving this code into their code-patching file as well. I
suppose nobody has thought of another use for a temporary mm like this
yet :)

>
> > @@ -17,6 +17,9 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include 
> > +#include 
> > +#include 
> >
> >  static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 
> > *patch_addr)
> >  {
> > @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst 
> > instr)
> >  }
> >
> >  #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +struct temp_mm {
> > +   struct 

Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-15 Thread Borislav Petkov
On Wed, Sep 15, 2021 at 07:18:34PM +0200, Christophe Leroy wrote:
> Could you please provide more explicit explanation why inlining such an
> helper is considered as bad practice and messy ?

Tom already told you to look at the previous threads. Let's read them
together. This one, for example:

https://lore.kernel.org/lkml/ysscwvpxevxw%2f...@infradead.org/

| > To take it out of line, I'm leaning towards the latter, creating a new
| > file that is built based on the ARCH_HAS_PROTECTED_GUEST setting.
| 
| Yes.  In general everytime architectures have to provide the prototype
| and not just the implementation of something we end up with a giant mess
| sooner or later.  In a few cases that is still warranted due to
| performance concerns, but i don't think that is the case here.

So I think what Christoph means here is that you want to have the
generic prototype defined in a header and arches get to implement it
exactly to the letter so that there's no mess.

As to what mess exactly, I'd let him explain that.

> Because as demonstrated in my previous response some days ago, taking that
> outline ends up with an unneccessary ugly generated code and we don't
> benefit front GCC's capability to fold in and opt out unreachable code.

And this is real fast path where a couple of instructions matter or what?

set_memory_encrypted/_decrypted doesn't look like one to me.

> I can't see your point here. Inlining the function wouldn't add any
> ifdeffery as far as I can see.

If the function is touching defines etc, they all need to be visible.
If that function needs to call other functions - which is the case on
x86, perhaps not so much on power - then you need to either ifdef around
them or provide stubs with ifdeffery in the headers. And you need to
make them global functions instead of keeping them static to the same
compilation unit, etc, etc.

With a separate compilation unit, you don't need any of that and it is
all kept in that single file.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-15 Thread Jan Beulich
On 15.09.2021 15:37, Roman Skakun wrote:
>>> From: Roman Skakun 
>>>
>>> It is possible when default IO TLB size is not
>>> enough to fit a long buffers as described here [1].
>>>
>>> This patch makes a way to set this parameter
>>> using cmdline instead of recompiling a kernel.
>>>
>>> [1] https://www.xilinx.com/support/answers/72694.html
>>
>>  I'm not convinced the swiotlb use describe there falls under "intended
>>  use" - mapping a 1280x720 framebuffer in a single chunk?
> 
> I had the same issue while mapping DMA chuck ~4MB for gem fb when
> using xen vdispl.
> I got the next log:
> [ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
> 3686400 bytes), total 32768 (slots), used 32 (slots)
> 
> It happened when I tried to map bounce buffer, which has a large size.
> The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
> bytes, but we requested 3686400 bytes.
> When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
> 2048(IO_TLB_SHIFT) = 4194304bytes).
> It makes possible to retrieve a bounce buffer for requested size.
> After changing this value, the problem is gone.

But the question remains: Why does the framebuffer need to be mapped
in a single giant chunk?

>>  In order to be sure to catch all uses like this one (including ones
>>  which make it upstream in parallel to yours), I think you will want
>>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
> 
> I don't understand your point. Can you clarify this?

There's a concrete present example: I have a patch pending adding
another use of IO_TLB_SEGSIZE. This use would need to be replaced
like you do here in several places. The need for the additional
replacement would be quite obvious (from a build failure) if you
renamed the manifest constant. Without renaming, it'll take
someone running into an issue on a live system, which I consider
far worse. This is because a simple re-basing of one of the
patches on top of the other will not point out the need for the
extra replacement, nor would a test build (with both patches in
place).

Jan



Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function

2021-09-15 Thread Kuppuswamy, Sathyanarayanan




On 9/15/21 9:46 AM, Borislav Petkov wrote:

Sathya,

if you want to prepare the Intel variant intel_cc_platform_has() ontop
of those and send it to me, that would be good because then I can
integrate it all in one branch which can be used to base future work
ontop.


I have a Intel variant patch (please check following patch). But it includes
TDX changes as well. Shall I move TDX changes to different patch and just
create a separate patch for adding intel_cc_platform_has()?


commit fc5f98a0ed94629d903827c5b44ee9295f835831
Author: Kuppuswamy Sathyanarayanan 
Date:   Wed May 12 11:35:13 2021 -0700

x86/tdx: Add confidential guest support for TDX guest

TDX architecture provides a way for VM guests to be highly secure and
isolated (from untrusted VMM). To achieve this requirement, any data
coming from VMM cannot be completely trusted. TDX guest fixes this
issue by hardening the IO drivers against the attack from the VMM.
So, when adding hardening fixes to the generic drivers, to protect
custom fixes use cc_platform_has() API.

Also add TDX guest support to cc_platform_has() API to protect the
TDX specific fixes.

Signed-off-by: Kuppuswamy Sathyanarayanan 


diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index a5b14de03458..2e78358923a1 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -871,6 +871,7 @@ config INTEL_TDX_GUEST
depends on SECURITY
select X86_X2APIC
select SECURITY_LOCKDOWN_LSM
+   select ARCH_HAS_CC_PLATFORM
help
  Provide support for running in a trusted domain on Intel processors
  equipped with Trusted Domain eXtensions. TDX is a new Intel
diff --git a/arch/x86/include/asm/intel_cc_platform.h 
b/arch/x86/include/asm/intel_cc_platform.h
new file mode 100644
index ..472c3174beac
--- /dev/null
+++ b/arch/x86/include/asm/intel_cc_platform.h
@@ -0,0 +1,13 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* Copyright (C) 2021 Intel Corporation */
+#ifndef _ASM_X86_INTEL_CC_PLATFORM_H
+#define _ASM_X86_INTEL_CC_PLATFORM_H
+
+#if defined(CONFIG_CPU_SUP_INTEL) && defined(CONFIG_ARCH_HAS_CC_PLATFORM)
+bool intel_cc_platform_has(unsigned int flag);
+#else
+static inline bool intel_cc_platform_has(unsigned int flag) { return false; }
+#endif
+
+#endif /* _ASM_X86_INTEL_CC_PLATFORM_H */
+
diff --git a/arch/x86/kernel/cc_platform.c b/arch/x86/kernel/cc_platform.c
index 3c9bacd3c3f3..e83bc2f48efe 100644
--- a/arch/x86/kernel/cc_platform.c
+++ b/arch/x86/kernel/cc_platform.c
@@ -10,11 +10,16 @@
 #include 
 #include 
 #include 
+#include 
+
+#include 

 bool cc_platform_has(enum cc_attr attr)
 {
if (sme_me_mask)
return amd_cc_platform_has(attr);
+   else if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL)
+   return intel_cc_platform_has(attr);

return false;
 }
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..ab486a3b1eb0 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -60,6 +61,21 @@ static u64 msr_test_ctrl_cache __ro_after_init;
  */
 static bool cpu_model_supports_sld __ro_after_init;

+#ifdef CONFIG_ARCH_HAS_CC_PLATFORM
+bool intel_cc_platform_has(enum cc_attr attr)
+{
+   switch (attr) {
+   case CC_ATTR_GUEST_TDX:
+   return cpu_feature_enabled(X86_FEATURE_TDX_GUEST);
+   default:
+   return false;
+   }
+
+   return false;
+}
+EXPORT_SYMBOL_GPL(intel_cc_platform_has);
+#endif
+
 /*
  * Processors which have self-snooping capability can handle conflicting
  * memory type across CPUs by snooping its own cache. However, there exists
diff --git a/include/linux/cc_platform.h b/include/linux/cc_platform.h
index 253f3ea66cd8..e38430e6e396 100644
--- a/include/linux/cc_platform.h
+++ b/include/linux/cc_platform.h
@@ -61,6 +61,15 @@ enum cc_attr {
 * Examples include SEV-ES.
 */
CC_ATTR_GUEST_STATE_ENCRYPT,
+
+   /**
+* @CC_ATTR_GUEST_TDX: Trusted Domain Extension Support
+*
+* The platform/OS is running as a TDX guest/virtual machine.
+*
+* Examples include SEV-ES.
+*/
+   CC_ATTR_GUEST_TDX,
 };

 #ifdef CONFIG_ARCH_HAS_CC_PLATFORM


--
Sathyanarayanan Kuppuswamy
Linux Kernel Developer


Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-15 Thread Christophe Leroy




Le 15/09/2021 à 12:08, Borislav Petkov a écrit :

On Wed, Sep 15, 2021 at 10:28:59AM +1000, Michael Ellerman wrote:

I don't love it, a new C file and an out-of-line call to then call back
to a static inline that for most configuration will return false ... but
whatever :)


Yeah, hch thinks it'll cause a big mess otherwise:

https://lore.kernel.org/lkml/ysscwvpxevxw%2f...@infradead.org/


Could you please provide more explicit explanation why inlining such an 
helper is considered as bad practice and messy ?


Because as demonstrated in my previous response some days ago, taking 
that outline ends up with an unneccessary ugly generated code and we 
don't benefit front GCC's capability to fold in and opt out unreachable 
code.


As pointed by Michael in most cases the function will just return false 
so behind the performance concern, there is also the code size and code 
coverage topic that is to be taken into account. And even when the 
function doesn't return false, the only thing it does folds into a 
single powerpc instruction so there is really no point in making a 
dedicated out-of-line fonction for that and suffer the cost and the size 
of a function call and to justify the addition of a dedicated C file.





I guess less ifdeffery is nice too.


I can't see your point here. Inlining the function wouldn't add any 
ifdeffery as far as I can see.


So, would you mind reconsidering your approach and allow architectures 
to provide inline implementation by just not enforcing a generic 
prototype ? Or otherwise provide more details and exemple of why the 
cons are more important versus the pros ?


Thanks
Christophe


Re: [PATCH v3 0/8] Implement generic cc_platform_has() helper function

2021-09-15 Thread Borislav Petkov
On Wed, Sep 08, 2021 at 05:58:31PM -0500, Tom Lendacky wrote:
> This patch series provides a generic helper function, cc_platform_has(),
> to replace the sme_active(), sev_active(), sev_es_active() and
> mem_encrypt_active() functions.
> 
> It is expected that as new confidential computing technologies are
> added to the kernel, they can all be covered by a single function call
> instead of a collection of specific function calls all called from the
> same locations.
> 
> The powerpc and s390 patches have been compile tested only. Can the
> folks copied on this series verify that nothing breaks for them. Also,
> a new file, arch/powerpc/platforms/pseries/cc_platform.c, has been
> created for powerpc to hold the out of line function.

...

> 
> Tom Lendacky (8):
>   x86/ioremap: Selectively build arch override encryption functions
>   mm: Introduce a function to check for confidential computing features
>   x86/sev: Add an x86 version of cc_platform_has()
>   powerpc/pseries/svm: Add a powerpc version of cc_platform_has()
>   x86/sme: Replace occurrences of sme_active() with cc_platform_has()
>   x86/sev: Replace occurrences of sev_active() with cc_platform_has()
>   x86/sev: Replace occurrences of sev_es_active() with cc_platform_has()
>   treewide: Replace the use of mem_encrypt_active() with
> cc_platform_has()

Ok, modulo the minor things the plan is to take this through tip after
-rc2 releases in order to pick up the powerpc build fix and have a clean
base (-rc2) to base stuff on, at the same time.

Pls holler if something's still amiss.

Sathya,

if you want to prepare the Intel variant intel_cc_platform_has() ontop
of those and send it to me, that would be good because then I can
integrate it all in one branch which can be used to base future work
ontop.

Thx.

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH v5 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance

2021-09-15 Thread Vincent Guittot
On Sat, 11 Sept 2021 at 03:19, Ricardo Neri
 wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li 
> Cc: Ben Segall 
> Cc: Daniel Bristot de Oliveira 
> Cc: Dietmar Eggemann 
> Cc: Mel Gorman 
> Cc: Quentin Perret 
> Cc: Rafael J. Wysocki 
> Cc: Srinivas Pandruvada 
> Cc: Steven Rostedt 
> Cc: Tim Chen 
> Reviewed-by: Joel Fernandes (Google) 
> Reviewed-by: Len Brown 
> Signed-off-by: Ricardo Neri 
> ---
> Changes since v4:
>   * Use sg_lb_stats::sum_nr_running the idle state of a scheduling group.
> (Vincent, Peter)
>   * Do not even idle CPUs in asym_smt_can_pull_tasks(). (Vincent)
>   * Updated function documentation and corrected a typo.
>
> Changes since v3:
>   * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processor no longer use asym_packing. (PeterZ)
>   * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
>   * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
>   * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
>   * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
>   * Reworded the commit message to reflect updates in code.
>   * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
>   * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
>   * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
>   * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
>  kernel/sched/fair.c | 94 +
>  1 file changed, 94 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 26db017c14a3..8d763dd0174b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8597,10 +8597,98 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
>  }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull 
> tasks
> + * @dst_cpu:   Destination CPU of the load balancing
> + * @sds:   Load-balancing data with statistics of the local group
> + * @sgs:   Load-balancing statistics of the candidate busiest group
> + * @sg:The candidate busiest group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks.
> + *
> + * If @dst_cpu does not have SMT siblings, it can pull tasks if two or more 
> of
> + * the SMT siblings of @sg are busy. If only one CPU in @sg is busy, pull 
> tasks
> + * only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, and @sg has exactly one more
> + * busy CPU than @sds::local, let @dst_cpu pull tasks if it has higher 
> priority.
> + * Bigger imbalances in the number of busy CPUs will be dealt with in
> + * update_sd_pick_busiest().
> + *
> + * If @sg does not have SMT siblings, only pull tasks if all of the SMT 
> siblings
> + * of @dst_cpu are idle and @sg has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> +   struct sg_lb_stats *sgs,
> +   struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> +   bool local_is_smt, sg_is_smt;
> +   int sg_busy_cpus;
> +
> +   local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> +   sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> +   sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> +   if (!local_is_smt) {
> +   /*
> +* If we are here, @dst_cpu is idle and does not have SMT
> +* siblings. Pull tasks if candidate group has two or more
> +* busy CPUs.
> +*/
> +   if (sg_is_smt && sg_busy_cpus >= 2)

Do you really need to test sg_is_smt ? if sg_busy_cpus >= 2 then
sd_is_smt must be true ?

Also, This is the default behavior where we want to even the number of
busy cpu. Shouldn't you return false and fall back to the default
behavior ?

That being said, the default 

[PATCH] powerpc: warn on emulation of dcbz instruction

2021-09-15 Thread Christophe Leroy
dcbz instruction shouldn't be used on non-cached memory. Using
it on non-cached memory can result in alignment exception and
implies a heavy handling.

Instead of silentely emulating the instruction and resulting in high
performance degradation, warn whenever an alignment exception is
taken due to dcbz, so that the user is made aware that dcbz
instruction has been used unexpectedly.

Reported-by: Stan Johnson 
Cc: Finn Thain 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/align.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kernel/align.c b/arch/powerpc/kernel/align.c
index bbb4181621dd..adc3a4a9c6e4 100644
--- a/arch/powerpc/kernel/align.c
+++ b/arch/powerpc/kernel/align.c
@@ -349,6 +349,7 @@ int fix_alignment(struct pt_regs *regs)
if (op.type != CACHEOP + DCBZ)
return -EINVAL;
PPC_WARN_ALIGNMENT(dcbz, regs);
+   WARN_ON_ONCE(1);
r = emulate_dcbz(op.ea, regs);
} else {
if (type == LARX || type == STCX)
-- 
2.31.1



[PATCH] powerpc/32s: Fix kuap_kernel_restore()

2021-09-15 Thread Christophe Leroy
At interrupt exit, kuap_kernel_restore() calls kuap_unclok() with the
value contained in regs->kuap. However, when regs->kuap contains
0x it means that KUAP was not unlocked so calling
kuap_unlock() is unrelevant and results in jeopardising the contents
of kernel space segment registers.

So check that regs->kuap doesn't contain KUAP_NONE before calling
kuap_unlock(). In the meantime it also means that if KUAP has not
been correcly locked back at interrupt exit, it must be locked
before continuing. This is done by checking the content of
current->thread.kuap which was returned by kuap_get_and_assert_locked()

Fixes: 16132529cee5 ("powerpc/32s: Rework Kernel Userspace Access Protection")
Reported-by: Stan Johnson 
Cc: Finn Thain 
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h | 8 
 1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index d4b145b279f6..9f38040f0641 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -136,6 +136,14 @@ static inline void kuap_kernel_restore(struct pt_regs 
*regs, unsigned long kuap)
if (kuap_is_disabled())
return;
 
+   if (unlikely(kuap != KUAP_NONE)) {
+   current->thread.kuap = KUAP_NONE;
+   kuap_lock(kuap, false);
+   }
+
+   if (likely(regs->kuap == KUAP_NONE))
+   return;
+
current->thread.kuap = regs->kuap;
 
kuap_unlock(regs->kuap, false);
-- 
2.31.1



Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-15 Thread Christoph Hellwig
On Wed, Sep 15, 2021 at 03:49:52PM +0200, Jan Beulich wrote:
> But the question remains: Why does the framebuffer need to be mapped
> in a single giant chunk?

More importantly: if you use dynamic dma mappings for your framebuffer
you're doing something wrong.


Re: [PATCH] swiotlb: set IO TLB segment size via cmdline

2021-09-15 Thread Roman Skakun
Hi Jan,

Thanks for the answer.

>> From: Roman Skakun 
>>
>> It is possible when default IO TLB size is not
>> enough to fit a long buffers as described here [1].
>>
>> This patch makes a way to set this parameter
>> using cmdline instead of recompiling a kernel.
>>
>> [1] https://www.xilinx.com/support/answers/72694.html
>
>  I'm not convinced the swiotlb use describe there falls under "intended
>  use" - mapping a 1280x720 framebuffer in a single chunk?

I had the same issue while mapping DMA chuck ~4MB for gem fb when
using xen vdispl.
I got the next log:
[ 142.030421] rcar-fcp fea2f000.fcp: swiotlb buffer is full (sz:
3686400 bytes), total 32768 (slots), used 32 (slots)

It happened when I tried to map bounce buffer, which has a large size.
The default size if 128(IO_TLB_SEGSIZE) * 2048(IO_TLB_SHIFT) = 262144
bytes, but we requested 3686400 bytes.
When I change IO_TLB_SEGSIZE to 2048. (2048(IO_TLB_SEGSIZE)  *
2048(IO_TLB_SHIFT) = 4194304bytes).
It makes possible to retrieve a bounce buffer for requested size.
After changing this value, the problem is gone.

>  the bottom of this page is also confusing, as following "Then we can
>  confirm the modified swiotlb size in the boot log:" there is a log
>  fragment showing the same original size of 64Mb.

I suspect, this is a mistake in the article.
According to 
https://elixir.bootlin.com/linux/v5.14.4/source/kernel/dma/swiotlb.c#L214
and
https://elixir.bootlin.com/linux/v5.15-rc1/source/kernel/dma/swiotlb.c#L182
The IO_TLB_SEGSIZE is not used to calculate total size of swiotlb area.
This explains why we have the same total size before and after changing of
TLB segment size.

>  In order to be sure to catch all uses like this one (including ones
>  which make it upstream in parallel to yours), I think you will want
>  to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.

I don't understand your point. Can you clarify this?

>> + /* get max IO TLB segment size */
>> + if (isdigit(*str)) {
>> + tmp = simple_strtoul(str, , 0);
>> + if (tmp)
>> + io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.

Yes, right, thanks!

Cheers,
Roman.

вт, 14 сент. 2021 г. в 18:29, Jan Beulich :
>
> On 14.09.2021 17:10, Roman Skakun wrote:
> > From: Roman Skakun 
> >
> > It is possible when default IO TLB size is not
> > enough to fit a long buffers as described here [1].
> >
> > This patch makes a way to set this parameter
> > using cmdline instead of recompiling a kernel.
> >
> > [1] https://www.xilinx.com/support/answers/72694.html
>
> I'm not convinced the swiotlb use describe there falls under "intended
> use" - mapping a 1280x720 framebuffer in a single chunk? (As an aside,
> the bottom of this page is also confusing, as following "Then we can
> confirm the modified swiotlb size in the boot log:" there is a log
> fragment showing the same original size of 64Mb.
>
> > --- a/arch/mips/cavium-octeon/dma-octeon.c
> > +++ b/arch/mips/cavium-octeon/dma-octeon.c
> > @@ -237,7 +237,7 @@ void __init plat_swiotlb_setup(void)
> >   swiotlbsize = 64 * (1<<20);
> >  #endif
> >   swiotlb_nslabs = swiotlbsize >> IO_TLB_SHIFT;
> > - swiotlb_nslabs = ALIGN(swiotlb_nslabs, IO_TLB_SEGSIZE);
> > + swiotlb_nslabs = ALIGN(swiotlb_nslabs, swiotlb_io_seg_size());
>
> In order to be sure to catch all uses like this one (including ones
> which make it upstream in parallel to yours), I think you will want
> to rename the original IO_TLB_SEGSIZE to e.g. IO_TLB_DEFAULT_SEGSIZE.
>
> > @@ -81,15 +86,30 @@ static unsigned int max_segment;
> >  static unsigned long default_nslabs = IO_TLB_DEFAULT_SIZE >> IO_TLB_SHIFT;
> >
> >  static int __init
> > -setup_io_tlb_npages(char *str)
> > +setup_io_tlb_params(char *str)
> >  {
> > + unsigned long tmp;
> > +
> >   if (isdigit(*str)) {
> > - /* avoid tail segment of size < IO_TLB_SEGSIZE */
> > - default_nslabs =
> > - ALIGN(simple_strtoul(str, , 0), IO_TLB_SEGSIZE);
> > + default_nslabs = simple_strtoul(str, , 0);
> >   }
> >   if (*str == ',')
> >   ++str;
> > +
> > + /* get max IO TLB segment size */
> > + if (isdigit(*str)) {
> > + tmp = simple_strtoul(str, , 0);
> > + if (tmp)
> > + io_tlb_seg_size = ALIGN(tmp, IO_TLB_SEGSIZE);
>
> From all I can tell io_tlb_seg_size wants to be a power of 2. Merely
> aligning to a multiple of IO_TLB_SEGSIZE isn't going to be enough.
>
> Jan
>


-- 
Best Regards, Roman.


[PATCH] video: fbdev: use memset_io() instead of memset()

2021-09-15 Thread Christophe Leroy
While investigating a lockup at startup on Powerbook 3400C, it was
identified that the fbdev driver generates alignment exception at
startup:

--- interrupt: 600 at memset+0x60/0xc0
NIP:  c0021414 LR: c03fc49c CTR: 7fff
REGS: ca021c10 TRAP: 0600   Tainted: GW  
(5.14.2-pmac-00727-g12a41fa69492)
MSR:  9032   CR: 44008442  XER: 2100
DAR: cab80020 DSISR: 00017c07
GPR00: 0007 ca021cd0 c14412e0 cab8  0010 cab8001c 
0004
GPR08: 0010 7fff   84008442  c0006fb4 

GPR16:        
0010
GPR24:  8180 0320 c15fa400 c14d1878  c14d1800 
c094e19c
NIP [c0021414] memset+0x60/0xc0
LR [c03fc49c] chipsfb_pci_init+0x160/0x580
--- interrupt: 600
[ca021cd0] [c03fc46c] chipsfb_pci_init+0x130/0x580 (unreliable)
[ca021d20] [c03a3a70] pci_device_probe+0xf8/0x1b8
[ca021d50] [c043d584] really_probe.part.0+0xac/0x388
[ca021d70] [c043d914] __driver_probe_device+0xb4/0x170
[ca021d90] [c043da18] driver_probe_device+0x48/0x144
[ca021dc0] [c043e318] __driver_attach+0x11c/0x1c4
[ca021de0] [c043ad30] bus_for_each_dev+0x88/0xf0
[ca021e10] [c043c724] bus_add_driver+0x190/0x22c
[ca021e40] [c043ee94] driver_register+0x9c/0x170
[ca021e60] [c0006c28] do_one_initcall+0x54/0x1ec
[ca021ed0] [c08246e4] kernel_init_freeable+0x1c0/0x270
[ca021f10] [c0006fdc] kernel_init+0x28/0x11c
[ca021f30] [c0017148] ret_from_kernel_thread+0x14/0x1c
Instruction dump:
7d4601a4 39490777 7d4701a4 39490888 7d4801a4 39490999 7d4901a4 39290aaa
7d2a01a4 4c00012c 4bfffe88 0fe0 <4bfffe80> 9421fff0 38210010 
48001970

This is due to 'dcbz' instruction being used on non-cached memory.
'dcbz' instruction is used by memset() to zeroize a complete
cacheline at once, and memset() is not expected to be used on non
cached memory.

When performing a 'sparse' check on fbdev driver, it also appears
that the use of memset() is unexpected:

drivers/video/fbdev/chipsfb.c:334:17: warning: incorrect type in 
argument 1 (different address spaces)
drivers/video/fbdev/chipsfb.c:334:17:expected void *
drivers/video/fbdev/chipsfb.c:334:17:got char [noderef] __iomem 
*screen_base
drivers/video/fbdev/chipsfb.c:334:15: warning: memset with byte count 
of 1048576

Use fb_memset() instead of memset(). fb_memset() is defined as
memset_io() for powerpc.

Fixes: 8c8709334cec ("[PATCH] ppc32: Remove CONFIG_PMAC_PBOOK")
Reported-by: Stan Johnson 
Cc: Finn Thain 
Signed-off-by: Christophe Leroy 
---
 drivers/video/fbdev/chipsfb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/video/fbdev/chipsfb.c b/drivers/video/fbdev/chipsfb.c
index 998067b701fa..393894af26f8 100644
--- a/drivers/video/fbdev/chipsfb.c
+++ b/drivers/video/fbdev/chipsfb.c
@@ -331,7 +331,7 @@ static const struct fb_var_screeninfo chipsfb_var = {
 
 static void init_chips(struct fb_info *p, unsigned long addr)
 {
-   memset(p->screen_base, 0, 0x10);
+   fb_memset(p->screen_base, 0, 0x10);
 
p->fix = chipsfb_fix;
p->fix.smem_start = addr;
-- 
2.31.1



[PATCH] serial: 8250: SERIAL_8250_FSL should not default to y when compile-testing

2021-09-15 Thread Geert Uytterhoeven
Commit b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
added compile-test support to the Freescale 16550 driver.  However, as
SERIAL_8250_FSL is an invisible symbol, merely enabling COMPILE_TEST now
enables this driver.

Fix this by making SERIAL_8250_FSL visible.  Tighten the dependencies to
prevent asking the user about this driver when configuring a kernel
without appropriate Freescale SoC or ACPI support.

Fixes: b1442c55ce8977aa ("serial: 8250: extend compile-test coverage")
Signed-off-by: Geert Uytterhoeven 
---
Yes, it's ugly, but I see no better solution. Do you?

 drivers/tty/serial/8250/Kconfig | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
index 808268edd2e82a45..a2978b31144e94f2 100644
--- a/drivers/tty/serial/8250/Kconfig
+++ b/drivers/tty/serial/8250/Kconfig
@@ -361,9 +361,13 @@ config SERIAL_8250_BCM2835AUX
  If unsure, say N.
 
 config SERIAL_8250_FSL
-   bool
+   bool "Freescale 16550-style UART support (8250 based driver)"
depends on SERIAL_8250_CONSOLE
-   default PPC || ARM || ARM64 || COMPILE_TEST
+   depends on FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI) 
|| COMPILE_TEST
+   default FSL_SOC || ARCH_LAYERSCAPE || SOC_LS1021A || (ARM64 && ACPI)
+   help
+ Selecting this option will add support for the 16550-style serial
+ port hardware found on Freescale SoCs.
 
 config SERIAL_8250_DW
tristate "Support for Synopsys DesignWare 8250 quirks"
-- 
2.25.1



Re: [PATCH] pci: Rename pcibios_add_device to match

2021-09-15 Thread Niklas Schnelle
On Tue, 2021-09-14 at 01:27 +1000, Oliver O'Halloran wrote:
> The general convention for pcibios_* hooks is that they're named after
> the corresponding pci_* function they provide a hook for. The exception
> is pcibios_add_device() which provides a hook for pci_device_add(). This
> has been irritating me for years so rename it.
> 
> Also, remove the export of the microblaze version. The only caller
> must be compiled as a built-in so there's no reason for the export.
> 
> Signed-off-by: Oliver O'Halloran 
> ---
>  arch/microblaze/pci/pci-common.c   | 3 +--
>  arch/powerpc/kernel/pci-common.c   | 2 +-
>  arch/powerpc/platforms/powernv/pci-sriov.c | 2 +-
>  arch/s390/pci/pci.c| 2 +-
>  arch/sparc/kernel/pci.c| 2 +-
>  arch/x86/pci/common.c  | 2 +-
>  drivers/pci/pci.c  | 4 ++--
>  drivers/pci/probe.c| 4 ++--
>  include/linux/pci.h| 2 +-
>  9 files changed, 11 insertions(+), 12 deletions(-)
> 
> 
.. snip ..
> diff --git a/arch/s390/pci/pci.c b/arch/s390/pci/pci.c
> index e7e6788d75a8..ded3321b7208 100644
> --- a/arch/s390/pci/pci.c
> +++ b/arch/s390/pci/pci.c
> @@ -561,7 +561,7 @@ static void zpci_cleanup_bus_resources(struct zpci_dev 
> *zdev)
>   zdev->has_resources = 0;
>  }
>  
> -int pcibios_add_device(struct pci_dev *pdev)
> +int pcibios_device_add(struct pci_dev *pdev)
>  {
>   struct zpci_dev *zdev = to_zpci(pdev);
>   struct resource *res;
> 
.. snip ..
>


I agree with your assesment this is indeed confusing. Interestingly
pcibios_release_device() also doesn't follow the convention exactly as
it is called from pci_release_dev() but at least that isn't very
confusing.

So for the arch/s390/pci bit:

Acked-by: Niklas Schnelle 



Re: [PATCH v3 4/8] powerpc/pseries/svm: Add a powerpc version of cc_platform_has()

2021-09-15 Thread Borislav Petkov
On Wed, Sep 15, 2021 at 10:28:59AM +1000, Michael Ellerman wrote:
> I don't love it, a new C file and an out-of-line call to then call back
> to a static inline that for most configuration will return false ... but
> whatever :)

Yeah, hch thinks it'll cause a big mess otherwise:

https://lore.kernel.org/lkml/ysscwvpxevxw%2f...@infradead.org/

I guess less ifdeffery is nice too.

> Acked-by: Michael Ellerman  (powerpc)

Thx.

> Yeah, fixed in mainline today, thanks for trying to cross compile :)

Always!

:-)

-- 
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


Re: [PATCH trivial v2] powerpc/powernv/dump: Fix typo in comment

2021-09-15 Thread Joel Stanley
On Tue, 14 Sept 2021 at 14:38, Vasant Hegde
 wrote:
>
> Signed-off-by: Vasant Hegde 

Reviewed-by: Joel Stanley 

> ---
>  arch/powerpc/platforms/powernv/opal-dump.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-dump.c 
> b/arch/powerpc/platforms/powernv/opal-dump.c
> index 00c5a59d82d9..717d1d30ade5 100644
> --- a/arch/powerpc/platforms/powernv/opal-dump.c
> +++ b/arch/powerpc/platforms/powernv/opal-dump.c
> @@ -419,7 +419,7 @@ void __init opal_platform_dump_init(void)
> int rc;
> int dump_irq;
>
> -   /* ELOG not supported by firmware */
> +   /* Dump not supported by firmware */
> if (!opal_check_token(OPAL_DUMP_READ))
> return;
>
> --
> 2.31.1
>


Re: [PATCH] powerpc/powernv/flash: Check OPAL flash calls exist before using

2021-09-15 Thread Vasant Hegde

On 9/15/21 11:53 AM, Michael Ellerman wrote:

Vasant Hegde  writes:

Currently only FSP based powernv systems supports firmware update
interfaces. Hence check that the token OPAL_FLASH_VALIDATE exists
before initalising the flash driver.

Signed-off-by: Vasant Hegde 
---
  arch/powerpc/platforms/powernv/opal-flash.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-flash.c 
b/arch/powerpc/platforms/powernv/opal-flash.c
index 7e7d38b17420..05490fc22fae 100644
--- a/arch/powerpc/platforms/powernv/opal-flash.c
+++ b/arch/powerpc/platforms/powernv/opal-flash.c
@@ -520,6 +520,10 @@ void __init opal_flash_update_init(void)
  {
int ret;
  
+	/* Firmware update is not supported by firmware */

+   if (!opal_check_token(OPAL_FLASH_VALIDATE))
+   return;
+




Michael,


That will mean the following files no longer appear on BMC systems:

   /sys/firmware/opal/image
   /sys/firmware/opal/validate_flash
   /sys/firmware/opal/manage_flash
   /sys/firmware/opal/update_flash

Presumably those files don't actually work correctly, but are we sure
their mere existence isn't used by anything at all?


That's correct. We never used these files/interfaces on BMC based systems.



We've had trouble in the past where removing sysfs files breaks tools
unexpectedly, see smt_snooze_delay.


AFAIK only update_flash uses these interfaces on baremetal systems.
This change shouldn't break update_flash as these interfaces never used/worked
on BMC based powernv systems.

-Vasant



Re: [PATCH] powerpc/powernv/flash: Check OPAL flash calls exist before using

2021-09-15 Thread Vasant Hegde

On 9/15/21 11:53 AM, Michael Ellerman wrote:

Vasant Hegde  writes:

Currently only FSP based powernv systems supports firmware update
interfaces. Hence check that the token OPAL_FLASH_VALIDATE exists
before initalising the flash driver.

Signed-off-by: Vasant Hegde 
---
  arch/powerpc/platforms/powernv/opal-flash.c | 4 
  1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/platforms/powernv/opal-flash.c 
b/arch/powerpc/platforms/powernv/opal-flash.c
index 7e7d38b17420..05490fc22fae 100644
--- a/arch/powerpc/platforms/powernv/opal-flash.c
+++ b/arch/powerpc/platforms/powernv/opal-flash.c
@@ -520,6 +520,10 @@ void __init opal_flash_update_init(void)
  {
int ret;
  
+	/* Firmware update is not supported by firmware */

+   if (!opal_check_token(OPAL_FLASH_VALIDATE))
+   return;
+




Michael,


That will mean the following files no longer appear on BMC systems:

   /sys/firmware/opal/image
   /sys/firmware/opal/validate_flash
   /sys/firmware/opal/manage_flash
   /sys/firmware/opal/update_flash

Presumably those files don't actually work correctly, but are we sure
their mere existence isn't used by anything at all?


That's correct. We never used these files/interfaces on BMC based systems.



We've had trouble in the past where removing sysfs files breaks tools
unexpectedly, see smt_snooze_delay.


AFAIK only update_flash uses these interfaces on baremetal systems.
This change shouldn't break update_flash as these interfaces never used/worked
on BMC based powernv systems.

-Vasant



Re: [PATCH] powerpc/powernv/flash: Check OPAL flash calls exist before using

2021-09-15 Thread Michael Ellerman
Vasant Hegde  writes:
> Currently only FSP based powernv systems supports firmware update
> interfaces. Hence check that the token OPAL_FLASH_VALIDATE exists
> before initalising the flash driver.
>
> Signed-off-by: Vasant Hegde 
> ---
>  arch/powerpc/platforms/powernv/opal-flash.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/platforms/powernv/opal-flash.c 
> b/arch/powerpc/platforms/powernv/opal-flash.c
> index 7e7d38b17420..05490fc22fae 100644
> --- a/arch/powerpc/platforms/powernv/opal-flash.c
> +++ b/arch/powerpc/platforms/powernv/opal-flash.c
> @@ -520,6 +520,10 @@ void __init opal_flash_update_init(void)
>  {
>   int ret;
>  
> + /* Firmware update is not supported by firmware */
> + if (!opal_check_token(OPAL_FLASH_VALIDATE))
> + return;
> +

That will mean the following files no longer appear on BMC systems:

  /sys/firmware/opal/image
  /sys/firmware/opal/validate_flash
  /sys/firmware/opal/manage_flash
  /sys/firmware/opal/update_flash

Presumably those files don't actually work correctly, but are we sure
their mere existence isn't used by anything at all?

We've had trouble in the past where removing sysfs files breaks tools
unexpectedly, see smt_snooze_delay.

cheers