Re: [PATCH v8 5/6] powerpc/code-patching: Use temporary mm for Radix MMU

2022-10-24 Thread Christopher M. Riedl
On Mon Oct 24, 2022 at 12:17 AM CDT, Benjamin Gray wrote:
> On Mon, 2022-10-24 at 14:45 +1100, Russell Currey wrote:
> > On Fri, 2022-10-21 at 16:22 +1100, Benjamin Gray wrote:
> > > From: "Christopher M. Riedl" 
> > >

-%<--

> > >
> > > ---
> >
> > Is the section following the --- your addendum to Chris' patch?  That
> > cuts it off from git, including your signoff.  It'd be better to have
> > it together as one commit message and note the bits you contributed
> > below the --- after your signoff.
> >
> > Commits where you're modifying someone else's previous work should
> > include their signoff above yours, as well.
>
> Addendum to his wording, to break it off from the "From..." section
> (which is me splicing together his comments from previous patches with
> some minor changes to account for the patch changes). I found out
> earlier today that Git will treat it as a comment :(
>
> I'll add the signed off by back, I wasn't sure whether to leave it
> there after making changes (same in patch 2).
>  

This commit has lots of my words, so it should probably keep the sign-off - if only
to guarantee that blame is properly directed at me for any nonsense therein ^^.

Patch 2 probably doesn't need my sign-off any more - iirc, I actually defended
the BUG_ON()s (which are WARN_ON()s now) at some point.


Re: Fwd: Fwd: X stopped working with 5.14 on iBook

2021-11-02 Thread Christopher M. Riedl
On Mon Nov 1, 2021 at 9:20 PM CDT, Finn Thain wrote:
> Hi Christopher,
>
> After many builds and tests, Stan and I were able to determine that this
> regression only affects builds with CONFIG_USER_NS=y. That is,
>
> d3ccc9781560 + CONFIG_USER_NS=y --> fail
> d3ccc9781560 + CONFIG_USER_NS=n --> okay
> d3ccc9781560~ + CONFIG_USER_NS=y --> okay
> d3ccc9781560~ + CONFIG_USER_NS=n --> okay
>
> Stan also tested a PowerMac G3 system and found that the regression is
> not present there. Thus far, only PowerMac G4 systems are known to be
> affected (Stan's Cube and Riccardo's PowerBook).
>
> I asked Stan to try v5.15-rc after reverting commit d3ccc9781560.
> Unexpectedly, this build had the same issue. So, it appears there are
> multiple bad commits that produce this Xorg failure, of which
> d3ccc9781560 is just the first.
>
> But there's no easy way to identify the other bad commits using
> bisection. So I've addressed this message to you. Can you help fix this
> regression?

Hi,

I switched email addresses a few times since that patch - also, I am no
longer employed at IBM, so that @linux.ibm.com address doesn't work
either. In any case, I'll take a look and see if I can figure out what's
going on. I do actually have a PowerBook G4 here (if it can be coaxed to
boot) that could help me root cause this.

Thanks!
Chris R.

>
> Regards,
> Finn
>
> On Fri, 22 Oct 2021, Christophe Leroy wrote:
>
> > ...
> > > 
> > > -------- Forwarded Message --------
> > > Subject: Fwd: X stopped working with 5.14 on iBook
> > > Date: Fri, 22 Oct 2021 11:35:21 -0600
> > > From: Stan Johnson
> > > To: Christopher M. Riedl 
> > > CC: Finn Thain 
> > > 
> > > Hello Christopher Riedl,
> > > 
> > > Please see the message below, in which a git bisect identifies a commit
> > > which may have stopped X from working on some PowerPC G4 systems
> > > (specifically the G4 PowerBook and Cube, possibly others).
> > > 
> > > I'm not sure how to proceed with further tests. If the identified commit
> > > could not have caused the problem, then further testing may be needed.
> > > Please let me know if you need any additional information.
> > > 
> > > Hopefully your e-mail filter will allow messages from yahoo.com addresses.
> > > 
> > > thanks for your help
> > > 
> > > -Stan Johnson
> > > 
> > > -------- Forwarded Message --------
> > > Subject: Re: X stopped working with 5.14 on iBook
> > > Date: Fri, 22 Oct 2021 11:25:14 -0600
> > > From: Stan Johnson
> > > To: debian-powe...@lists.debian.org
> > > CC: Riccardo Mottola 
> > > 
> > > On 10/14/21 9:21 PM, Stan Johnson wrote:
> > > > ...
> > > > Debian's 5.10.0-8 config file works (as expected) with Debian's 5.10.0-8
> > > > kernel source.
> > > > ...
> > > > X works with 5.14 using a tuned config file derived from 5.13 testing.
> > > > ...
> > > 
> > > Update:
> > > 
> > > The issue originally reported by Riccardo Mottola was that X wasn't
> > > working on a PowerBook G4 using Debian's default
> > > vmlinux-5.14.0-2-powerpc kernel. I was able to confirm that the X
> > > failure also occurs on a G4 Cube. My G4 Cube has Debian SID,
> > > sysvinit-core, Xfce and wdm installed. To test whether X works, I
> > > disabled wdm, then I log in at the text console and run "startx". When X
> > > fails, the screen goes blank and the backlight stays on; when X works,
> > > the normal desktop comes up.
> > > 
> > > X works in mainline v5.12 built using a config file based on Debian's
> > > config-5.10.0-8-powerpc.
> > > 
> > > X fails in mainline v5.13 built using a config file based on Debian's
> > > config-5.10.0-8-powerpc.
> > > 
> > > With much help and advice from Finn Thain, I was able to run a bisect
> > > using a config file based on Debian's config-5.10.0-8-powerpc, with
> > > v5.12 "good" and v5.13 "bad".
> > > 
> > > $ git reset --hard
> > > HEAD is now at 62fb9874f5da Linux 5.13
> > > $ git bisect start v5.13
> > > Updating files: 100% (12992/12992), done.
> > > Previous HEAD position was 62fb9874f5da Linux 5.13
> > > HEAD is now at 9f4ad9e425a1 Linux 5.12
> > > $ git bisect bad v5.13
> > > $ git bisect good v5.12
> > > Bisecting: 8739 revisions le

Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Christopher M. Riedl
On Tue Sep 14, 2021 at 11:24 PM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
>  wrote:
> > ... 
> > +/*
> > + * This can be called for kernel text or a module.
> > + */
> > +static int map_patch_mm(const void *addr, struct patch_mapping 
> > *patch_mapping)
> > +{
> > +   struct page *page;
> > +   struct mm_struct *patching_mm = __this_cpu_read(cpu_patching_mm);
> > +   unsigned long patching_addr = __this_cpu_read(cpu_patching_addr);
> > +
> > +   if (is_vmalloc_or_module_addr(addr))
> > +   page = vmalloc_to_page(addr);
> > +   else
> > +   page = virt_to_page(addr);
> > +
> > +   patch_mapping->ptep = get_locked_pte(patching_mm, patching_addr,
> > +&patch_mapping->ptl);
> > +   if (unlikely(!patch_mapping->ptep)) {
> > +   pr_warn("map patch: failed to allocate pte for patching\n");
> > +   return -1;
> > +   }
> > +
> > +   set_pte_at(patching_mm, patching_addr, patch_mapping->ptep,
> > +  pte_mkdirty(mk_pte(page, PAGE_KERNEL)));
>
> I think that because switch_mm_irqs_off() will not necessarily have a
> barrier, a ptesync would be needed.
> A spurious fault here from __patch_instruction() would not be handled
> correctly.

Sorry I don't quite follow - can you explain this to me in a bit more
detail?


Re: [PATCH v6 4/4] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-09-15 Thread Christopher M. Riedl
On Sat Sep 11, 2021 at 4:14 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:39 PM Christopher M. Riedl
>  wrote:
> >
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> >
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> >
> > Bits of entropy with 64K page size on BOOK3S_64:
> >
> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >
> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > bits of entropy = log2(128TB / 64K)
> > bits of entropy = 31
> >
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> >
> > Randomization occurs only once during initialization at boot for each
> > possible CPU in the system.
> >
> > Introduce two new functions, map_patch_mm() and unmap_patch_mm(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> >
> > Based on x86 implementation:
> >
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> >
> > and:
> >
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> >
> > Signed-off-by: Christopher M. Riedl 
> >
> > ---
> >
> > v6:  * Small clean-ups (naming, formatting, style, etc).
> >  * Call stop_using_temporary_mm() before pte_unmap_unlock() after
> >patching.
> >  * Replace BUG_ON()s in poking_init() w/ WARN_ON()s.
> >
> > v5:  * Only support Book3s64 Radix MMU for now.
> >  * Use a per-cpu datastructure to hold the patching_addr and
> >patching_mm to avoid the need for a synchronization lock/mutex.
> >
> > v4:  * In the previous series this was two separate patches: one to init
> >the temporary mm in poking_init() (unused in powerpc at the time)
> >and the other to use it for patching (which removed all the
> >per-cpu vmalloc code). Now that we use poking_init() in the
> >existing per-cpu vmalloc approach, that separation doesn't work
> >as nicely anymore so I just merged the two patches into one.
> >  * Preload the SLB entry and hash the page for the patching_addr
> >when using Hash on book3s64 to avoid taking an SLB and Hash fault
> >during patching. The previous implementation was a hack which
> >changed current->mm to allow the SLB and Hash fault handlers to
> >work with the temporary mm since both of those code-paths always
> >assume mm == current->mm.
> >  * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> >have to manage the mm->context.active_cpus counter and mm cpumask
> >since they determine (via mm_is_thread_local()) if the TLB flush
> >in pte_clear() is local or not - it should always be local when
> >we're using the temporary mm. On book3s64's Radix MMU we can
> >just call local_flush_tlb_mm().
> >  * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> >KUAP.
> > ---
> >  arch/powerpc/lib/code-patching.c | 119 +--
> >  1 file changed, 112 insertions(+), 7 deletions(-)
> >
> >

Re: [PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU

2021-09-15 Thread Christopher M. Riedl
On Sat Sep 11, 2021 at 3:26 AM CDT, Jordan Niethe wrote:
> On Sat, Sep 11, 2021 at 12:35 PM Christopher M. Riedl
>  wrote:
> >
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> >
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> >
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use - this may include breakpoints set by perf.
>
> I had thought CPUs with a DAWR might not need to do this because the
> privilege level that breakpoints trigger on can be configured. But it
> turns out in ptrace, etc we use HW_BRK_TYPE_PRIV_ALL.

Thanks for double checking :)

>
> >
> > Based on x86 implementation:
> >
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> >
> > Signed-off-by: Christopher M. Riedl 
> >
> > ---
> >
> > v6:  * Use {start,stop}_using_temporary_mm() instead of
> >{use,unuse}_temporary_mm() as suggested by Christophe.
> >
> > v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
> >
> > v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
> >using/unusing the temp mm as suggested by Jann Horn to keep
> >the context.active counter in-sync on mm/nohash.
> >  * Disable SLB preload in the temporary mm when initializing the
> >temp_mm struct.
> >  * Include asm/debug.h header to fix build issue with
> >ppc44x_defconfig.
> > ---
> >  arch/powerpc/include/asm/debug.h |  1 +
> >  arch/powerpc/kernel/process.c|  5 +++
> >  arch/powerpc/lib/code-patching.c | 56 
> >  3 files changed, 62 insertions(+)
> >
> > diff --git a/arch/powerpc/include/asm/debug.h 
> > b/arch/powerpc/include/asm/debug.h
> > index 86a14736c76c..dfd82635ea8b 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
> > *regs) { return 0; }
> >  #endif
> >
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >  bool ppc_breakpoint_available(void);
> >  #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >  extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 50436b52c213..6aa1f5c4d520 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct 
> > arch_hw_breakpoint *brk)
> > return 0;
> >  }
> >
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > +   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> > +}
>
> The breakpoint code is already a little hard to follow. I'm worried
> doing this might spread breakpoint handling into more places in the
> future.
> What about something like having a breakpoint_pause() function which
> clears the hardware registers only and then a breakpoint_resume()
> function that copies from current_brk[] back to the hardware
> registers?
> Then we don't have to make another copy of the breakpoint state.

I think that sounds reasonable - I'll add those functions instead with
the next spin.

>
> > +
> >  void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> >  {
> > memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index f9a3019e37b4..8d61a7d35b89 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
>
> Sorry I might have missed it, but what was the reason for not putting
> this stuff in

[PATCH v6 2/4] powerpc: Rework and improve STRICT_KERNEL_RWX patching

2021-09-10 Thread Christopher M. Riedl
Rework code-patching with STRICT_KERNEL_RWX to prepare for a later patch
which uses a temporary mm for patching under the Book3s64 Radix MMU.
Make improvements by adding a WARN_ON when the patchsite doesn't match
after patching and return the error from __patch_instruction() properly.

Signed-off-by: Christopher M. Riedl 

---

v6:  * Remove the pr_warn() message from unmap_patch_area().

v5:  * New to series.
---
 arch/powerpc/lib/code-patching.c | 35 
 1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d61a7d35b89..8d0bb86125d5 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -102,6 +102,7 @@ static inline void stop_using_temporary_mm(struct temp_mm 
*temp_mm)
 }
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
 
 static int text_area_cpu_up(unsigned int cpu)
 {
@@ -114,6 +115,7 @@ static int text_area_cpu_up(unsigned int cpu)
return -1;
}
this_cpu_write(text_poke_area, area);
+   this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
 
return 0;
 }
@@ -139,7 +141,7 @@ void __init poking_init(void)
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch_area(void *addr)
 {
unsigned long pfn;
int err;
@@ -149,17 +151,20 @@ static int map_patch_area(void *addr, unsigned long 
text_poke_addr)
else
pfn = __pa_symbol(addr) >> PAGE_SHIFT;
 
-   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+   err = map_kernel_page(__this_cpu_read(cpu_patching_addr),
+ (pfn << PAGE_SHIFT), PAGE_KERNEL);
 
-   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+   pr_devel("Mapped addr %lx with pfn %lx:%d\n",
+__this_cpu_read(cpu_patching_addr), pfn, err);
if (err)
return -1;
 
return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static inline int unmap_patch_area(void)
 {
+   unsigned long addr = __this_cpu_read(cpu_patching_addr);
pte_t *ptep;
pmd_t *pmdp;
pud_t *pudp;
@@ -199,11 +204,9 @@ static inline int unmap_patch_area(unsigned long addr)
 
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
-   int err;
+   int err, rc = 0;
u32 *patch_addr = NULL;
unsigned long flags;
-   unsigned long text_poke_addr;
-   unsigned long kaddr = (unsigned long)addr;
 
/*
 * During early early boot patch_instruction is called
@@ -215,24 +218,20 @@ static int do_patch_instruction(u32 *addr, struct 
ppc_inst instr)
 
local_irq_save(flags);
 
-   text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
-   if (map_patch_area(addr, text_poke_addr)) {
-   err = -1;
+   err = map_patch_area(addr);
+   if (err)
goto out;
-   }
 
-   patch_addr = (u32 *)(text_poke_addr + (kaddr & ~PAGE_MASK));
+   patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | 
offset_in_page(addr));
+   rc = __patch_instruction(addr, instr, patch_addr);
 
-   __patch_instruction(addr, instr, patch_addr);
-
-   err = unmap_patch_area(text_poke_addr);
-   if (err)
-   pr_warn("failed to unmap %lx\n", text_poke_addr);
+   err = unmap_patch_area();
 
 out:
local_irq_restore(flags);
+   WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
-   return err;
+   return rc ? rc : err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
 
-- 
2.32.0



[PATCH v6 3/4] powerpc: Use WARN_ON and fix check in poking_init

2021-09-10 Thread Christopher M. Riedl
The latest kernel docs list BUG_ON() as 'deprecated' and say it should
be replaced with WARN_ON() (or pr_warn()) when possible. The
BUG_ON() in poking_init() warrants a WARN_ON() rather than a pr_warn()
since the error condition is deemed "unreachable".

Also take this opportunity to fix the failure check in the WARN_ON():
cpuhp_setup_state(CPUHP_AP_ONLINE_DYN, ...) returns a positive integer
on success and a negative integer on failure.

Signed-off-by: Christopher M. Riedl 

---

v6:  * New to series - based on Christophe's relentless feedback in the
   crusade against BUG_ON()s :)
---
 arch/powerpc/lib/code-patching.c | 9 ++---
 1 file changed, 2 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 8d0bb86125d5..e802e42c2789 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -126,16 +126,11 @@ static int text_area_cpu_down(unsigned int cpu)
return 0;
 }
 
-/*
- * Although BUG_ON() is rude, in this case it should only happen if ENOMEM, and
- * we judge it as being preferable to a kernel that will crash later when
- * someone tries to use patch_instruction().
- */
 void __init poking_init(void)
 {
-   BUG_ON(!cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
+   WARN_ON(cpuhp_setup_state(CPUHP_AP_ONLINE_DYN,
"powerpc/text_poke:online", text_area_cpu_up,
-   text_area_cpu_down));
+   text_area_cpu_down) < 0);
 }
 
 /*
-- 
2.32.0



[PATCH v6 0/4] Use per-CPU temporary mappings for patching on Radix MMU

2021-09-10 Thread Christopher M. Riedl
When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
2. Patch text
3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either maliciously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching [0].
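
Condensed, that flow looks roughly like this today (a simplified sketch
of do_patch_instruction() in arch/powerpc/lib/code-patching.c; error
handling trimmed):

static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
{
	unsigned long text_poke_addr, flags;
	u32 *patch_addr;
	int err;

	local_irq_save(flags);

	/* 1. Map the target page into this CPU's VM area (kernel page tables) */
	text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
	err = map_patch_area(addr, text_poke_addr);
	if (err)
		goto out;

	/* 2. Patch text through the temporarily writable alias */
	patch_addr = (u32 *)(text_poke_addr + offset_in_page(addr));
	__patch_instruction(addr, instr, patch_addr);

	/* 3. Remove the temporary mapping (also flushes the TLB range) */
	unmap_patch_area(text_poke_addr);
out:
	local_irq_restore(flags);
	return err;
}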

x86 introduced "temporary mm" structs which allow the creation of mappings
local to a particular CPU [1]. This series intends to bring the notion of a
temporary mm to powerpc's Book3s64 Radix MMU and harden it by using such a
mapping for patching a kernel with strict RWX permissions.

Tested boot and ftrace:
- QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
- QEMU+KVM (host: POWER9 Blackbird): Hash MMU

Tested boot:
- QEMU+TCG: ppc44x (bamboo)
- QEMU+TCG: g5 (mac99)

I also tested with various extra config options enabled as suggested in
section 12) in Documentation/process/submit-checklist.rst.

v6: * Split the series to separate the powerpc percpu temporary mm
  implementation from the LKDTM test (still working on that one) and
  implement some of Christophe Leroy's feedback.
* Rebase on linuxppc/next: powerpc-5.15-1

v5: * Only support Book3s64 Radix MMU for now. There are some issues with
  the previous implementation on the Hash MMU as pointed out by Nick
  Piggin. Fixing these is not trivial so we only support the Radix MMU
  for now.

v4: * It's time to revisit this series again since @jpn and @mpe fixed
  our known STRICT_*_RWX bugs on powerpc/64s.
* Rebase on linuxppc/next:
  commit ee1bc694fbaec ("powerpc/kvm: Fix build error when 
PPC_MEM_KEYS/PPC_PSERIES=n")
* Completely rework how map_patch() works on book3s64 Hash MMU
* Split the LKDTM x86_64 and powerpc bits into separate patches
* Annotate commit messages with changes from v3 instead of
  listing them here completely out-of context...

v3: * Rebase on linuxppc/next: commit 9123e3a74ec7 ("Linux 5.9-rc1")
* Move temporary mm implementation into code-patching.c where it
  belongs
* Implement LKDTM hijacker test on x86_64 (on IBM time oof)
* Do not use address zero for the patching address in the
  temporary mm (thanks @dja for pointing this out!)
* Wrap the LKDTM test w/ CONFIG_SMP as suggested by Christophe
  Leroy
* Comments to clarify PTE pre-allocation and patching addr
  selection

v2: * Rebase on linuxppc/next:
  commit 105fb38124a4 ("powerpc/8xx: Modify ptep_get()")
* Always dirty pte when mapping patch
* Use `ppc_inst_len` instead of `sizeof` on instructions
* Declare LKDTM patching addr accessor in header where it belongs   

v1: * Rebase on linuxppc/next (4336b9337824)
* Save and restore second hw watchpoint
* Use new ppc_inst_* functions for patching check and in LKDTM test

rfc-v2: * Many fixes and improvements mostly based on extensive feedback
  and testing by Christophe Leroy (thanks!).
* Make patching_mm and patching_addr static and move
  '__ro_after_init' to after the variable name (more common in
  other parts of the kernel)
* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
  fix PPC64e compile
* Add comment explaining why we use BUG_ON() during the init
  call to setup for patching later
* Move ptep into patch_mapping to avoid walking page tables a
  second time when unmapping the temporary mapping
* Use KUAP under non-radix, also manually dirty the PTE for patch
  mapping on non-BOOK3S_64 platforms
* Properly return any error from __patch_instruction
* Do not use 'memcmp' where a simple comparison is appropriate
* Simplify expression for patch address by removing pointer maths
* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: 
https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Christopher M. Riedl (4):
  powerpc/64s: Introduce temporary mm for Radix MMU
  powerpc: Rework and improve STRICT_KERNEL_RWX patching
  powerpc: Use WARN_ON and fix check in poking_init
  powerpc/64s: Initialize and use a temporary mm for patching on Radix

 arch/powerpc/include/asm/debug.h |   1 +
 arch/powerpc/kernel/process.c|

[PATCH v6 1/4] powerpc/64s: Introduce temporary mm for Radix MMU

2021-09-10 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use - this may include breakpoints set by perf.

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 

---

v6:  * Use {start,stop}_using_temporary_mm() instead of
   {use,unuse}_temporary_mm() as suggested by Christophe.

v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
   using/unusing the temp mm as suggested by Jann Horn to keep
   the context.active counter in-sync on mm/nohash.
 * Disable SLB preload in the temporary mm when initializing the
   temp_mm struct.
 * Include asm/debug.h header to fix build issue with
   ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c|  5 +++
 arch/powerpc/lib/code-patching.c | 56 
 3 files changed, 62 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c..dfd82635ea8b 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
*regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 50436b52c213..6aa1f5c4d520 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct 
arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
 memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index f9a3019e37b4..8d61a7d35b89 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 
*patch_addr)
 {
@@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   /* We currently only support temporary mm on the Book3s64 Radix MMU */
+   WARN_ON(!radix_enabled());
+
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void start_using_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->prev = current->active_mm;
+   switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+   WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+   if (ppc_breakpoint_available()) {
+   struct arch_hw_breakpoint null_brk = {0};
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i) {
+   __get_breakpoint(i, &temp_mm->brk[i]);
+   if (temp_mm->brk[i].type != 0)
+   __set_breakpoint(i, &null_brk);
+   }
+   }
+}
+
+static inline void stop_using_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
+
+   if (ppc_breakpoint_available()) {
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i)
+   if (temp_mm->brk[i].type != 

Re: [PATCH v5 7/8] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:48 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> >  bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> >  PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >  bits of entropy = log2(128TB / 64K)
> >  bits of entropy = 31
> > 
> > The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
> > operates - by default the space above DEFAULT_MAP_WINDOW is not
> > available. Currently the Hash MMU does not use a temporary mm so
> > technically this upper limit isn't necessary; however, a larger
> > randomization range does not further "harden" this overall approach and
> > future work may introduce patching with a temporary mm on Hash as well.
> > 
> > Randomization occurs only once during initialization at boot for each
> > possible CPU in the system.
> > 
> > Introduce two new functions, map_patch() and unmap_patch(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. Map the page with PAGE_KERNEL to set
> > EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
> > KUAP) according to PowerISA v3.0b Figure 35 on Radix.
> > 
> > Based on x86 implementation:
> > 
> > commit 4fc19708b165
> > ("x86/alternatives: Initialize temporary mm for patching")
> > 
> > and:
> > 
> > commit b3fd8e83ada0
> > ("x86/alternatives: Use temporary mm for text poking")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > 
> > ---
> > 
> > v5:  * Only support Book3s64 Radix MMU for now.
> >   * Use a per-cpu datastructure to hold the patching_addr and
> > patching_mm to avoid the need for a synchronization lock/mutex.
> > 
> > v4:  * In the previous series this was two separate patches: one to init
> > the temporary mm in poking_init() (unused in powerpc at the time)
> > and the other to use it for patching (which removed all the
> > per-cpu vmalloc code). Now that we use poking_init() in the
> > existing per-cpu vmalloc approach, that separation doesn't work
> > as nicely anymore so I just merged the two patches into one.
> >   * Preload the SLB entry and hash the page for the patching_addr
> > when using Hash on book3s64 to avoid taking an SLB and Hash fault
> > during patching. The previous implementation was a hack which
> > changed current->mm to allow the SLB and Hash fault handlers to
> > work with the temporary mm since both of those code-paths always
> > assume mm == current->mm.
> >   * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
> > have to manage the mm->context.active_cpus counter and mm cpumask
> > since they determine (via mm_is_thread_local()) if the TLB flush
> > in pte_clear() is local or not - it should always be local when
> > we're using the temporary mm. On book3s64's Radix MMU we can
> > just call local_flush_tlb_mm().
> >   * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
> > KUAP.
> > ---
> >   arch/powerpc/lib/code-patching.c | 132 +--
> >   1 file changed, 125 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 9f2eba9b70ee4..027dabd42b8dd 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-

Re: [PATCH v5 6/8] powerpc: Rework and improve STRICT_KERNEL_RWX patching

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:34 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > Rework code-patching with STRICT_KERNEL_RWX to prepare for the next
> > patch which uses a temporary mm for patching under the Book3s64 Radix
> > MMU. Make improvements by adding a WARN_ON when the patchsite doesn't
> > match after patching and return the error from __patch_instruction()
> > properly.
> > 
> > Signed-off-by: Christopher M. Riedl 
> > 
> > ---
> > 
> > v5:  * New to series.
> > ---
> >   arch/powerpc/lib/code-patching.c | 51 +---
> >   1 file changed, 27 insertions(+), 24 deletions(-)
> > 
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 3122d8e4cc013..9f2eba9b70ee4 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -102,11 +102,12 @@ static inline void unuse_temporary_mm(struct temp_mm 
> > *temp_mm)
> >   }
> >   
> >   static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
> > +static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
> >   
> >   #if IS_BUILTIN(CONFIG_LKDTM)
> >   unsigned long read_cpu_patching_addr(unsigned int cpu)
> >   {
> > -   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
> > +   return per_cpu(cpu_patching_addr, cpu);
> >   }
> >   #endif
> >   
> > @@ -121,6 +122,7 @@ static int text_area_cpu_up(unsigned int cpu)
> > return -1;
> > }
> > this_cpu_write(text_poke_area, area);
> > +   this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
> >   
> > return 0;
> >   }
> > @@ -146,7 +148,7 @@ void __init poking_init(void)
> >   /*
> >* This can be called for kernel text or a module.
> >*/
> > -static int map_patch_area(void *addr, unsigned long text_poke_addr)
> > +static int map_patch_area(void *addr)
> >   {
> > unsigned long pfn;
> > int err;
> > @@ -156,17 +158,20 @@ static int map_patch_area(void *addr, unsigned long 
> > text_poke_addr)
> > else
> > pfn = __pa_symbol(addr) >> PAGE_SHIFT;
> >   
> > -   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
> > +   err = map_kernel_page(__this_cpu_read(cpu_patching_addr),
> > + (pfn << PAGE_SHIFT), PAGE_KERNEL);
> >   
> > -   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
> > +   pr_devel("Mapped addr %lx with pfn %lx:%d\n",
> > +__this_cpu_read(cpu_patching_addr), pfn, err);
> > if (err)
> > return -1;
> >   
> > return 0;
> >   }
> >   
> > -static inline int unmap_patch_area(unsigned long addr)
> > +static inline int unmap_patch_area(void)
> >   {
> > +   unsigned long addr = __this_cpu_read(cpu_patching_addr);
> > pte_t *ptep;
> > pmd_t *pmdp;
> > pud_t *pudp;
> > @@ -175,23 +180,23 @@ static inline int unmap_patch_area(unsigned long addr)
> >   
> > pgdp = pgd_offset_k(addr);
> > if (unlikely(!pgdp))
> > -   return -EINVAL;
> > +   goto out_err;
> >   
> > p4dp = p4d_offset(pgdp, addr);
> > if (unlikely(!p4dp))
> > -   return -EINVAL;
> > +   goto out_err;
> >   
> > pudp = pud_offset(p4dp, addr);
> > if (unlikely(!pudp))
> > -   return -EINVAL;
> > +   goto out_err;
> >   
> > pmdp = pmd_offset(pudp, addr);
> > if (unlikely(!pmdp))
> > -   return -EINVAL;
> > +   goto out_err;
> >   
> > ptep = pte_offset_kernel(pmdp, addr);
> > if (unlikely(!ptep))
> > -   return -EINVAL;
> > +   goto out_err;
> >   
> > pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
> >   
> > @@ -202,15 +207,17 @@ static inline int unmap_patch_area(unsigned long addr)
> > flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
> >   
> > return 0;
> > +
> > +out_err:
> > +   pr_warn("failed to unmap %lx\n", addr);
> > +   return -EINVAL;
>
> Can you keep that in the caller of unmap_patch_area() instead of all
> those goto stuff ?
>

Yeah I think that's fair. I'll do this in the next spin.

> >   }
> >   
> >   static int do_patch_instruction(u32 *addr, struct 

Re: [PATCH v5 5/8] powerpc/64s: Introduce temporary mm for Radix MMU

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:27 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > x86 supports the notion of a temporary mm which restricts access to
> > temporary PTEs to a single CPU. A temporary mm is useful for situations
> > where a CPU needs to perform sensitive operations (such as patching a
> > STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
> > said mappings to other CPUs. Another benefit is that other CPU TLBs do
> > not need to be flushed when the temporary mm is torn down.
> > 
> > Mappings in the temporary mm can be set in the userspace portion of the
> > address-space.
> > 
> > Interrupts must be disabled while the temporary mm is in use. HW
> > breakpoints, which may have been set by userspace as watchpoints on
> > addresses now within the temporary mm, are saved and disabled when
> > loading the temporary mm. The HW breakpoints are restored when unloading
> > the temporary mm. All HW breakpoints are indiscriminately disabled while
> > the temporary mm is in use.
>
> Can you explain more about that breakpoint stuff? Why is it a special
> case here at all? Isn't it the same when you switch from one user task
> to another one? The x86 commit doesn't say anything about breakpoints.
>

We do not check if the breakpoint is on a kernel address (perf can do
this IIUC) and just disable all of them. I had to dig, but x86 has a
comment with their implementation at arch/x86/kernel/alternative.c:743.

I can reword that part of the commit message if it's unclear.
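
For reference, the save-and-disable loop in this series looks like this
(condensed from the start_using_temporary_mm() hunk in this patch):

	if (ppc_breakpoint_available()) {
		struct arch_hw_breakpoint null_brk = {0};
		int i;

		/* Save every slot, then clear any slot that is armed */
		for (i = 0; i < nr_wp_slots(); ++i) {
			__get_breakpoint(i, &temp_mm->brk[i]);
			if (temp_mm->brk[i].type != 0)
				__set_breakpoint(i, &null_brk);
		}
	}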

> > 
> > Based on x86 implementation:
> > 
> > commit cefa929c034e
> > ("x86/mm: Introduce temporary mm structs")
> > 
> > Signed-off-by: Christopher M. Riedl 
> > 
> > ---
> > 
> > v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.
> > 
> > v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
> > using/unusing the temp mm as suggested by Jann Horn to keep
> > the context.active counter in-sync on mm/nohash.
> >   * Disable SLB preload in the temporary mm when initializing the
> > temp_mm struct.
> >   * Include asm/debug.h header to fix build issue with
> > ppc44x_defconfig.
> > ---
> >   arch/powerpc/include/asm/debug.h |  1 +
> >   arch/powerpc/kernel/process.c|  5 +++
> >   arch/powerpc/lib/code-patching.c | 56 
> >   3 files changed, 62 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/debug.h 
> > b/arch/powerpc/include/asm/debug.h
> > index 86a14736c76c3..dfd82635ea8b3 100644
> > --- a/arch/powerpc/include/asm/debug.h
> > +++ b/arch/powerpc/include/asm/debug.h
> > @@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
> > *regs) { return 0; }
> >   #endif
> >   
> >   void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
> >   bool ppc_breakpoint_available(void);
> >   #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> >   extern void do_send_trap(struct pt_regs *regs, unsigned long address,
> > diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
> > index 185beb2905801..a0776200772e8 100644
> > --- a/arch/powerpc/kernel/process.c
> > +++ b/arch/powerpc/kernel/process.c
> > @@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct 
> > arch_hw_breakpoint *brk)
> > return 0;
> >   }
> >   
> > +void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> > +{
> > +   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
> > +}
> > +
> >   void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
> >   {
> > memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
> > diff --git a/arch/powerpc/lib/code-patching.c 
> > b/arch/powerpc/lib/code-patching.c
> > index 54b6157d44e95..3122d8e4cc013 100644
> > --- a/arch/powerpc/lib/code-patching.c
> > +++ b/arch/powerpc/lib/code-patching.c
> > @@ -17,6 +17,9 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> > +#include 
> > +#include 
> >   
> >   static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 
> > *patch_addr)
> >   {
> > @@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst 
> > instr)
> >   }
> >   
> >   #ifdef CONFIG_STRICT_KERNEL_RWX
> > +
> > +struct temp_mm {
> > +   struct mm_struct *temp;
> > +  

Re: [PATCH v5 8/8] lkdtm/powerpc: Fix code patching hijack test

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:18 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > Code patching on powerpc with a STRICT_KERNEL_RWX uses a userspace
> > address in a temporary mm on Radix now. Use __put_user() to avoid write
> > failures due to KUAP when attempting a "hijack" on the patching address.
> > __put_user() also works with the non-userspace, vmalloc-based patching
> > address on non-Radix MMUs.
>
> It is not really clean to use __put_user() on a non-user address,
> although it works by chance.
>
> I think it would be better to do something like
>
> if (is_kernel_addr(addr))
> copy_to_kernel_nofault(...);
> else
> copy_to_user_nofault(...);
>

Yes that looks much better. I'll pick this up and try it for the next
spin. Thanks!
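
Roughly what I have in mind for the next spin (just a sketch based on
your suggestion, not the final patch):

static inline bool lkdtm_try_write(u32 data, u32 *addr)
{
	/* Pick the nofault helper based on which address space addr lives in */
	if (is_kernel_addr((unsigned long)addr))
		return !copy_to_kernel_nofault(addr, &data, sizeof(data));

	return !copy_to_user_nofault((void __user *)addr, &data, sizeof(data));
}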

>
>
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   drivers/misc/lkdtm/perms.c | 9 -
> >   1 file changed, 9 deletions(-)
> > 
> > diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
> > index 41e87e5f9cc86..da6a34a0a49fb 100644
> > --- a/drivers/misc/lkdtm/perms.c
> > +++ b/drivers/misc/lkdtm/perms.c
> > @@ -262,16 +262,7 @@ static inline u32 lkdtm_read_patch_site(void)
> >   /* Returns True if the write succeeds */
> >   static inline bool lkdtm_try_write(u32 data, u32 *addr)
> >   {
> > -#ifdef CONFIG_PPC
> > -   __put_kernel_nofault(addr, &data, u32, err);
> > -   return true;
> > -
> > -err:
> > -   return false;
> > -#endif
> > -#ifdef CONFIG_X86_64
> > return !__put_user(data, addr);
> > -#endif
> >   }
> >   
> >   static int lkdtm_patching_cpu(void *data)
> > 



Re: [PATCH v5 2/8] lkdtm/powerpc: Add test to hijack a patch mapping

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:13 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > When live patching with STRICT_KERNEL_RWX the CPU doing the patching
> > must temporarily remap the page(s) containing the patch site with +W
> > permissions. While this temporary mapping is in use, another CPU could
> > write to the same mapping and maliciously alter kernel text. Implement a
> > LKDTM test to attempt to exploit such an opening during code patching.
> > The test is implemented on powerpc and requires LKDTM built into the
> > kernel (building LKDTM as a module is insufficient).
> > 
> > The LKDTM "hijack" test works as follows:
> > 
> >1. A CPU executes an infinite loop to patch an instruction. This is
> >   the "patching" CPU.
> >2. Another CPU attempts to write to the address of the temporary
> >   mapping used by the "patching" CPU. This other CPU is the
> >   "hijacker" CPU. The hijack either fails with a fault/error or
> >   succeeds, in which case some kernel text is now overwritten.
> > 
> > The virtual address of the temporary patch mapping is provided via an
> > LKDTM-specific accessor to the hijacker CPU. This test assumes a
> > hypothetical situation where this address was leaked previously.
> > 
> > How to run the test:
> > 
> > mount -t debugfs none /sys/kernel/debug
> > (echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)
> > 
> > A passing test indicates that it is not possible to overwrite kernel
> > text from another CPU by using the temporary mapping established by
> > a CPU for patching.
> > 
> > Signed-off-by: Christopher M. Riedl 
> > 
> > ---
> > 
> > v5:  * Use `u32*` instead of `struct ppc_inst*` based on new series in
> > upstream.
> > 
> > v4:  * Separate the powerpc and x86_64 bits into individual patches.
> >   * Use __put_kernel_nofault() when attempting to hijack the mapping
> >   * Use raw_smp_processor_id() to avoid triggering the BUG() when
> > calling smp_processor_id() in preemptible code - the only thing
> > that matters is that one of the threads is bound to a different
> > CPU - we are not using smp_processor_id() to access any per-cpu
> > data or similar where preemption should be disabled.
> >   * Rework the patching_cpu() kthread stop condition to avoid:
> > https://lwn.net/Articles/628628/
> > ---
> >   drivers/misc/lkdtm/core.c  |   1 +
> >   drivers/misc/lkdtm/lkdtm.h |   1 +
> >   drivers/misc/lkdtm/perms.c | 134 +
> >   3 files changed, 136 insertions(+)
> > 
> > diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
> > index 8024b6a5cc7fc..fbcb95eda337b 100644
> > --- a/drivers/misc/lkdtm/core.c
> > +++ b/drivers/misc/lkdtm/core.c
> > @@ -147,6 +147,7 @@ static const struct crashtype crashtypes[] = {
> > CRASHTYPE(WRITE_RO),
> > CRASHTYPE(WRITE_RO_AFTER_INIT),
> > CRASHTYPE(WRITE_KERN),
> > +   CRASHTYPE(HIJACK_PATCH),
> > CRASHTYPE(REFCOUNT_INC_OVERFLOW),
> > CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
> > CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
> > diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
> > index 99f90d3e5e9cb..87e7e6136d962 100644
> > --- a/drivers/misc/lkdtm/lkdtm.h
> > +++ b/drivers/misc/lkdtm/lkdtm.h
> > @@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
> >   void lkdtm_EXEC_NULL(void);
> >   void lkdtm_ACCESS_USERSPACE(void);
> >   void lkdtm_ACCESS_NULL(void);
> > +void lkdtm_HIJACK_PATCH(void);
> >   
> >   /* refcount.c */
> >   void lkdtm_REFCOUNT_INC_OVERFLOW(void);
> > diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
> > index 2dede2ef658f3..39e7456852229 100644
> > --- a/drivers/misc/lkdtm/perms.c
> > +++ b/drivers/misc/lkdtm/perms.c
> > @@ -9,6 +9,7 @@
> >   #include 
> >   #include 
> >   #include 
> > +#include 
> >   #include 
> >   
> >   /* Whether or not to fill the target memory area with do_nothing(). */
> > @@ -222,6 +223,139 @@ void lkdtm_ACCESS_NULL(void)
> > pr_err("FAIL: survived bad write\n");
> >   }
> >   
> > +#if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
> > +   defined(CONFIG_PPC))
>
>
> I think this test shouldn't be limited to CONFIG_PPC and shouldn't be
> limited to CONFIG_STRICT_KERNEL

Re: [PATCH v5 4/8] lkdtm/x86_64: Add test to hijack a patch mapping

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:09 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > A previous commit implemented an LKDTM test on powerpc to exploit the
> > temporary mapping established when patching code with STRICT_KERNEL_RWX
> > enabled. Extend the test to work on x86_64 as well.
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   drivers/misc/lkdtm/perms.c | 26 ++
> >   1 file changed, 22 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
> > index 39e7456852229..41e87e5f9cc86 100644
> > --- a/drivers/misc/lkdtm/perms.c
> > +++ b/drivers/misc/lkdtm/perms.c
> > @@ -224,7 +224,7 @@ void lkdtm_ACCESS_NULL(void)
> >   }
> >   
> >   #if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
> > -   defined(CONFIG_PPC))
> > +   (defined(CONFIG_PPC) || defined(CONFIG_X86_64)))
> >   /*
> >* This is just a dummy location to patch-over.
> >*/
> > @@ -233,12 +233,25 @@ static void patching_target(void)
> > return;
> >   }
> >   
> > -#include 
> >   const u32 *patch_site = (const u32 *)&patching_target;
> >   
> > +#ifdef CONFIG_PPC
> > +#include 
> > +#endif
> > +
> > +#ifdef CONFIG_X86_64
> > +#include 
> > +#endif
> > +
> >   static inline int lkdtm_do_patch(u32 data)
> >   {
> > +#ifdef CONFIG_PPC
> > return patch_instruction((u32 *)patch_site, ppc_inst(data));
> > +#endif
> > +#ifdef CONFIG_X86_64
> > +   text_poke((void *)patch_site, &data, sizeof(u32));
> > +   return 0;
> > +#endif
> >   }
> >   
> >   static inline u32 lkdtm_read_patch_site(void)
> > @@ -249,11 +262,16 @@ static inline u32 lkdtm_read_patch_site(void)
> >   /* Returns True if the write succeeds */
> >   static inline bool lkdtm_try_write(u32 data, u32 *addr)
> >   {
> > +#ifdef CONFIG_PPC
> > __put_kernel_nofault(addr, &data, u32, err);
> > return true;
> >   
> >   err:
> > return false;
> > +#endif
> > +#ifdef CONFIG_X86_64
> > +   return !__put_user(data, addr);
> > +#endif
> >   }
> >   
> >   static int lkdtm_patching_cpu(void *data)
> > @@ -346,8 +364,8 @@ void lkdtm_HIJACK_PATCH(void)
> >   
> >   void lkdtm_HIJACK_PATCH(void)
> >   {
> > -   if (!IS_ENABLED(CONFIG_PPC))
> > -   pr_err("XFAIL: this test only runs on powerpc\n");
> > +   if (!IS_ENABLED(CONFIG_PPC) && !IS_ENABLED(CONFIG_X86_64))
> > +   pr_err("XFAIL: this test only runs on powerpc and x86_64\n");
> > if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
> > pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
> > if (!IS_BUILTIN(CONFIG_LKDTM))
> > 
>
> Instead of spreading arch-specific stuff into LKDTM, wouldn't it make
> sense to define a common API? Because the day another arch like arm64
> implements its own approach, do we add specific functions again and
> again into LKDTM?

Hmm a common patch/poke kernel API is probably out of scope for this
series? I do agree though - since you suggested splitting the series
maybe that's something I can add along with the LKDTM patches.

>
> Also, I find it odd to define tests only when they can succeed. Other
> tests like ACCESS_USERSPACE are there all the time, regardless of
> whether we have selected CONFIG_PPC_KUAP or not. I think it should be
> the same here: have it there all the time; if CONFIG_STRICT_KERNEL_RWX
> is selected the test succeeds, otherwise it fails, but it is always
> there.

I followed the approach in lkdtm_DOUBLE_FAULT and others in
drivers/misc/lkdtm/bugs.c. I suppose it doesn't hurt to always build the
test irrespective of CONFIG_STRICT_KERNEL_RWX.

>
> Christophe



Re: [PATCH v5 0/8] Use per-CPU temporary mappings for patching on Radix MMU

2021-08-11 Thread Christopher M. Riedl
On Thu Aug 5, 2021 at 4:03 AM CDT, Christophe Leroy wrote:
>
>
> Le 13/07/2021 à 07:31, Christopher M. Riedl a écrit :
> > When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
> > temporary mappings when patching itself. These mappings temporarily
> > override the strict RWX text protections to permit a write. Currently,
> > powerpc allocates a per-CPU VM area for patching. Patching occurs as
> > follows:
> > 
> > 1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
> > 2. Patch text
> > 3. Remove the temporary mapping
> > 
> > While the VM area is per-CPU, the mapping is actually inserted into the
> > kernel page tables. Presumably, this could allow another CPU to access
> > the normally write-protected text - either maliciously or accidentally -
> > via this same mapping if the address of the VM area is known. Ideally,
> > the mapping should be kept local to the CPU doing the patching [0].
> > 
> > x86 introduced "temporary mm" structs which allow the creation of mappings
> > local to a particular CPU [1]. This series intends to bring the notion of a
> > temporary mm to powerpc's Book3s64 Radix MMU and harden it by using such a
> > mapping for patching a kernel with strict RWX permissions.
> > 
> > The first four patches implement an LKDTM test "proof-of-concept" which
> > exploits the potential vulnerability (ie. the temporary mapping during 
> > patching
> > is exposed in the kernel page tables and accessible by other CPUs) using a
> > simple brute-force approach. This test is implemented for both powerpc and
> > x86_64. The test passes on powerpc Radix with this new series, fails on
> > upstream powerpc, passes on upstream x86_64, and fails on an older (ancient)
> > x86_64 tree without the x86_64 temporary mm patches. The remaining patches 
> > add
> > support for and use a temporary mm for code patching on powerpc with the 
> > Radix
> > MMU.
>
> I think the first four patches (together with the last one) are quite
> independent from the heart of the series itself, which is patches 5, 6,
> and 7. Maybe you should split that series into two series? After all,
> those selftests are nice to have but are not absolutely necessary; that
> would help getting forward I think.
>

Hmm you're probably right. The selftest at least proves there is a
potential attack which I think is necessary for any hardening related
series/patch. I'll split the series into separate powerpc temp mm and
LKDTM series for the next spin.

> > 
> > Tested boot, ftrace, and repeated LKDTM "hijack":
> > - QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
> > - QEMU+KVM (host: POWER9 Blackbird): Hash MMU
> > 
> > Tested repeated LKDTM "hijack":
> > - QEMU+KVM (host: AMD desktop): x86_64 upstream
> > - QEMU+KVM (host: AMD desktop): x86_64 w/o percpu temp mm to
> >   verify the LKDTM "hijack" test fails
> > 
> > Tested boot and ftrace:
> > - QEMU+TCG: ppc44x (bamboo)
> > - QEMU+TCG: g5 (mac99)
> > 
> > I also tested with various extra config options enabled as suggested in
> > section 12) in Documentation/process/submit-checklist.rst.
> > 
> > v5: * Only support Book3s64 Radix MMU for now. There are some issues with
> >   the previous implementation on the Hash MMU as pointed out by Nick
> >   Piggin. Fixing these is not trivial so we only support the Radix MMU
> >   for now. I tried using realmode (no data translation) to patch with
> >   Hash to at least avoid exposing the patch mapping to other CPUs but
> >   this doesn't reliably work either since we cannot access vmalloc'd
> >   space in realmode.
>
> So you now accept to have two different mode depending on the platform ?

By necessity yes.

> As far as I remember, I commented some time ago that non-SMP didn't
> need that feature and you were reluctant to have two different
> implementations. What made you change your mind? (just curious)
>

The book3s64 hash mmu support is a pain ;) Supporting both the temp-mm
and vmalloc implementations turned out to be relatively simple - I
initially thought this would be messier. For now we will support both;
however, in the future I'd still like to implement the percpu temp-mm
support for the Hash MMU as well. I suppose we could re-evaluate then if
we want/need both implementations (I know you're in favor of keeping the
vmalloc-based approach for performance reasons on non-SMP).
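
To make that concrete, the run-time selection can be pictured roughly as
follows - a sketch only, where patch_with_temp_mm() and
patch_with_vmalloc_area() are hypothetical stand-ins for the two code
paths, not functions from this series:

static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
{
	/* Radix keeps the writable alias CPU-local via the temporary mm */
	if (radix_enabled())
		return patch_with_temp_mm(addr, instr);		/* hypothetical */

	/* Hash (and everything else) keeps the per-cpu vmalloc'd area */
	return patch_with_vmalloc_area(addr, instr);		/* hypothetical */
}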

>
> > * Use percpu variables for the patching_mm and patching_addr. This
> >   av

[PATCH v5 6/8] powerpc: Rework and improve STRICT_KERNEL_RWX patching

2021-07-12 Thread Christopher M. Riedl
Rework code-patching with STRICT_KERNEL_RWX to prepare for the next
patch which uses a temporary mm for patching under the Book3s64 Radix
MMU. Make improvements by adding a WARN_ON when the patchsite doesn't
match after patching and return the error from __patch_instruction()
properly.

Signed-off-by: Christopher M. Riedl 

---

v5:  * New to series.
---
 arch/powerpc/lib/code-patching.c | 51 +---
 1 file changed, 27 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 3122d8e4cc013..9f2eba9b70ee4 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -102,11 +102,12 @@ static inline void unuse_temporary_mm(struct temp_mm 
*temp_mm)
 }
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
 
 #if IS_BUILTIN(CONFIG_LKDTM)
 unsigned long read_cpu_patching_addr(unsigned int cpu)
 {
-   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+   return per_cpu(cpu_patching_addr, cpu);
 }
 #endif
 
@@ -121,6 +122,7 @@ static int text_area_cpu_up(unsigned int cpu)
return -1;
}
this_cpu_write(text_poke_area, area);
+   this_cpu_write(cpu_patching_addr, (unsigned long)area->addr);
 
return 0;
 }
@@ -146,7 +148,7 @@ void __init poking_init(void)
 /*
  * This can be called for kernel text or a module.
  */
-static int map_patch_area(void *addr, unsigned long text_poke_addr)
+static int map_patch_area(void *addr)
 {
unsigned long pfn;
int err;
@@ -156,17 +158,20 @@ static int map_patch_area(void *addr, unsigned long 
text_poke_addr)
else
pfn = __pa_symbol(addr) >> PAGE_SHIFT;
 
-   err = map_kernel_page(text_poke_addr, (pfn << PAGE_SHIFT), PAGE_KERNEL);
+   err = map_kernel_page(__this_cpu_read(cpu_patching_addr),
+ (pfn << PAGE_SHIFT), PAGE_KERNEL);
 
-   pr_devel("Mapped addr %lx with pfn %lx:%d\n", text_poke_addr, pfn, err);
+   pr_devel("Mapped addr %lx with pfn %lx:%d\n",
+__this_cpu_read(cpu_patching_addr), pfn, err);
if (err)
return -1;
 
return 0;
 }
 
-static inline int unmap_patch_area(unsigned long addr)
+static inline int unmap_patch_area(void)
 {
+   unsigned long addr = __this_cpu_read(cpu_patching_addr);
pte_t *ptep;
pmd_t *pmdp;
pud_t *pudp;
@@ -175,23 +180,23 @@ static inline int unmap_patch_area(unsigned long addr)
 
pgdp = pgd_offset_k(addr);
if (unlikely(!pgdp))
-   return -EINVAL;
+   goto out_err;
 
p4dp = p4d_offset(pgdp, addr);
if (unlikely(!p4dp))
-   return -EINVAL;
+   goto out_err;
 
pudp = pud_offset(p4dp, addr);
if (unlikely(!pudp))
-   return -EINVAL;
+   goto out_err;
 
pmdp = pmd_offset(pudp, addr);
if (unlikely(!pmdp))
-   return -EINVAL;
+   goto out_err;
 
ptep = pte_offset_kernel(pmdp, addr);
if (unlikely(!ptep))
-   return -EINVAL;
+   goto out_err;
 
pr_devel("clearing mm %p, pte %p, addr %lx\n", &init_mm, ptep, addr);
 
@@ -202,15 +207,17 @@ static inline int unmap_patch_area(unsigned long addr)
flush_tlb_kernel_range(addr, addr + PAGE_SIZE);
 
return 0;
+
+out_err:
+   pr_warn("failed to unmap %lx\n", addr);
+   return -EINVAL;
 }
 
 static int do_patch_instruction(u32 *addr, struct ppc_inst instr)
 {
-   int err;
+   int err, rc = 0;
u32 *patch_addr = NULL;
unsigned long flags;
-   unsigned long text_poke_addr;
-   unsigned long kaddr = (unsigned long)addr;
 
/*
 * During early early boot patch_instruction is called
@@ -222,24 +229,20 @@ static int do_patch_instruction(u32 *addr, struct 
ppc_inst instr)
 
local_irq_save(flags);
 
-   text_poke_addr = (unsigned long)__this_cpu_read(text_poke_area)->addr;
-   if (map_patch_area(addr, text_poke_addr)) {
-   err = -1;
+   err = map_patch_area(addr);
+   if (err)
goto out;
-   }
-
-   patch_addr = (u32 *)(text_poke_addr + (kaddr & ~PAGE_MASK));
 
-   __patch_instruction(addr, instr, patch_addr);
+   patch_addr = (u32 *)(__this_cpu_read(cpu_patching_addr) | 
offset_in_page(addr));
+   rc = __patch_instruction(addr, instr, patch_addr);
 
-   err = unmap_patch_area(text_poke_addr);
-   if (err)
-   pr_warn("failed to unmap %lx\n", text_poke_addr);
+   err = unmap_patch_area();
 
 out:
local_irq_restore(flags);
+   WARN_ON(!ppc_inst_equal(ppc_inst_read(addr), instr));
 
-   return err;
+   return rc ? rc : err;
 }
 #else /* !CONFIG_STRICT_KERNEL_RWX */
 
-- 
2.26.1



[PATCH v5 5/8] powerpc/64s: Introduce temporary mm for Radix MMU

2021-07-12 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. Another benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 

---

v5:  * Drop support for using a temporary mm on Book3s64 Hash MMU.

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
   using/unusing the temp mm as suggested by Jann Horn to keep
   the context.active counter in-sync on mm/nohash.
 * Disable SLB preload in the temporary mm when initializing the
   temp_mm struct.
 * Include asm/debug.h header to fix build issue with
   ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c|  5 +++
 arch/powerpc/lib/code-patching.c | 56 
 3 files changed, 62 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c3..dfd82635ea8b3 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
*regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 185beb2905801..a0776200772e8 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -865,6 +865,11 @@ static inline int set_breakpoint_8xx(struct 
arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 54b6157d44e95..3122d8e4cc013 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,9 @@
 #include 
 #include 
 #include 
+#include 
+#include 
+#include 
 
 static int __patch_instruction(u32 *exec_addr, struct ppc_inst instr, u32 
*patch_addr)
 {
@@ -45,6 +48,59 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   /* We currently only support temporary mm on the Book3s64 Radix MMU */
+   WARN_ON(!radix_enabled());
+
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->prev = current->active_mm;
+   switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+   WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+   if (ppc_breakpoint_available()) {
+   struct arch_hw_breakpoint null_brk = {0};
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i) {
+   __get_breakpoint(i, &temp_mm->brk[i]);
+   if (temp_mm->brk[i].type != 0)
+   __set_breakpoint(i, &null_brk);
+   }
+   }
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
+
+   if (ppc_breakpoint_available()) {
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i)
+   if (temp_mm->brk[i].type != 0)
+   __set_breakpoint(i, &temp_mm->brk[i]);
+   }
+}
+
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
 #if IS_BUILTIN(CONFIG_LKDTM)
-- 
2.26.1



[PATCH v5 8/8] lkdtm/powerpc: Fix code patching hijack test

2021-07-12 Thread Christopher M. Riedl
Code patching on powerpc with STRICT_KERNEL_RWX now uses a userspace
address in a temporary mm on Radix. Use __put_user() to avoid write
failures due to KUAP when attempting a "hijack" on the patching address.
__put_user() also works with the non-userspace, vmalloc-based patching
address on non-Radix MMUs.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 41e87e5f9cc86..da6a34a0a49fb 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -262,16 +262,7 @@ static inline u32 lkdtm_read_patch_site(void)
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
-#ifdef CONFIG_PPC
-   __put_kernel_nofault(addr, &data, u32, err);
-   return true;
-
-err:
-   return false;
-#endif
-#ifdef CONFIG_X86_64
return !__put_user(data, addr);
-#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
-- 
2.26.1



[PATCH v5 1/8] powerpc: Add LKDTM accessor for patching addr

2021-07-12 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/code-patching.h | 4 
 arch/powerpc/lib/code-patching.c | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index a95f63788c6b1..16fbc58a4932f 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -184,4 +184,8 @@ static inline unsigned long ppc_kallsyms_lookup_name(const 
char *name)
#define PPC_INST_STD_LR    PPC_RAW_STD(_R0, _R1, PPC_LR_STKOFF)
 #endif /* CONFIG_PPC64 */
 
+#if IS_BUILTIN(CONFIG_LKDTM) && IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_POWERPC_CODE_PATCHING_H */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index f9a3019e37b43..54b6157d44e95 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -47,6 +47,13 @@ int raw_patch_instruction(u32 *addr, struct ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
-- 
2.26.1



[PATCH v5 7/8] powerpc/64s: Initialize and use a temporary mm for patching on Radix

2021-07-12 Thread Christopher M. Riedl
When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE.

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31

The upper limit is DEFAULT_MAP_WINDOW due to how the Book3s64 Hash MMU
operates - by default the space above DEFAULT_MAP_WINDOW is not
available. Currently the Hash MMU does not use a temporary mm so
technically this upper limit isn't necessary; however, a larger
randomization range does not further "harden" this overall approach and
future work may introduce patching with a temporary mm on Hash as well.

Randomization occurs only once during initialization at boot for each
possible CPU in the system.
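
As an illustration, the per-CPU address selection consistent with the
range above boils down to something like the following sketch (using the
symbols named in this commit message; masking with PAGE_MASK and taking
the modulus of a PAGE_SIZE multiple keeps the result page-aligned):

	patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) %
				     (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));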

Introduce two new functions, map_patch() and unmap_patch(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. Map the page with PAGE_KERNEL to set
EAA[0] for the PTE which ignores the AMR (so no need to unlock/lock
KUAP) according to PowerISA v3.0b Figure 35 on Radix.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 

---

v5:  * Only support Book3s64 Radix MMU for now.
 * Use a per-cpu datastructure to hold the patching_addr and
   patching_mm to avoid the need for a synchronization lock/mutex.

v4:  * In the previous series this was two separate patches: one to init
   the temporary mm in poking_init() (unused in powerpc at the time)
   and the other to use it for patching (which removed all the
   per-cpu vmalloc code). Now that we use poking_init() in the
   existing per-cpu vmalloc approach, that separation doesn't work
   as nicely anymore so I just merged the two patches into one.
 * Preload the SLB entry and hash the page for the patching_addr
   when using Hash on book3s64 to avoid taking an SLB and Hash fault
   during patching. The previous implementation was a hack which
   changed current->mm to allow the SLB and Hash fault handlers to
   work with the temporary mm since both of those code-paths always
   assume mm == current->mm.
 * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
   have to manage the mm->context.active_cpus counter and mm cpumask
   since they determine (via mm_is_thread_local()) if the TLB flush
   in pte_clear() is local or not - it should always be local when
   we're using the temporary mm. On book3s64's Radix MMU we can
   just call local_flush_tlb_mm().
 * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
   KUAP.
---
 arch/powerpc/lib/code-patching.c | 132 +--
 1 file changed, 125 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 9f2eba9b70ee4..027dabd42b8dd 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -103,6 +104,7 @@ static inline void unuse_temporary_mm(struct temp_mm 
*temp_mm)
 
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 static DEFINE_PER_CPU(unsigned long, cpu_patching_addr);
+static DEFINE_PER_CPU(struct mm_struct *, cpu_patching_mm);
 
 #if IS_BUILTIN(CONFIG_LKDTM)
 unsigned long read_cpu_patching_addr(unsigned int cpu)
@@ -133,6 +135,51 @@ static int text_area_cpu_down(unsigned int cpu)
return 0;
 }
 
+static __always_inline void __poking_init_temp_mm(void)
+{
+   int cpu;
+   spinlock_t *ptl; /* for protecting pte table */
+   pte_t *ptep;
+   struct mm_struct *patching_mm;
+   unsigned long patching_addr;
+
+   for_each_possible_cpu(cpu) {
+   /*
+* Some parts of the kernel (static keys for example) depend on
+* successful code patching. Code patching under
+* 

[PATCH v5 4/8] lkdtm/x86_64: Add test to hijack a patch mapping

2021-07-12 Thread Christopher M. Riedl
A previous commit implemented an LKDTM test on powerpc to exploit the
temporary mapping established when patching code with STRICT_KERNEL_RWX
enabled. Extend the test to work on x86_64 as well.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 26 ++
 1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 39e7456852229..41e87e5f9cc86 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -224,7 +224,7 @@ void lkdtm_ACCESS_NULL(void)
 }
 
 #if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
-   defined(CONFIG_PPC))
+   (defined(CONFIG_PPC) || defined(CONFIG_X86_64)))
 /*
  * This is just a dummy location to patch-over.
  */
@@ -233,12 +233,25 @@ static void patching_target(void)
return;
 }
 
-#include 
 const u32 *patch_site = (const u32 *)&patching_target;
 
+#ifdef CONFIG_PPC
+#include 
+#endif
+
+#ifdef CONFIG_X86_64
+#include 
+#endif
+
 static inline int lkdtm_do_patch(u32 data)
 {
+#ifdef CONFIG_PPC
return patch_instruction((u32 *)patch_site, ppc_inst(data));
+#endif
+#ifdef CONFIG_X86_64
+   text_poke((void *)patch_site, &data, sizeof(u32));
+   return 0;
+#endif
 }
 
 static inline u32 lkdtm_read_patch_site(void)
@@ -249,11 +262,16 @@ static inline u32 lkdtm_read_patch_site(void)
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
+#ifdef CONFIG_PPC
__put_kernel_nofault(addr, &data, u32, err);
return true;
 
 err:
return false;
+#endif
+#ifdef CONFIG_X86_64
+   return !__put_user(data, addr);
+#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
@@ -346,8 +364,8 @@ void lkdtm_HIJACK_PATCH(void)
 
 void lkdtm_HIJACK_PATCH(void)
 {
-   if (!IS_ENABLED(CONFIG_PPC))
-   pr_err("XFAIL: this test only runs on powerpc\n");
+   if (!IS_ENABLED(CONFIG_PPC) && !IS_ENABLED(CONFIG_X86_64))
+   pr_err("XFAIL: this test only runs on powerpc and x86_64\n");
if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
if (!IS_BUILTIN(CONFIG_LKDTM))
-- 
2.26.1



[PATCH v5 0/8] Use per-CPU temporary mappings for patching on Radix MMU

2021-07-12 Thread Christopher M. Riedl
tic and move
  '__ro_after_init' to after the variable name (more common in
  other parts of the kernel)
* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
  fix PPC64e compile
* Add comment explaining why we use BUG_ON() during the init
  call to setup for patching later
* Move ptep into patch_mapping to avoid walking page tables a
  second time when unmapping the temporary mapping
* Use KUAP under non-radix, also manually dirty the PTE for patch
  mapping on non-BOOK3S_64 platforms
* Properly return any error from __patch_instruction
* Do not use 'memcmp' where a simple comparison is appropriate
* Simplify expression for patch address by removing pointer maths
* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: 
https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Christopher M. Riedl (8):
  powerpc: Add LKDTM accessor for patching addr
  lkdtm/powerpc: Add test to hijack a patch mapping
  x86_64: Add LKDTM accessor for patching addr
  lkdtm/x86_64: Add test to hijack a patch mapping
  powerpc/64s: Introduce temporary mm for Radix MMU
  powerpc: Rework and improve STRICT_KERNEL_RWX patching
  powerpc/64s: Initialize and use a temporary mm for patching on Radix
  lkdtm/powerpc: Fix code patching hijack test

 arch/powerpc/include/asm/code-patching.h |   4 +
 arch/powerpc/include/asm/debug.h |   1 +
 arch/powerpc/kernel/process.c|   5 +
 arch/powerpc/lib/code-patching.c | 240 ---
 arch/x86/include/asm/text-patching.h |   4 +
 arch/x86/kernel/alternative.c|   7 +
 drivers/misc/lkdtm/core.c|   1 +
 drivers/misc/lkdtm/lkdtm.h   |   1 +
 drivers/misc/lkdtm/perms.c   | 143 ++
 9 files changed, 378 insertions(+), 28 deletions(-)

-- 
2.26.1



[PATCH v5 2/8] lkdtm/powerpc: Add test to hijack a patch mapping

2021-07-12 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX the CPU doing the patching
must temporarily remap the page(s) containing the patch site with +W
permissions. While this temporary mapping is in use, another CPU could
write to the same mapping and maliciously alter kernel text. Implement a
LKDTM test to attempt to exploit such an opening during code patching.
The test is implemented on powerpc and requires LKDTM built into the
kernel (building LKDTM as a module is insufficient).

The LKDTM "hijack" test works as follows:

  1. A CPU executes an infinite loop to patch an instruction. This is
 the "patching" CPU.
  2. Another CPU attempts to write to the address of the temporary
 mapping used by the "patching" CPU. This other CPU is the
 "hijacker" CPU. The hijack either fails with a fault/error or
 succeeds, in which case some kernel text is now overwritten.

The virtual address of the temporary patch mapping is provided via an
LKDTM-specific accessor to the hijacker CPU. This test assumes a
hypothetical situation where this address was leaked previously.

How to run the test:

mount -t debugfs none /sys/kernel/debug
(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

A passing test indicates that it is not possible to overwrite kernel
text from another CPU by using the temporary mapping established by
a CPU for patching.
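
Purely for illustration - this is not the code in the diff below, and the
function name and written value are made up - the "hijacker" side from
step 2 can be pictured as a kthread loop along these lines:

static int lkdtm_hijacker_cpu(void *data)
{
	u32 *patch_addr = (u32 *)read_cpu_patching_addr((unsigned int)(unsigned long)data);

	/* Race the patching CPU through its (leaked) temporary mapping */
	while (!kthread_should_stop()) {
		if (lkdtm_try_write(0xbadc0ded, patch_addr))
			break;	/* the write landed - kernel text is now overwritten */
	}

	return 0;
}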

Signed-off-by: Christopher M. Riedl 

---

v5:  * Use `u32*` instead of `struct ppc_inst*` based on new series in
   upstream.

v4:  * Separate the powerpc and x86_64 bits into individual patches.
 * Use __put_kernel_nofault() when attempting to hijack the mapping
 * Use raw_smp_processor_id() to avoid triggering the BUG() when
   calling smp_processor_id() in preemptible code - the only thing
   that matters is that one of the threads is bound to a different
   CPU - we are not using smp_processor_id() to access any per-cpu
   data or similar where preemption should be disabled.
 * Rework the patching_cpu() kthread stop condition to avoid:
   https://lwn.net/Articles/628628/
---
 drivers/misc/lkdtm/core.c  |   1 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c | 134 +
 3 files changed, 136 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index 8024b6a5cc7fc..fbcb95eda337b 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -147,6 +147,7 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
CRASHTYPE(WRITE_KERN),
+   CRASHTYPE(HIJACK_PATCH),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 99f90d3e5e9cb..87e7e6136d962 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -62,6 +62,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 2dede2ef658f3..39e7456852229 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -222,6 +223,139 @@ void lkdtm_ACCESS_NULL(void)
pr_err("FAIL: survived bad write\n");
 }
 
+#if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
+   defined(CONFIG_PPC))
+/*
+ * This is just a dummy location to patch-over.
+ */
+static void patching_target(void)
+{
+   return;
+}
+
+#include 
+const u32 *patch_site = (const u32 *)&patching_target;
+
+static inline int lkdtm_do_patch(u32 data)
+{
+   return patch_instruction((u32 *)patch_site, ppc_inst(data));
+}
+
+static inline u32 lkdtm_read_patch_site(void)
+{
+   return READ_ONCE(*patch_site);
+}
+
+/* Returns True if the write succeeds */
+static inline bool lkdtm_try_write(u32 data, u32 *addr)
+{
+   __put_kernel_nofault(addr, &data, u32, err);
+   return true;
+
+err:
+   return false;
+}
+
+static int lkdtm_patching_cpu(void *data)
+{
+   int err = 0;
+   u32 val = 0xdeadbeef;
+
+   pr_info("starting patching_cpu=%d\n", raw_smp_processor_id());
+
+   do {
+   err = lkdtm_do_patch(val);
+   } while (lkdtm_read_patch_site() == val && !err && 
!kthread_should_stop());
+
+   if (err)
+   pr_warn("XFAIL: patch_instruction returned error: %d\n", err);
+
+   while (!kthread_should_stop()) {
+   set_current_state(TASK_INTERRUPTIBLE);
+   schedule();

[PATCH v5 3/8] x86_64: Add LKDTM accessor for patching addr

2021-07-12 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/x86/include/asm/text-patching.h | 4 
 arch/x86/kernel/alternative.c| 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/text-patching.h 
b/arch/x86/include/asm/text-patching.h
index b7421780e4e92..f0caf9ee13bd8 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -167,4 +167,8 @@ void int3_emulate_ret(struct pt_regs *regs)
 }
 #endif /* !CONFIG_UML_X86 */
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index e9da3dc712541..28bb92b695639 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -773,6 +773,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t 
prev_state)
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return poking_addr;
+}
+#endif
+
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
-- 
2.26.1



Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-08 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 2:51 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 5:02 pm:
> > On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> >> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> >> > address to be patched is temporarily mapped as writeable. Currently, a
> >> > per-cpu vmalloc patch area is used for this purpose. While the patch
> >> > area is per-cpu, the temporary page mapping is inserted into the kernel
> >> > page tables for the duration of patching. The mapping is exposed to CPUs
> >> > other than the patching CPU - this is undesirable from a hardening
> >> > perspective. Use a temporary mm instead which keeps the mapping local to
> >> > the CPU doing the patching.
> >> > 
> >> > Use the `poking_init` init hook to prepare a temporary mm and patching
> >> > address. Initialize the temporary mm by copying the init mm. Choose a
> >> > randomized patching address inside the temporary mm userspace address
> >> > space. The patching address is randomized between PAGE_SIZE and
> >> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> >> > the Book3s64 Hash MMU operates - by default the space above
> >> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> >> > all platforms/MMUs is randomized inside this range.  The number of
> >> > possible random addresses is dependent on PAGE_SIZE and limited by
> >> > DEFAULT_MAP_WINDOW.
> >> > 
> >> > Bits of entropy with 64K page size on BOOK3S_64:
> >> > 
> >> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> >> > 
> >> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> >> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
> >> > 
> >> > Randomization occurs only once during initialization at boot.
> >> > 
> >> > Introduce two new functions, map_patch() and unmap_patch(), to
> >> > respectively create and remove the temporary mapping with write
> >> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> >> > the page for patching with PAGE_SHARED since the kernel cannot access
> >> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> >> > 
> >> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> >> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> >> > taking an SLB and Hash fault during patching.
> >>
> >> What prevents the SLBE or HPTE from being removed before the last
> >> access?
> > 
> > This code runs with local IRQs disabled - we also don't access anything
> > else in userspace so I'm not sure what else could cause the entries to
> > be removed TBH.
> > 
> >>
> >>
> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> > +
> >> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >> >  {
> >> > -struct vm_struct *area;
> >> > +int err;
> >> >  
> >> > -area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> >> > -if (!area) {
> >> > -WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> >> > -cpu);
> >> > -return -1;
> >> > -}
> >> > -this_cpu_write(text_poke_area, area);
> >> > +if (radix_enabled())
> >> > +return 0;
> >> >  
> >> > -return 0;
> >> > -}
> >> > +err = slb_allocate_user(patching_mm, patching_addr);
> >> > +if (err)
> >> > +pr_warn("map patch: failed to allocate slb entry\n");
> >> >  
> >> > -static int text_area_cpu_down(unsigned int cpu)
> >> > -{
> >> > -free_vm_area(this_cpu_read(text_poke_area));
> >> > -return 0;
> >> > +err = hash_page_mm(patching_mm, patching_addr, 
> >> > pgprot_val(pgprot), 0,
> >> > +   HPTE_USE_KERNEL_KEY);
> >> > +if (err)
> >> > +pr_warn("map patch: failed to insert hashed page\n");
> >> > +
> >> > +/* See comment in switch_slb() in mm/book3s64/slb.c */
> >> > +isync();
> >>
> >> I'm not sure if this is enough. Could we context switch here? You've
> >> got the PTL so no with a normal kernel but maybe yes with an RT kernel.
> >> How about taking a machine check that clears the SLB? Could the HPTE
> >> get removed by something else here?
> > 
> > All of this happens after a local_irq_save() which should at least
> > prevent context switches IIUC.
>
> Ah yeah I didn't look that far back. A machine check can take out SLB
> entries.
>
> > I am not sure what else could cause the
> > HPTE to get removed here.
>
> Other CPUs?
>

Right because the HPTEs are "global".

> >> You want to prevent faults because you might be patching a fault
> >> handler?
> > 
> > In a more general sense: I don't think we want to take page faults every
> > time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
> > fault handler codepath also checks `current->

Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-07-08 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 2:37 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 4:53 pm:
> > On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> >> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> >> "Christopher M. Riedl"  writes:
> >> >> >>
> >> >> >> > Switching to a different mm with Hash translation causes SLB 
> >> >> >> > entries to
> >> >> >> > be preloaded from the current thread_info. This reduces SLB 
> >> >> >> > faults, for
> >> >> >> > example when threads share a common mm but operate on different 
> >> >> >> > address
> >> >> >> > ranges.
> >> >> >> >
> >> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> >> > appropriate - such as when switching to a temporary mm. Introduce 
> >> >> >> > a new
> >> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also 
> >> >> >> > move the
> >> >> >> > SLB preload code into a separate function since switch_slb() is 
> >> >> >> > already
> >> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> >> > current thread_info struct) remains unchanged.
> >> >> >> >
> >> >> >> > Signed-off-by: Christopher M. Riedl 
> >> >> >> >
> >> >> >> > ---
> >> >> >> >
> >> >> >> > v4:  * New to series.
> >> >> >> > ---
> >> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
> >> >> >> > ++--
> >> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >> >
> >> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> >> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >> >u32 pkey_allocation_map;
> >> >> >> >s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >> >  #endif
> >> >> >> > +
> >> >> >> > +  /* Do not preload SLB entries from thread_info during 
> >> >> >> > switch_slb() */
> >> >> >> > +  bool skip_slb_preload;
> >> >> >> >  } mm_context_t;
> >> >> >> >  
> >> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> >> >> >> > b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct 
> >> >> >> > mm_struct *oldmm,
> >> >> >> >return 0;
> >> >> >> >  }
> >> >> >> >  
> >> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> >> > +
> >> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> >> > +{
> >> >> >> > +  mm->context.skip_slb_preload = true;
> >> >> >> > +}
> >

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-07-01 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:12 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of May 6, 2021 2:34 pm:
> > When code patching a STRICT_KERNEL_RWX kernel the page containing the
> > address to be patched is temporarily mapped as writeable. Currently, a
> > per-cpu vmalloc patch area is used for this purpose. While the patch
> > area is per-cpu, the temporary page mapping is inserted into the kernel
> > page tables for the duration of patching. The mapping is exposed to CPUs
> > other than the patching CPU - this is undesirable from a hardening
> > perspective. Use a temporary mm instead which keeps the mapping local to
> > the CPU doing the patching.
> > 
> > Use the `poking_init` init hook to prepare a temporary mm and patching
> > address. Initialize the temporary mm by copying the init mm. Choose a
> > randomized patching address inside the temporary mm userspace address
> > space. The patching address is randomized between PAGE_SIZE and
> > DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
> > the Book3s64 Hash MMU operates - by default the space above
> > DEFAULT_MAP_WINDOW is not available. For now, the patching address for
> > all platforms/MMUs is randomized inside this range.  The number of
> > possible random addresses is dependent on PAGE_SIZE and limited by
> > DEFAULT_MAP_WINDOW.
> > 
> > Bits of entropy with 64K page size on BOOK3S_64:
> > 
> > bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)
> > 
> > PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
> > bits of entropy = log2(128TB / 64K) bits of entropy = 31
> > 
> > Randomization occurs only once during initialization at boot.
> > 
> > Introduce two new functions, map_patch() and unmap_patch(), to
> > respectively create and remove the temporary mapping with write
> > permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
> > the page for patching with PAGE_SHARED since the kernel cannot access
> > userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.
> > 
> > Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
> > for the patching_addr when using the Hash MMU on Book3s64 to avoid
> > taking an SLB and Hash fault during patching.
>
> What prevents the SLBE or HPTE from being removed before the last
> access?

This code runs with local IRQs disabled - we also don't access anything
else in userspace so I'm not sure what else could cause the entries to
be removed TBH.

>
>
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -   struct vm_struct *area;
> > +   int err;
> >  
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > +   if (radix_enabled())
> > +   return 0;
> >  
> > -   return 0;
> > -}
> > +   err = slb_allocate_user(patching_mm, patching_addr);
> > +   if (err)
> > +   pr_warn("map patch: failed to allocate slb entry\n");
> >  
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > +   err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +  HPTE_USE_KERNEL_KEY);
> > +   if (err)
> > +   pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +   /* See comment in switch_slb() in mm/book3s64/slb.c */
> > +   isync();
>
> I'm not sure if this is enough. Could we context switch here? You've
> got the PTL so no with a normal kernel but maybe yes with an RT kernel.
> How about taking a machine check that clears the SLB? Could the HPTE
> get removed by something else here?

All of this happens after a local_irq_save() which should at least
prevent context switches IIUC. I am not sure what else could cause the
HPTE to get removed here.

>
> You want to prevent faults because you might be patching a fault
> handler?

In a more general sense: I don't think we want to take page faults every
time we patch an instruction with a STRICT_RWX kernel. The Hash MMU page
fault handler codepath also checks `current->mm` in some places which
won't match the temporary mm. Also `current->mm` can be NULL which
caused problems in my earlier revisions of this series.
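
For reference, the critical section being discussed has roughly this shape
(a simplified sketch of do_patch_instruction() with the temporary mm;
arguments and error handling elided):

	local_irq_save(flags);			/* no context switch from here on */
	map_patch(&patch_mapping);		/* switch to the temporary mm */
	__patch_instruction(addr, instr, patch_addr);
	unmap_patch(&patch_mapping);		/* tear down mapping, switch back */
	local_irq_restore(flags);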

>
> Thanks,
> Nick



Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-06-30 Thread Christopher M. Riedl
On Thu Jul 1, 2021 at 1:04 AM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 3:28 pm:
> > On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> >> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> >> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> >> "Christopher M. Riedl"  writes:
> >> >>
> >> >> > Switching to a different mm with Hash translation causes SLB entries 
> >> >> > to
> >> >> > be preloaded from the current thread_info. This reduces SLB faults, 
> >> >> > for
> >> >> > example when threads share a common mm but operate on different 
> >> >> > address
> >> >> > ranges.
> >> >> >
> >> >> > Preloading entries from the thread_info struct may not always be
> >> >> > appropriate - such as when switching to a temporary mm. Introduce a 
> >> >> > new
> >> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move 
> >> >> > the
> >> >> > SLB preload code into a separate function since switch_slb() is 
> >> >> > already
> >> >> > quite long. The default behavior (preloading SLB entries from the
> >> >> > current thread_info struct) remains unchanged.
> >> >> >
> >> >> > Signed-off-by: Christopher M. Riedl 
> >> >> >
> >> >> > ---
> >> >> >
> >> >> > v4:  * New to series.
> >> >> > ---
> >> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 
> >> >> > ++--
> >> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >> >
> >> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> >> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >> >   u32 pkey_allocation_map;
> >> >> >   s16 execute_only_pkey; /* key holding execute-only protection */
> >> >> >  #endif
> >> >> > +
> >> >> > + /* Do not preload SLB entries from thread_info during 
> >> >> > switch_slb() */
> >> >> > + bool skip_slb_preload;
> >> >> >  } mm_context_t;
> >> >> >  
> >> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> >> >> > b/arch/powerpc/include/asm/mmu_context.h
> >> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
> >> >> > *oldmm,
> >> >> >   return 0;
> >> >> >  }
> >> >> >  
> >> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> >> > +{
> >> >> > + mm->context.skip_slb_preload = true;
> >> >> > +}
> >> >> > +
> >> >> > +#else
> >> >> > +
> >> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> >> > +
> >> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> >> > +
> >> >> >  #include 
> >> >> >  
> >> >> >  #endif /* __KERNEL__ */
> >> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
> >> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > index c10fc8a72fb37..3479910264c59 100644
> >> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> >> > +++ b/arch/pow

Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-06-30 Thread Christopher M. Riedl
On Wed Jun 30, 2021 at 11:15 PM CDT, Nicholas Piggin wrote:
> Excerpts from Christopher M. Riedl's message of July 1, 2021 1:48 pm:
> > On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> >> "Christopher M. Riedl"  writes:
> >>
> >> > Switching to a different mm with Hash translation causes SLB entries to
> >> > be preloaded from the current thread_info. This reduces SLB faults, for
> >> > example when threads share a common mm but operate on different address
> >> > ranges.
> >> >
> >> > Preloading entries from the thread_info struct may not always be
> >> > appropriate - such as when switching to a temporary mm. Introduce a new
> >> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> >> > SLB preload code into a separate function since switch_slb() is already
> >> > quite long. The default behavior (preloading SLB entries from the
> >> > current thread_info struct) remains unchanged.
> >> >
> >> > Signed-off-by: Christopher M. Riedl 
> >> >
> >> > ---
> >> >
> >> > v4:  * New to series.
> >> > ---
> >> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >> >  arch/powerpc/mm/book3s64/slb.c   | 56 ++--
> >> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> >> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> >> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> >> > @@ -130,6 +130,9 @@ typedef struct {
> >> >  u32 pkey_allocation_map;
> >> >  s16 execute_only_pkey; /* key holding execute-only protection */
> >> >  #endif
> >> > +
> >> > +/* Do not preload SLB entries from thread_info during 
> >> > switch_slb() */
> >> > +bool skip_slb_preload;
> >> >  } mm_context_t;
> >> >  
> >> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> >> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> >> > b/arch/powerpc/include/asm/mmu_context.h
> >> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> >> > --- a/arch/powerpc/include/asm/mmu_context.h
> >> > +++ b/arch/powerpc/include/asm/mmu_context.h
> >> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
> >> > *oldmm,
> >> >  return 0;
> >> >  }
> >> >  
> >> > +#ifdef CONFIG_PPC_BOOK3S_64
> >> > +
> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> >> > +{
> >> > +mm->context.skip_slb_preload = true;
> >> > +}
> >> > +
> >> > +#else
> >> > +
> >> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> >> > +
> >> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> >> > +
> >> >  #include 
> >> >  
> >> >  #endif /* __KERNEL__ */
> >> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
> >> > b/arch/powerpc/mm/book3s64/mmu_context.c
> >> > index c10fc8a72fb37..3479910264c59 100644
> >> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> >> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> >> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct 
> >> > mm_struct *mm)
> >> >  atomic_set(&mm->context.active_cpus, 0);
> >> >  atomic_set(&mm->context.copros, 0);
> >> >  
> >> > +mm->context.skip_slb_preload = false;
> >> > +
> >> >  return 0;
> >> >  }
> >> >  
> >> > diff --git a/arch/powerpc/mm/book3s64/slb.c 
> >> > b/arch/powerpc/mm/book3s64/slb.c
> >> > index c91bd85eb90e3..da0836cb855af 100644
> >> > --- a/arch/powerpc/mm/book3s64/slb.c
> >> > +++ b/arch/powerpc/mm/book3s64/slb.c
> >> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int 
> >> > index)
> >> >  asm volatile("slbie %0" : : "r" (slbie_d

Re: [RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-06-30 Thread Christopher M. Riedl
On Sun Jun 20, 2021 at 10:19 PM CDT, Daniel Axtens wrote:
> Hi Chris,
>
> > +   /*
> > +* Choose a randomized, page-aligned address from the range:
> > +* [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE]
> > +* The lower address bound is PAGE_SIZE to avoid the zero-page.
> > +* The upper address bound is DEFAULT_MAP_WINDOW - PAGE_SIZE to stay
> > +* under DEFAULT_MAP_WINDOW with the Book3s64 Hash MMU.
> > +*/
> > +   patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK)
> > +   % (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));
>
> I checked and poking_init() comes after the functions that init the RNG,
> so this should be fine. The maths - while a bit fiddly to reason about -
> does check out.

Thanks for double checking.
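
For the record, the arithmetic being double-checked is simply:

	bits of entropy = log2(128TB / 64K) = log2(2^47 / 2^16) = 31

(the pages excluded at either end of the range do not change the result in
any meaningful way).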

>
> > +
> > +   /*
> > +* PTE allocation uses GFP_KERNEL which means we need to pre-allocate
> > +* the PTE here. We cannot do the allocation during patching with IRQs
> > +* disabled (ie. "atomic" context).
> > +*/
> > +   ptep = get_locked_pte(patching_mm, patching_addr, &ptl);
> > +   BUG_ON(!ptep);
> > +   pte_unmap_unlock(ptep, ptl);
> > +}
> >  
> >  #if IS_BUILTIN(CONFIG_LKDTM)
> >  unsigned long read_cpu_patching_addr(unsigned int cpu)
> >  {
> > -   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
> > +   return patching_addr;
> >  }
> >  #endif
> >  
> > -static int text_area_cpu_up(unsigned int cpu)
> > +struct patch_mapping {
> > +   spinlock_t *ptl; /* for protecting pte table */
> > +   pte_t *ptep;
> > +   struct temp_mm temp_mm;
> > +};
> > +
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline int hash_prefault_mapping(pgprot_t pgprot)
> >  {
> > -   struct vm_struct *area;
> > +   int err;
> >  
> > -   area = get_vm_area(PAGE_SIZE, VM_ALLOC);
> > -   if (!area) {
> > -   WARN_ONCE(1, "Failed to create text area for cpu %d\n",
> > -   cpu);
> > -   return -1;
> > -   }
> > -   this_cpu_write(text_poke_area, area);
> > +   if (radix_enabled())
> > +   return 0;
> >  
> > -   return 0;
> > -}
> > +   err = slb_allocate_user(patching_mm, patching_addr);
> > +   if (err)
> > +   pr_warn("map patch: failed to allocate slb entry\n");
> >  
>
> Here if slb_allocate_user() fails, you'll print a warning and then fall
> through to the rest of the function. You do return err, but there's a
> later call to hash_page_mm() that also sets err. Can slb_allocate_user()
> fail while hash_page_mm() succeeds, and would that be a problem?

Hmm, yes I think this is a problem. If slb_allocate_user() fails then we
could potentially mask that error until the actual patching
fails/miscompares later (and that *will* certainly fail in this case). I
will return the error and exit the function early in v5 of the series.
Thanks!
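
i.e. roughly the following shape - a sketch only, reusing the helpers from
the quoted v4 code above:

	err = slb_allocate_user(patching_mm, patching_addr);
	if (err) {
		pr_warn("map patch: failed to allocate slb entry\n");
		return err;
	}

	err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
			   HPTE_USE_KERNEL_KEY);
	if (err) {
		pr_warn("map patch: failed to insert hashed page\n");
		return err;
	}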

>
> > -static int text_area_cpu_down(unsigned int cpu)
> > -{
> > -   free_vm_area(this_cpu_read(text_poke_area));
> > -   return 0;
> > +   err = hash_page_mm(patching_mm, patching_addr, pgprot_val(pgprot), 0,
> > +  HPTE_USE_KERNEL_KEY);
> > +   if (err)
> > +   pr_warn("map patch: failed to insert hashed page\n");
> > +
> > +   /* See comment in switch_slb() in mm/book3s64/slb.c */
> > +   isync();
> > +
>
> The comment reads:
>
> /*
> * Synchronize slbmte preloads with possible subsequent user memory
> * address accesses by the kernel (user mode won't happen until
> * rfid, which is safe).
> */
> isync();
>
> I have to say having read the description of isync I'm not 100% sure why
> that's enough (don't we also need stores to complete?) but I'm happy to
> take commit 5434ae74629a ("powerpc/64s/hash: Add a SLB preload cache")
> on trust here!
>
> I think it does make sense for you to have that barrier here: you are
> potentially about to start poking at the memory mapped through that SLB
> entry so you should make sure you're fully synchronised.
>
> > +   return err;
> >  }
> >  
>
> > +   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
> > +   use_temporary_mm(&patch_mapping->temp_mm);
> >  
> > -   pmdp = pmd_offset(pudp, addr);
> > -   if (unlikely(!pmdp))
> > -   return -EINVAL;
> > +   /*
> > +* On Book3s64 with the Hash MMU we have to manually insert the SLB
> > +* entry and HPTE to prevent taking faults on the patching_addr later.
> > +*/
> > +   return(hash_prefault_mapping(pgprot));
>
> hmm, `return hash_prefault_mapping(pgprot);` or
> `return (hash_prefault_mapping((pgprot));` maybe?

Yeah, I noticed I left the extra parentheses here after the RESEND. I
think this is left over from when I had another wrapper here... anyway, I'll
clean it up for v5.

>
> Kind regards,
> Daniel



Re: [RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-06-30 Thread Christopher M. Riedl
On Sun Jun 20, 2021 at 10:13 PM CDT, Daniel Axtens wrote:
> "Christopher M. Riedl"  writes:
>
> > Switching to a different mm with Hash translation causes SLB entries to
> > be preloaded from the current thread_info. This reduces SLB faults, for
> > example when threads share a common mm but operate on different address
> > ranges.
> >
> > Preloading entries from the thread_info struct may not always be
> > appropriate - such as when switching to a temporary mm. Introduce a new
> > boolean in mm_context_t to skip the SLB preload entirely. Also move the
> > SLB preload code into a separate function since switch_slb() is already
> > quite long. The default behavior (preloading SLB entries from the
> > current thread_info struct) remains unchanged.
> >
> > Signed-off-by: Christopher M. Riedl 
> >
> > ---
> >
> > v4:  * New to series.
> > ---
> >  arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
> >  arch/powerpc/include/asm/mmu_context.h   | 13 ++
> >  arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
> >  arch/powerpc/mm/book3s64/slb.c   | 56 ++--
> >  4 files changed, 50 insertions(+), 24 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
> > b/arch/powerpc/include/asm/book3s/64/mmu.h
> > index eace8c3f7b0a1..b23a9dcdee5af 100644
> > --- a/arch/powerpc/include/asm/book3s/64/mmu.h
> > +++ b/arch/powerpc/include/asm/book3s/64/mmu.h
> > @@ -130,6 +130,9 @@ typedef struct {
> > u32 pkey_allocation_map;
> > s16 execute_only_pkey; /* key holding execute-only protection */
> >  #endif
> > +
> > +   /* Do not preload SLB entries from thread_info during switch_slb() */
> > +   bool skip_slb_preload;
> >  } mm_context_t;
> >  
> >  static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
> > diff --git a/arch/powerpc/include/asm/mmu_context.h 
> > b/arch/powerpc/include/asm/mmu_context.h
> > index 4bc45d3ed8b0e..264787e90b1a1 100644
> > --- a/arch/powerpc/include/asm/mmu_context.h
> > +++ b/arch/powerpc/include/asm/mmu_context.h
> > @@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct 
> > *oldmm,
> > return 0;
> >  }
> >  
> > +#ifdef CONFIG_PPC_BOOK3S_64
> > +
> > +static inline void skip_slb_preload_mm(struct mm_struct *mm)
> > +{
> > +   mm->context.skip_slb_preload = true;
> > +}
> > +
> > +#else
> > +
> > +static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
> > +
> > +#endif /* CONFIG_PPC_BOOK3S_64 */
> > +
> >  #include 
> >  
> >  #endif /* __KERNEL__ */
> > diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
> > b/arch/powerpc/mm/book3s64/mmu_context.c
> > index c10fc8a72fb37..3479910264c59 100644
> > --- a/arch/powerpc/mm/book3s64/mmu_context.c
> > +++ b/arch/powerpc/mm/book3s64/mmu_context.c
> > @@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct 
> > mm_struct *mm)
> > atomic_set(&mm->context.active_cpus, 0);
> > atomic_set(&mm->context.copros, 0);
> >  
> > +   mm->context.skip_slb_preload = false;
> > +
> > return 0;
> >  }
> >  
> > diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
> > index c91bd85eb90e3..da0836cb855af 100644
> > --- a/arch/powerpc/mm/book3s64/slb.c
> > +++ b/arch/powerpc/mm/book3s64/slb.c
> > @@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
> > asm volatile("slbie %0" : : "r" (slbie_data));
> >  }
> >  
> > +static void preload_slb_entries(struct task_struct *tsk, struct mm_struct 
> > *mm)
> Should this be explicitly inline or even __always_inline? I'm thinking
> switch_slb is probably a fairly hot path on hash?

Yes absolutely. I'll make this change in v5.

>
> > +{
> > +   struct thread_info *ti = task_thread_info(tsk);
> > +   unsigned char i;
> > +
> > +   /*
> > +* We gradually age out SLBs after a number of context switches to
> > +* reduce reload overhead of unused entries (like we do with FP/VEC
> > +* reload). Each time we wrap 256 switches, take an entry out of the
> > +* SLB preload cache.
> > +*/
> > +   tsk->thread.load_slb++;
> > +   if (!tsk->thread.load_slb) {
> > +   unsigned long pc = KSTK_EIP(tsk);
> > +
> > +   preload_age(ti);
> > +   preload_add(ti, pc);
> > +   }
> > +
> > +   for (i = 0; i < ti->slb_prelo

Re: [RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock

2021-05-07 Thread Christopher M. Riedl
On Thu May 6, 2021 at 5:51 AM CDT, Peter Zijlstra wrote:
> On Wed, May 05, 2021 at 11:34:51PM -0500, Christopher M. Riedl wrote:
> > Powerpc allows for multiple CPUs to patch concurrently. When patching
> > with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
> > CPUs for the few times that patching occurs. Use a spinlock to protect
> > the patching_mm from concurrent use.
> > 
> > Modify patch_instruction() to acquire the lock, perform the patch op,
> > and then release the lock.
> > 
> > Also introduce {lock,unlock}_patching() along with
> > patch_instruction_unlocked() to avoid per-iteration lock overhead when
> > patch_instruction() is called in a loop. A follow-up patch converts some
> > uses of patch_instruction() to use patch_instruction_unlocked() instead.
>
> x86 uses text_mutex for all this, why not do the same?

I wasn't entirely sure if there is a problem with potentially going to
sleep in some of the places where patch_instruction() is called - the
spinlock avoids that (hypothetical) problem.

I just tried switching to text_mutex and at least on a P9 machine the
series boots w/ the Hash and Radix MMUs (with some lockdep errors). I
can rework this in the next version to use text_mutex if I don't find
any new problems with more extensive testing. It does mean more changes
to use patch_instruction_unlocked() in kprobe/optprobe/ftrace in
arch/powerpc since iirc those are called with text_mutex already held.

Thanks!
Chris R.


[RESEND PATCH v4 00/11] Use per-CPU temporary mappings for patching

2021-05-05 Thread Christopher M. Riedl
tch_instruction
* Do not use 'memcmp' where a simple comparison is appropriate
* Simplify expression for patch address by removing pointer maths
* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: 
https://lore.kernel.org/kernel-hardening/20190426232303.28381-1-nadav.a...@gmail.com/

Christopher M. Riedl (11):
  powerpc: Add LKDTM accessor for patching addr
  lkdtm/powerpc: Add test to hijack a patch mapping
  x86_64: Add LKDTM accessor for patching addr
  lkdtm/x86_64: Add test to hijack a patch mapping
  powerpc/64s: Add ability to skip SLB preload
  powerpc: Introduce temporary mm
  powerpc/64s: Make slb_allocate_user() non-static
  powerpc: Initialize and use a temporary mm for patching
  lkdtm/powerpc: Fix code patching hijack test
  powerpc: Protect patching_mm with a lock
  powerpc: Use patch_instruction_unlocked() in loops

 arch/powerpc/include/asm/book3s/64/mmu-hash.h |   1 +
 arch/powerpc/include/asm/book3s/64/mmu.h  |   3 +
 arch/powerpc/include/asm/code-patching.h  |   8 +
 arch/powerpc/include/asm/debug.h  |   1 +
 arch/powerpc/include/asm/mmu_context.h|  13 +
 arch/powerpc/kernel/epapr_paravirt.c  |   9 +-
 arch/powerpc/kernel/optprobes.c   |  22 +-
 arch/powerpc/kernel/process.c |   5 +
 arch/powerpc/lib/code-patching.c  | 348 +-
 arch/powerpc/lib/feature-fixups.c | 114 --
 arch/powerpc/mm/book3s64/mmu_context.c|   2 +
 arch/powerpc/mm/book3s64/slb.c|  60 +--
 arch/powerpc/xmon/xmon.c  |  22 +-
 arch/x86/include/asm/text-patching.h  |   4 +
 arch/x86/kernel/alternative.c |   7 +
 drivers/misc/lkdtm/core.c |   1 +
 drivers/misc/lkdtm/lkdtm.h|   1 +
 drivers/misc/lkdtm/perms.c| 149 
 18 files changed, 608 insertions(+), 162 deletions(-)

-- 
2.26.1



[RESEND PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops

2021-05-05 Thread Christopher M. Riedl
Now that patching requires a lock to prevent concurrent access to
patching_mm, every call to patch_instruction() acquires and releases a
spinlock. There are several places where patch_instruction() is called
in a loop. Convert these to acquire the lock once before the loop, call
patch_instruction_unlocked() in the loop body, and then release the lock
again after the loop terminates - as in:

for (i = 0; i < n; ++i)
patch_instruction(...); <-- lock/unlock every iteration

changes to:

flags = lock_patching(); <-- lock once

for (i = 0; i < n; ++i)
patch_instruction_unlocked(...);

unlock_patching(flags); <-- unlock once
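
For reference, both helpers are essentially thin wrappers around the
patching spinlock taken with interrupts disabled - roughly (sketch only,
assuming the patching_lock introduced in the previous patch; see that
patch for the real implementation):

	unsigned long lock_patching(void)
	{
		unsigned long flags;

		spin_lock_irqsave(&patching_lock, flags);
		return flags;
	}

	void unlock_patching(unsigned long flags)
	{
		spin_unlock_irqrestore(&patching_lock, flags);
	}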

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/kernel/epapr_paravirt.c |   9 ++-
 arch/powerpc/kernel/optprobes.c  |  22 --
 arch/powerpc/lib/feature-fixups.c| 114 +++
 arch/powerpc/xmon/xmon.c |  22 --
 4 files changed, 120 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/kernel/epapr_paravirt.c 
b/arch/powerpc/kernel/epapr_paravirt.c
index 2ed14d4a47f59..b639e71cf9dec 100644
--- a/arch/powerpc/kernel/epapr_paravirt.c
+++ b/arch/powerpc/kernel/epapr_paravirt.c
@@ -28,6 +28,7 @@ static int __init early_init_dt_scan_epapr(unsigned long node,
const u32 *insts;
int len;
int i;
+   unsigned long flags;
 
insts = of_get_flat_dt_prop(node, "hcall-instructions", &len);
if (!insts)
@@ -36,14 +37,18 @@ static int __init early_init_dt_scan_epapr(unsigned long 
node,
if (len % 4 || len > (4 * 4))
return -1;
 
+   flags = lock_patching();
+
for (i = 0; i < (len / 4); i++) {
struct ppc_inst inst = ppc_inst(be32_to_cpu(insts[i]));
-   patch_instruction((struct ppc_inst *)(epapr_hypercall_start + 
i), inst);
+   patch_instruction_unlocked((struct ppc_inst 
*)(epapr_hypercall_start + i), inst);
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
-   patch_instruction((struct ppc_inst *)(epapr_ev_idle_start + i), 
inst);
+   patch_instruction_unlocked((struct ppc_inst 
*)(epapr_ev_idle_start + i), inst);
 #endif
}
 
+   unlock_patching(flags);
+
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
if (of_get_flat_dt_prop(node, "has-idle", NULL))
epapr_has_idle = true;
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index cdf87086fa33a..deaeb6e8d1a00 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -200,7 +200,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
struct ppc_inst branch_op_callback, branch_emulate_step, temp;
kprobe_opcode_t *op_callback_addr, *emulate_step_addr, *buff;
long b_offset;
-   unsigned long nip, size;
+   unsigned long nip, size, flags;
int rc, i;
 
kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
@@ -237,13 +237,20 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
/* We can optimize this via patch_instruction_window later */
size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
pr_devel("Copying template to %p, size %lu\n", buff, size);
+
+   flags = lock_patching();
+
for (i = 0; i < size; i++) {
-   rc = patch_instruction((struct ppc_inst *)(buff + i),
-  ppc_inst(*(optprobe_template_entry + 
i)));
-   if (rc < 0)
+   rc = patch_instruction_unlocked((struct ppc_inst *)(buff + i),
+   
ppc_inst(*(optprobe_template_entry + i)));
+   if (rc < 0) {
+   unlock_patching(flags);
goto error;
+   }
}
 
+   unlock_patching(flags);
+
/*
 * Fixup the template with instructions to:
 * 1. load the address of the actual probepoint
@@ -322,6 +329,9 @@ void arch_optimize_kprobes(struct list_head *oplist)
struct ppc_inst instr;
struct optimized_kprobe *op;
struct optimized_kprobe *tmp;
+   unsigned long flags;
+
+   flags = lock_patching();
 
list_for_each_entry_safe(op, tmp, oplist, list) {
/*
@@ -333,9 +343,11 @@ void arch_optimize_kprobes(struct list_head *oplist)
create_branch(&instr,
  (struct ppc_inst *)op->kp.addr,
  (unsigned long)op->optinsn.insn, 0);
-   patch_instruction((struct ppc_inst *)op->kp.addr, instr);
+   patch_instruction_unlocked((struct ppc_inst *)op->kp.addr, 
instr);
list_del_init(&op->list);
}
+
+   unlock_pat

[RESEND PATCH v4 06/11] powerpc: Introduce temporary mm

2021-05-05 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. A side benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.

With the Book3s64 Hash MMU the SLB is preloaded with entries from the
current thread_info struct during switch_slb(). This could cause a
Machine Check (MCE) due to an SLB Multihit when creating arbitrary
userspace mappings in the temporary mm later. Disable SLB preload from
the thread_info struct for any temporary mm to avoid this.
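
The expected call sequence is small - roughly (illustrative sketch,
assuming a patching_mm prepared elsewhere as in a later patch):

	struct temp_mm temp_mm;
	unsigned long flags;

	init_temp_mm(&temp_mm, patching_mm);

	local_irq_save(flags);
	use_temporary_mm(&temp_mm);	/* saves and disables HW breakpoints */
	/* ... access mappings in the temporary mm ... */
	unuse_temporary_mm(&temp_mm);	/* restores HW breakpoints */
	local_irq_restore(flags);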

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 

---

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
   using/unusing the temp mm as suggested by Jann Horn to keep
   the context.active counter in-sync on mm/nohash.
 * Disable SLB preload in the temporary mm when initializing the
   temp_mm struct.
 * Include asm/debug.h header to fix build issue with
   ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c|  5 +++
 arch/powerpc/lib/code-patching.c | 67 
 3 files changed, 73 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c3..dfd82635ea8b3 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
*regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 89e34aa273e21..8e94cabaea3c3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -864,6 +864,11 @@ static inline int set_breakpoint_8xx(struct 
arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 2b1b3e9043ade..cbdfba8a39360 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst 
instr,
   struct ppc_inst *patch_addr)
@@ -46,6 +48,71 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   /* Do not preload SLB entries from the thread_info struct */
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
+   skip_slb_preload_mm(mm);
+
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->prev = current->active_mm;
+   switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+   WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+   if (ppc_breakpoint_available()) {
+   struct arch_hw_breakpoint null_brk = {0};
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i) {
+   __get_breakpoint(i, &temp_mm->brk[i]);
+   if (temp_mm->brk[i].type != 0)
+   __set_breakpoint(i, &null_brk);
+   }
+   }
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   switch_mm_irqs_off(

[RESEND PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static

2021-05-05 Thread Christopher M. Riedl
With Book3s64 Hash translation, manually inserting a PTE requires
updating the Linux PTE, inserting a SLB entry, and inserting the hashed
page. The first is handled via the usual kernel abstractions, the second
requires slb_allocate_user() which is currently 'static', and the third
is available via hash_page_mm() already.

Make slb_allocate_user() non-static and add a prototype so the next
patch can use it during code-patching.

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 +
 arch/powerpc/mm/book3s64/slb.c| 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3004f3323144d..189854eebba77 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -525,6 +525,7 @@ void slb_dump_contents(struct slb_entry *slb_ptr);
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
 void preload_new_slb_context(unsigned long start, unsigned long sp);
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index da0836cb855af..532eb51bc5211 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -29,8 +29,6 @@
 #include "internal.h"
 
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
-
 bool stress_slb_enabled __initdata;
 
 static int __init parse_stress_slb(char *p)
@@ -791,7 +789,7 @@ static long slb_allocate_kernel(unsigned long ea, unsigned 
long id)
return slb_insert_entry(ea, context, flags, ssize, true);
 }
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
 {
unsigned long context;
unsigned long flags;
-- 
2.26.1



[RESEND PATCH v4 10/11] powerpc: Protect patching_mm with a lock

2021-05-05 Thread Christopher M. Riedl
Powerpc allows for multiple CPUs to patch concurrently. When patching
with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
CPUs for the few times that patching occurs. Use a spinlock to protect
the patching_mm from concurrent use.

Modify patch_instruction() to acquire the lock, perform the patch op,
and then release the lock.

Also introduce {lock,unlock}_patching() along with
patch_instruction_unlocked() to avoid per-iteration lock overhead when
patch_instruction() is called in a loop. A follow-up patch converts some
uses of patch_instruction() to use patch_instruction_unlocked() instead.

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/code-patching.h |  4 ++
 arch/powerpc/lib/code-patching.c | 85 +---
 2 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index e51c81e4a9bda..2efa11b68cd8f 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -28,8 +28,12 @@ int create_branch(struct ppc_inst *instr, const struct 
ppc_inst *addr,
 int create_cond_branch(struct ppc_inst *instr, const struct ppc_inst *addr,
   unsigned long target, int flags);
 int patch_branch(struct ppc_inst *addr, unsigned long target, int flags);
+int patch_branch_unlocked(struct ppc_inst *addr, unsigned long target, int 
flags);
 int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+int patch_instruction_unlocked(struct ppc_inst *addr, struct ppc_inst instr);
 int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+unsigned long lock_patching(void);
+void unlock_patching(unsigned long flags);
 
 static inline unsigned long patch_site_addr(s32 *site)
 {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 7e15abc09ec04..0a496bb52bbf4 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -52,13 +52,17 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 
+static DEFINE_SPINLOCK(patching_lock);
+
 struct temp_mm {
struct mm_struct *temp;
struct mm_struct *prev;
struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+   spinlock_t *lock; /* protect access to the temporary mm */
 };
 
-static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm,
+   spinlock_t *lock)
 {
/* Do not preload SLB entries from the thread_info struct */
if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
@@ -66,12 +70,14 @@ static inline void init_temp_mm(struct temp_mm *temp_mm, 
struct mm_struct *mm)
 
temp_mm->temp = mm;
temp_mm->prev = NULL;
+   temp_mm->lock = lock;
memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
 }
 
 static inline void use_temporary_mm(struct temp_mm *temp_mm)
 {
lockdep_assert_irqs_disabled();
+   lockdep_assert_held(temp_mm->lock);
 
temp_mm->prev = current->active_mm;
switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
@@ -93,11 +99,13 @@ static inline void use_temporary_mm(struct temp_mm *temp_mm)
 static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
 {
lockdep_assert_irqs_disabled();
+   lockdep_assert_held(temp_mm->lock);
 
switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
 
/*
-* On book3s64 the active_cpus counter increments in
+* The temporary mm can only be in use on a single CPU at a time due to
+* the temp_mm->lock. On book3s64 the active_cpus counter increments in
 * switch_mm_irqs_off(). With the Hash MMU this counter affects if TLB
 * flushes are local. We have to manually decrement that counter here
 * along with removing our current CPU from the mm's cpumask so that in
@@ -230,7 +238,7 @@ static int map_patch(const void *addr, struct patch_mapping 
*patch_mapping)
pte = pte_mkdirty(pte);
set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
 
-   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+   init_temp_mm(&patch_mapping->temp_mm, patching_mm, &patching_lock);
use_temporary_mm(&patch_mapping->temp_mm);
 
/*
@@ -258,7 +266,6 @@ static int do_patch_instruction(struct ppc_inst *addr, 
struct ppc_inst instr)
 {
int err;
struct ppc_inst *patch_addr = NULL;
-   unsigned long flags;
struct patch_mapping patch_mapping;
 
/*
@@ -269,11 +276,12 @@ static int do_patch_instruction(struct ppc_inst *addr, 
struct ppc_inst instr)
if (!patching_mm)
return raw_patch_instruction(ad

[RESEND PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-05-05 Thread Christopher M. Riedl
When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
the Book3s64 Hash MMU operates - by default the space above
DEFAULT_MAP_WINDOW is not available. For now, the patching address for
all platforms/MMUs is randomized inside this range.  The number of
possible random addresses is dependent on PAGE_SIZE and limited by
DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31

Randomization occurs only once during initialization at boot.
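
For illustration, choosing such an address boils down to something like
the following (sketch only - the exact expression in the patch may
differ):

	/* Page-aligned address in [PAGE_SIZE, DEFAULT_MAP_WINDOW - PAGE_SIZE) */
	patching_addr = PAGE_SIZE + ((get_random_long() & PAGE_MASK) %
				     (DEFAULT_MAP_WINDOW - 2 * PAGE_SIZE));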

Introduce two new functions, map_patch() and unmap_patch(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
the page for patching with PAGE_SHARED since the kernel cannot access
userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.

Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
for the patching_addr when using the Hash MMU on Book3s64 to avoid
taking an SLB and Hash fault during patching.

Since patching_addr is now a userspace address, lock/unlock KUAP on
non-Book3s64 platforms. On Book3s64 with a Radix MMU, mapping the page
with PAGE_KERNEL sets EAA[0] for the PTE which ignores the AMR (KUAP)
according to PowerISA v3.0b Figure 35. On Book3s64 with a Hash MMU, the
hash PTE for the mapping is inserted with HPTE_USE_KERNEL_KEY which
similarly avoids the need for switching KUAP.

Finally, add a new WARN_ON() to check that the instruction was patched
as intended after the temporary mapping is torn down.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 

---

v4:  * In the previous series this was two separate patches: one to init
   the temporary mm in poking_init() (unused in powerpc at the time)
   and the other to use it for patching (which removed all the
   per-cpu vmalloc code). Now that we use poking_init() in the
   existing per-cpu vmalloc approach, that separation doesn't work
   as nicely anymore so I just merged the two patches into one.
 * Preload the SLB entry and hash the page for the patching_addr
   when using Hash on book3s64 to avoid taking an SLB and Hash fault
   during patching. The previous implementation was a hack which
   changed current->mm to allow the SLB and Hash fault handlers to
   work with the temporary mm since both of those code-paths always
   assume mm == current->mm.
 * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
   have to manage the mm->context.active_cpus counter and mm cpumask
   since they determine (via mm_is_thread_local()) if the TLB flush
   in pte_clear() is local or not - it should always be local when
   we're using the temporary mm. On book3s64's Radix MMU we can
   just call local_flush_tlb_mm().
 * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
   KUAP.
---
 arch/powerpc/lib/code-patching.c | 209 ++-
 1 file changed, 121 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index cbdfba8a39360..7e15abc09ec04 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -19,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst 
instr,
   struct ppc_inst *patch_addr)
@@ -113,113 +116,142 @@ static inline void unuse_temporary_mm(struct temp_mm 
*temp_mm)
}
 }
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static struct mm_struct *patching_mm __ro_after_init;
+static unsigned long patching_addr __ro_after_init;
+
+void __init poking_i

[RESEND PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping

2021-05-05 Thread Christopher M. Riedl
A previous commit implemented an LKDTM test on powerpc to exploit the
temporary mapping established when patching code with STRICT_KERNEL_RWX
enabled. Extend the test to work on x86_64 as well.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index c6f96ebffccfd..55c3bec6d3b72 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -224,7 +224,7 @@ void lkdtm_ACCESS_NULL(void)
 }
 
 #if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
-   defined(CONFIG_PPC))
+   (defined(CONFIG_PPC) || defined(CONFIG_X86_64)))
 /*
  * This is just a dummy location to patch-over.
  */
@@ -233,28 +233,51 @@ static void patching_target(void)
return;
 }
 
+#ifdef CONFIG_PPC
 #include 
 struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+#endif
+
+#ifdef CONFIG_X86_64
+#include 
+u32 * const patch_site = (u32 *)&patching_target;
+#endif
 
 static inline int lkdtm_do_patch(u32 data)
 {
+#ifdef CONFIG_PPC
return patch_instruction(patch_site, ppc_inst(data));
+#endif
+#ifdef CONFIG_X86_64
+   text_poke(patch_site, &data, sizeof(u32));
+   return 0;
+#endif
 }
 
 static inline u32 lkdtm_read_patch_site(void)
 {
+#ifdef CONFIG_PPC
struct ppc_inst inst = READ_ONCE(*patch_site);
return ppc_inst_val(ppc_inst_read(&inst));
+#endif
+#ifdef CONFIG_X86_64
+   return READ_ONCE(*patch_site);
+#endif
 }
 
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
+#ifdef CONFIG_PPC
__put_kernel_nofault(addr, &data, u32, err);
return true;
 
 err:
return false;
+#endif
+#ifdef CONFIG_X86_64
+   return !__put_user(data, addr);
+#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
@@ -347,8 +370,8 @@ void lkdtm_HIJACK_PATCH(void)
 
 void lkdtm_HIJACK_PATCH(void)
 {
-   if (!IS_ENABLED(CONFIG_PPC))
-   pr_err("XFAIL: this test only runs on powerpc\n");
+   if (!IS_ENABLED(CONFIG_PPC) && !IS_ENABLED(CONFIG_X86_64))
+   pr_err("XFAIL: this test only runs on powerpc and x86_64\n");
if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
if (!IS_BUILTIN(CONFIG_LKDTM))
-- 
2.26.1



[RESEND PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-05-05 Thread Christopher M. Riedl
Switching to a different mm with Hash translation causes SLB entries to
be preloaded from the current thread_info. This reduces SLB faults, for
example when threads share a common mm but operate on different address
ranges.

Preloading entries from the thread_info struct may not always be
appropriate - such as when switching to a temporary mm. Introduce a new
boolean in mm_context_t to skip the SLB preload entirely. Also move the
SLB preload code into a separate function since switch_slb() is already
quite long. The default behavior (preloading SLB entries from the
current thread_info struct) remains unchanged.
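
With this in place the preload in switch_slb() reduces to a simple
check - roughly (sketch of the resulting call site):

	if (!mm->context.skip_slb_preload)
		preload_slb_entries(tsk, mm);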

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/include/asm/mmu_context.h   | 13 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
 arch/powerpc/mm/book3s64/slb.c   | 56 ++--
 4 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index eace8c3f7b0a1..b23a9dcdee5af 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -130,6 +130,9 @@ typedef struct {
u32 pkey_allocation_map;
s16 execute_only_pkey; /* key holding execute-only protection */
 #endif
+
+   /* Do not preload SLB entries from thread_info during switch_slb() */
+   bool skip_slb_preload;
 } mm_context_t;
 
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 4bc45d3ed8b0e..264787e90b1a1 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+#ifdef CONFIG_PPC_BOOK3S_64
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm)
+{
+   mm->context.skip_slb_preload = true;
+}
+
+#else
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 #include 
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index c10fc8a72fb37..3479910264c59 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
atomic_set(&mm->context.active_cpus, 0);
atomic_set(&mm->context.copros, 0);
 
+   mm->context.skip_slb_preload = false;
+
return 0;
 }
 
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index c91bd85eb90e3..da0836cb855af 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
asm volatile("slbie %0" : : "r" (slbie_data));
 }
 
+static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
+{
+   struct thread_info *ti = task_thread_info(tsk);
+   unsigned char i;
+
+   /*
+* We gradually age out SLBs after a number of context switches to
+* reduce reload overhead of unused entries (like we do with FP/VEC
+* reload). Each time we wrap 256 switches, take an entry out of the
+* SLB preload cache.
+*/
+   tsk->thread.load_slb++;
+   if (!tsk->thread.load_slb) {
+   unsigned long pc = KSTK_EIP(tsk);
+
+   preload_age(ti);
+   preload_add(ti, pc);
+   }
+
+   for (i = 0; i < ti->slb_preload_nr; i++) {
+   unsigned char idx;
+   unsigned long ea;
+
+   idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
+   ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
+
+   slb_allocate_user(mm, ea);
+   }
+}
+
 /* Flush all user entries from the segment table of the current processor. */
 void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 {
-   struct thread_info *ti = task_thread_info(tsk);
unsigned char i;
 
/*
@@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct 
*mm)
 
copy_mm_to_paca(mm);
 
-   /*
-* We gradually age out SLBs after a number of context switches to
-* reduce reload overhead of unused entries (like we do with FP/VEC
-* reload). Each time we wrap 256 switches, take an entry out of the
-* SLB preload cache.
-*/
-   tsk->thread.load_slb++;
-   if (!tsk->thread.load_slb) {
-   unsigned long pc = KSTK_EIP(tsk);
-
-   preload_age(ti);
-   preload_add(ti, pc);
-   }
-
-   for (i = 0; i < ti->slb_preload_nr; i++) {
-   unsigned char idx;
-   unsigned long ea;
-
-

[RESEND PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr

2021-05-05 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/code-patching.h | 4 
 arch/powerpc/lib/code-patching.c | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index f1d029bf906e5..e51c81e4a9bda 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -188,4 +188,8 @@ static inline unsigned long ppc_kallsyms_lookup_name(const 
char *name)
 ___PPC_RA(__REG_R1) | PPC_LR_STKOFF)
 #endif /* CONFIG_PPC64 */
 
+#if IS_BUILTIN(CONFIG_LKDTM) && IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_POWERPC_CODE_PATCHING_H */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 870b30d9be2f8..2b1b3e9043ade 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -48,6 +48,13 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
-- 
2.26.1



[RESEND PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test

2021-05-05 Thread Christopher M. Riedl
Code patching on powerpc with a STRICT_KERNEL_RWX kernel now uses a
userspace address in a temporary mm. Use __put_user() to avoid write failures
due to KUAP when attempting a "hijack" on the patching address.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 55c3bec6d3b72..af9bf285fe326 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -268,16 +268,7 @@ static inline u32 lkdtm_read_patch_site(void)
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
-#ifdef CONFIG_PPC
-   __put_kernel_nofault(addr, &data, u32, err);
-   return true;
-
-err:
-   return false;
-#endif
-#ifdef CONFIG_X86_64
return !__put_user(data, addr);
-#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
-- 
2.26.1



[RESEND PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping

2021-05-05 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX the CPU doing the patching
must temporarily remap the page(s) containing the patch site with +W
permissions. While this temporary mapping is in use, another CPU could
write to the same mapping and maliciously alter kernel text. Implement a
LKDTM test to attempt to exploit such an opening during code patching.
The test is implemented on powerpc and requires LKDTM built into the
kernel (building LKDTM as a module is insufficient).

The LKDTM "hijack" test works as follows:

  1. A CPU executes an infinite loop to patch an instruction. This is
 the "patching" CPU.
  2. Another CPU attempts to write to the address of the temporary
 mapping used by the "patching" CPU. This other CPU is the
 "hijacker" CPU. The hijack either fails with a fault/error or
 succeeds, in which case some kernel text is now overwritten.

The virtual address of the temporary patch mapping is provided via an
LKDTM-specific accessor to the hijacker CPU. This test assumes a
hypothetical situation where this address was leaked previously.

How to run the test:

mount -t debugfs none /sys/kernel/debug
(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

A passing test indicates that it is not possible to overwrite kernel
text from another CPU by using the temporary mapping established by
a CPU for patching.

Signed-off-by: Christopher M. Riedl 

---

v4:  * Separate the powerpc and x86_64 bits into individual patches.
 * Use __put_kernel_nofault() when attempting to hijack the mapping
 * Use raw_smp_processor_id() to avoid triggering the BUG() when
   calling smp_processor_id() in preemptible code - the only thing
   that matters is that one of the threads is bound to a different
   CPU - we are not using smp_processor_id() to access any per-cpu
   data or similar where preemption should be disabled.
 * Rework the patching_cpu() kthread stop condition to avoid:
   https://lwn.net/Articles/628628/
---
 drivers/misc/lkdtm/core.c  |   1 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c | 135 +
 3 files changed, 137 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index b2aff4d87c014..857d218840eb8 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -146,6 +146,7 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
CRASHTYPE(WRITE_KERN),
+   CRASHTYPE(HIJACK_PATCH),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 5ae48c64df24d..c8de54d189c27 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -61,6 +61,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 2dede2ef658f3..c6f96ebffccfd 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -222,6 +223,140 @@ void lkdtm_ACCESS_NULL(void)
pr_err("FAIL: survived bad write\n");
 }
 
+#if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
+   defined(CONFIG_PPC))
+/*
+ * This is just a dummy location to patch-over.
+ */
+static void patching_target(void)
+{
+   return;
+}
+
+#include 
+struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+
+static inline int lkdtm_do_patch(u32 data)
+{
+   return patch_instruction(patch_site, ppc_inst(data));
+}
+
+static inline u32 lkdtm_read_patch_site(void)
+{
+   struct ppc_inst inst = READ_ONCE(*patch_site);
+   return ppc_inst_val(ppc_inst_read(&inst));
+}
+
+/* Returns True if the write succeeds */
+static inline bool lkdtm_try_write(u32 data, u32 *addr)
+{
+   __put_kernel_nofault(addr, &data, u32, err);
+   return true;
+
+err:
+   return false;
+}
+
+static int lkdtm_patching_cpu(void *data)
+{
+   int err = 0;
+   u32 val = 0xdeadbeef;
+
+   pr_info("starting patching_cpu=%d\n", raw_smp_processor_id());
+
+   do {
+   err = lkdtm_do_patch(val);
+   } while (lkdtm_read_patch_site() == val && !err && 
!kthread_should_stop());
+
+   if (err)
+   pr_warn("XFAIL: patch_instruction returned error: %d\n", err);
+
+   while (!kthread_should_stop()) {
+   set_current_state(TASK_INTERRUPTIBLE);
+   schedule

[RESEND PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr

2021-05-05 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/x86/include/asm/text-patching.h | 4 
 arch/x86/kernel/alternative.c| 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/text-patching.h 
b/arch/x86/include/asm/text-patching.h
index b7421780e4e92..f0caf9ee13bd8 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -167,4 +167,8 @@ void int3_emulate_ret(struct pt_regs *regs)
 }
 #endif /* !CONFIG_UML_X86 */
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d2..4c95fdd9b1965 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -852,6 +852,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t 
prev_state)
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return poking_addr;
+}
+#endif
+
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
-- 
2.26.1



[PATCH v4 06/11] powerpc: Introduce temporary mm

2021-04-29 Thread Christopher M. Riedl
x86 supports the notion of a temporary mm which restricts access to
temporary PTEs to a single CPU. A temporary mm is useful for situations
where a CPU needs to perform sensitive operations (such as patching a
STRICT_KERNEL_RWX kernel) requiring temporary mappings without exposing
said mappings to other CPUs. A side benefit is that other CPU TLBs do
not need to be flushed when the temporary mm is torn down.

Mappings in the temporary mm can be set in the userspace portion of the
address-space.

Interrupts must be disabled while the temporary mm is in use. HW
breakpoints, which may have been set by userspace as watchpoints on
addresses now within the temporary mm, are saved and disabled when
loading the temporary mm. The HW breakpoints are restored when unloading
the temporary mm. All HW breakpoints are indiscriminately disabled while
the temporary mm is in use.

With the Book3s64 Hash MMU the SLB is preloaded with entries from the
current thread_info struct during switch_slb(). This could cause a
Machine Check (MCE) due to an SLB Multihit when creating arbitrary
userspace mappings in the temporary mm later. Disable SLB preload from
the thread_info struct for any temporary mm to avoid this.

Based on x86 implementation:

commit cefa929c034e
("x86/mm: Introduce temporary mm structs")

Signed-off-by: Christopher M. Riedl 

---

v4:  * Pass the prev mm instead of NULL to switch_mm_irqs_off() when
   using/unusing the temp mm as suggested by Jann Horn to keep
   the context.active counter in-sync on mm/nohash.
 * Disable SLB preload in the temporary mm when initializing the
   temp_mm struct.
 * Include asm/debug.h header to fix build issue with
   ppc44x_defconfig.
---
 arch/powerpc/include/asm/debug.h |  1 +
 arch/powerpc/kernel/process.c|  5 +++
 arch/powerpc/lib/code-patching.c | 67 
 3 files changed, 73 insertions(+)

diff --git a/arch/powerpc/include/asm/debug.h b/arch/powerpc/include/asm/debug.h
index 86a14736c76c3..dfd82635ea8b3 100644
--- a/arch/powerpc/include/asm/debug.h
+++ b/arch/powerpc/include/asm/debug.h
@@ -46,6 +46,7 @@ static inline int debugger_fault_handler(struct pt_regs 
*regs) { return 0; }
 #endif
 
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk);
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk);
 bool ppc_breakpoint_available(void);
 #ifdef CONFIG_PPC_ADV_DEBUG_REGS
 extern void do_send_trap(struct pt_regs *regs, unsigned long address,
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 89e34aa273e21..8e94cabaea3c3 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -864,6 +864,11 @@ static inline int set_breakpoint_8xx(struct 
arch_hw_breakpoint *brk)
return 0;
 }
 
+void __get_breakpoint(int nr, struct arch_hw_breakpoint *brk)
+{
+   memcpy(brk, this_cpu_ptr(&current_brk[nr]), sizeof(*brk));
+}
+
 void __set_breakpoint(int nr, struct arch_hw_breakpoint *brk)
 {
memcpy(this_cpu_ptr(&current_brk[nr]), brk, sizeof(*brk));
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 2b1b3e9043ade..cbdfba8a39360 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -17,6 +17,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst 
instr,
   struct ppc_inst *patch_addr)
@@ -46,6 +48,71 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 }
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
+
+struct temp_mm {
+   struct mm_struct *temp;
+   struct mm_struct *prev;
+   struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+};
+
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+{
+   /* Do not preload SLB entries from the thread_info struct */
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
+   skip_slb_preload_mm(mm);
+
+   temp_mm->temp = mm;
+   temp_mm->prev = NULL;
+   memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
+}
+
+static inline void use_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   temp_mm->prev = current->active_mm;
+   switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
+
+   WARN_ON(!mm_is_thread_local(temp_mm->temp));
+
+   if (ppc_breakpoint_available()) {
+   struct arch_hw_breakpoint null_brk = {0};
+   int i = 0;
+
+   for (; i < nr_wp_slots(); ++i) {
+   __get_breakpoint(i, &temp_mm->brk[i]);
+   if (temp_mm->brk[i].type != 0)
+   __set_breakpoint(i, &null_brk);
+   }
+   }
+}
+
+static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
+{
+   lockdep_assert_irqs_disabled();
+
+   switch_mm_irqs_off(

[PATCH v4 08/11] powerpc: Initialize and use a temporary mm for patching

2021-04-29 Thread Christopher M. Riedl
When code patching a STRICT_KERNEL_RWX kernel the page containing the
address to be patched is temporarily mapped as writeable. Currently, a
per-cpu vmalloc patch area is used for this purpose. While the patch
area is per-cpu, the temporary page mapping is inserted into the kernel
page tables for the duration of patching. The mapping is exposed to CPUs
other than the patching CPU - this is undesirable from a hardening
perspective. Use a temporary mm instead which keeps the mapping local to
the CPU doing the patching.

Use the `poking_init` init hook to prepare a temporary mm and patching
address. Initialize the temporary mm by copying the init mm. Choose a
randomized patching address inside the temporary mm userspace address
space. The patching address is randomized between PAGE_SIZE and
DEFAULT_MAP_WINDOW-PAGE_SIZE. The upper limit is necessary due to how
the Book3s64 Hash MMU operates - by default the space above
DEFAULT_MAP_WINDOW is not available. For now, the patching address for
all platforms/MMUs is randomized inside this range.  The number of
possible random addresses is dependent on PAGE_SIZE and limited by
DEFAULT_MAP_WINDOW.

Bits of entropy with 64K page size on BOOK3S_64:

bits of entropy = log2(DEFAULT_MAP_WINDOW_USER64 / PAGE_SIZE)

PAGE_SIZE=64K, DEFAULT_MAP_WINDOW_USER64=128TB
bits of entropy = log2(128TB / 64K)
bits of entropy = 31

Randomization occurs only once during initialization at boot.

Introduce two new functions, map_patch() and unmap_patch(), to
respectively create and remove the temporary mapping with write
permissions at patching_addr. The Hash MMU on Book3s64 requires mapping
the page for patching with PAGE_SHARED since the kernel cannot access
userspace pages with the PAGE_PRIVILEGED (PAGE_KERNEL) bit set.

Also introduce hash_prefault_mapping() to preload the SLB entry and HPTE
for the patching_addr when using the Hash MMU on Book3s64 to avoid
taking an SLB and Hash fault during patching.

Since patching_addr is now a userspace address, lock/unlock KUAP on
non-Book3s64 platforms. On Book3s64 with a Radix MMU, mapping the page
with PAGE_KERNEL sets EAA[0] for the PTE which ignores the AMR (KUAP)
according to PowerISA v3.0b Figure 35. On Book3s64 with a Hash MMU, the
hash PTE for the mapping is inserted with HPTE_USE_KERNEL_KEY which
similarly avoids the need for switching KUAP.

Finally, add a new WARN_ON() to check that the instruction was patched
as intended after the temporary mapping is torn down.

Based on x86 implementation:

commit 4fc19708b165
("x86/alternatives: Initialize temporary mm for patching")

and:

commit b3fd8e83ada0
("x86/alternatives: Use temporary mm for text poking")

Signed-off-by: Christopher M. Riedl 

---

v4:  * In the previous series this was two separate patches: one to init
   the temporary mm in poking_init() (unused in powerpc at the time)
   and the other to use it for patching (which removed all the
   per-cpu vmalloc code). Now that we use poking_init() in the
   existing per-cpu vmalloc approach, that separation doesn't work
   as nicely anymore so I just merged the two patches into one.
 * Preload the SLB entry and hash the page for the patching_addr
   when using Hash on book3s64 to avoid taking an SLB and Hash fault
   during patching. The previous implementation was a hack which
   changed current->mm to allow the SLB and Hash fault handlers to
   work with the temporary mm since both of those code-paths always
   assume mm == current->mm.
 * Also (hmm - seeing a trend here) with the book3s64 Hash MMU we
   have to manage the mm->context.active_cpus counter and mm cpumask
   since they determine (via mm_is_thread_local()) if the TLB flush
   in pte_clear() is local or not - it should always be local when
   we're using the temporary mm. On book3s64's Radix MMU we can
   just call local_flush_tlb_mm().
 * Use HPTE_USE_KERNEL_KEY on Hash to avoid costly lock/unlock of
   KUAP.
---
 arch/powerpc/lib/code-patching.c | 209 ++-
 1 file changed, 121 insertions(+), 88 deletions(-)

diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index cbdfba8a39360..7e15abc09ec04 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -11,6 +11,8 @@
 #include 
 #include 
 #include 
+#include 
+#include 
 
 #include 
 #include 
@@ -19,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 
 static int __patch_instruction(struct ppc_inst *exec_addr, struct ppc_inst 
instr,
   struct ppc_inst *patch_addr)
@@ -113,113 +116,142 @@ static inline void unuse_temporary_mm(struct temp_mm 
*temp_mm)
}
 }
 
-static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
+static struct mm_struct *patching_mm __ro_after_init;
+static unsigned long patching_addr __ro_after_init;
+
+void __init poking_i

[PATCH v4 03/11] x86_64: Add LKDTM accessor for patching addr

2021-04-29 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/x86/include/asm/text-patching.h | 4 
 arch/x86/kernel/alternative.c| 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/x86/include/asm/text-patching.h 
b/arch/x86/include/asm/text-patching.h
index b7421780e4e92..f0caf9ee13bd8 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -167,4 +167,8 @@ void int3_emulate_ret(struct pt_regs *regs)
 }
 #endif /* !CONFIG_UML_X86 */
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 8d778e46725d2..4c95fdd9b1965 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -852,6 +852,13 @@ static inline void unuse_temporary_mm(temp_mm_state_t 
prev_state)
 __ro_after_init struct mm_struct *poking_mm;
 __ro_after_init unsigned long poking_addr;
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return poking_addr;
+}
+#endif
+
 static void *__text_poke(void *addr, const void *opcode, size_t len)
 {
bool cross_page_boundary = offset_in_page(addr) + len > PAGE_SIZE;
-- 
2.26.1



[PATCH v4 07/11] powerpc/64s: Make slb_allocate_user() non-static

2021-04-29 Thread Christopher M. Riedl
With Book3s64 Hash translation, manually inserting a PTE requires
updating the Linux PTE, inserting a SLB entry, and inserting the hashed
page. The first is handled via the usual kernel abstractions, the second
requires slb_allocate_user() which is currently 'static', and the third
is available via hash_page_mm() already.

Make slb_allocate_user() non-static and add a prototype so the next
patch can use it during code-patching.

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu-hash.h | 1 +
 arch/powerpc/mm/book3s64/slb.c| 4 +---
 2 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu-hash.h 
b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
index 3004f3323144d..189854eebba77 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu-hash.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu-hash.h
@@ -525,6 +525,7 @@ void slb_dump_contents(struct slb_entry *slb_ptr);
 extern void slb_vmalloc_update(void);
 extern void slb_set_size(u16 size);
 void preload_new_slb_context(unsigned long start, unsigned long sp);
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index da0836cb855af..532eb51bc5211 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -29,8 +29,6 @@
 #include "internal.h"
 
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea);
-
 bool stress_slb_enabled __initdata;
 
 static int __init parse_stress_slb(char *p)
@@ -791,7 +789,7 @@ static long slb_allocate_kernel(unsigned long ea, unsigned 
long id)
return slb_insert_entry(ea, context, flags, ssize, true);
 }
 
-static long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
+long slb_allocate_user(struct mm_struct *mm, unsigned long ea)
 {
unsigned long context;
unsigned long flags;
-- 
2.26.1



[PATCH v4 05/11] powerpc/64s: Add ability to skip SLB preload

2021-04-29 Thread Christopher M. Riedl
Switching to a different mm with Hash translation causes SLB entries to
be preloaded from the current thread_info. This reduces SLB faults, for
example when threads share a common mm but operate on different address
ranges.

Preloading entries from the thread_info struct may not always be
appropriate - such as when switching to a temporary mm. Introduce a new
boolean in mm_context_t to skip the SLB preload entirely. Also move the
SLB preload code into a separate function since switch_slb() is already
quite long. The default behavior (preloading SLB entries from the
current thread_info struct) remains unchanged.

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/book3s/64/mmu.h |  3 ++
 arch/powerpc/include/asm/mmu_context.h   | 13 ++
 arch/powerpc/mm/book3s64/mmu_context.c   |  2 +
 arch/powerpc/mm/book3s64/slb.c   | 56 ++--
 4 files changed, 50 insertions(+), 24 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/mmu.h 
b/arch/powerpc/include/asm/book3s/64/mmu.h
index eace8c3f7b0a1..b23a9dcdee5af 100644
--- a/arch/powerpc/include/asm/book3s/64/mmu.h
+++ b/arch/powerpc/include/asm/book3s/64/mmu.h
@@ -130,6 +130,9 @@ typedef struct {
u32 pkey_allocation_map;
s16 execute_only_pkey; /* key holding execute-only protection */
 #endif
+
+   /* Do not preload SLB entries from thread_info during switch_slb() */
+   bool skip_slb_preload;
 } mm_context_t;
 
 static inline u16 mm_ctx_user_psize(mm_context_t *ctx)
diff --git a/arch/powerpc/include/asm/mmu_context.h 
b/arch/powerpc/include/asm/mmu_context.h
index 4bc45d3ed8b0e..264787e90b1a1 100644
--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -298,6 +298,19 @@ static inline int arch_dup_mmap(struct mm_struct *oldmm,
return 0;
 }
 
+#ifdef CONFIG_PPC_BOOK3S_64
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm)
+{
+   mm->context.skip_slb_preload = true;
+}
+
+#else
+
+static inline void skip_slb_preload_mm(struct mm_struct *mm) {}
+
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 #include 
 
 #endif /* __KERNEL__ */
diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index c10fc8a72fb37..3479910264c59 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -202,6 +202,8 @@ int init_new_context(struct task_struct *tsk, struct 
mm_struct *mm)
atomic_set(&mm->context.active_cpus, 0);
atomic_set(&mm->context.copros, 0);
 
+   mm->context.skip_slb_preload = false;
+
return 0;
 }
 
diff --git a/arch/powerpc/mm/book3s64/slb.c b/arch/powerpc/mm/book3s64/slb.c
index c91bd85eb90e3..da0836cb855af 100644
--- a/arch/powerpc/mm/book3s64/slb.c
+++ b/arch/powerpc/mm/book3s64/slb.c
@@ -441,10 +441,39 @@ static void slb_cache_slbie_user(unsigned int index)
asm volatile("slbie %0" : : "r" (slbie_data));
 }
 
+static void preload_slb_entries(struct task_struct *tsk, struct mm_struct *mm)
+{
+   struct thread_info *ti = task_thread_info(tsk);
+   unsigned char i;
+
+   /*
+* We gradually age out SLBs after a number of context switches to
+* reduce reload overhead of unused entries (like we do with FP/VEC
+* reload). Each time we wrap 256 switches, take an entry out of the
+* SLB preload cache.
+*/
+   tsk->thread.load_slb++;
+   if (!tsk->thread.load_slb) {
+   unsigned long pc = KSTK_EIP(tsk);
+
+   preload_age(ti);
+   preload_add(ti, pc);
+   }
+
+   for (i = 0; i < ti->slb_preload_nr; i++) {
+   unsigned char idx;
+   unsigned long ea;
+
+   idx = (ti->slb_preload_tail + i) % SLB_PRELOAD_NR;
+   ea = (unsigned long)ti->slb_preload_esid[idx] << SID_SHIFT;
+
+   slb_allocate_user(mm, ea);
+   }
+}
+
 /* Flush all user entries from the segment table of the current processor. */
 void switch_slb(struct task_struct *tsk, struct mm_struct *mm)
 {
-   struct thread_info *ti = task_thread_info(tsk);
unsigned char i;
 
/*
@@ -502,29 +531,8 @@ void switch_slb(struct task_struct *tsk, struct mm_struct 
*mm)
 
copy_mm_to_paca(mm);
 
-   /*
-* We gradually age out SLBs after a number of context switches to
-* reduce reload overhead of unused entries (like we do with FP/VEC
-* reload). Each time we wrap 256 switches, take an entry out of the
-* SLB preload cache.
-*/
-   tsk->thread.load_slb++;
-   if (!tsk->thread.load_slb) {
-   unsigned long pc = KSTK_EIP(tsk);
-
-   preload_age(ti);
-   preload_add(ti, pc);
-   }
-
-   for (i = 0; i < ti->slb_preload_nr; i++) {
-   unsigned char idx;
-   unsigned long ea;
-
-

[PATCH v4 02/11] lkdtm/powerpc: Add test to hijack a patch mapping

2021-04-29 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX the CPU doing the patching
must temporarily remap the page(s) containing the patch site with +W
permissions. While this temporary mapping is in use, another CPU could
write to the same mapping and maliciously alter kernel text. Implement a
LKDTM test to attempt to exploit such an opening during code patching.
The test is implemented on powerpc and requires LKDTM built into the
kernel (building LKDTM as a module is insufficient).

The LKDTM "hijack" test works as follows:

  1. A CPU executes an infinite loop to patch an instruction. This is
 the "patching" CPU.
  2. Another CPU attempts to write to the address of the temporary
 mapping used by the "patching" CPU. This other CPU is the
 "hijacker" CPU. The hijack either fails with a fault/error or
 succeeds, in which case some kernel text is now overwritten.

The virtual address of the temporary patch mapping is provided via an
LKDTM-specific accessor to the hijacker CPU. This test assumes a
hypothetical situation where this address was leaked previously.

How to run the test:

mount -t debugfs none /sys/kernel/debug
(echo HIJACK_PATCH > /sys/kernel/debug/provoke-crash/DIRECT)

A passing test indicates that it is not possible to overwrite kernel
text from another CPU by using the temporary mapping established by
a CPU for patching.
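
In rough outline, the "hijacker" side of the test boils down to something
like the following sketch. This is not the exact test code (see the diff
below); patching_cpu is assumed to be the CPU id of the patching thread,
and read_cpu_patching_addr() comes from an earlier patch in this series:

/* Sketch: race the patching CPU for its temporary +W mapping */
u32 bad_data = 0xbad00bad;      /* illustrative value */
u32 *patch_addr = (u32 *)read_cpu_patching_addr(patching_cpu);
unsigned long tries = 100000;   /* arbitrary bound for the sketch */
bool hijacked = false;

while (tries-- && !hijacked)
        hijacked = lkdtm_try_write(bad_data, patch_addr);

if (hijacked)
        pr_err("FAIL: overwrote kernel text via another CPU's patch mapping\n");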

Signed-off-by: Christopher M. Riedl 

---

v4:  * Separate the powerpc and x86_64 bits into individual patches.
 * Use __put_kernel_nofault() when attempting to hijack the mapping
 * Use raw_smp_processor_id() to avoid triggering the BUG() when
   calling smp_processor_id() in preemptible code - the only thing
   that matters is that one of the threads is bound to a different
   CPU - we are not using smp_processor_id() to access any per-cpu
   data or similar where preemption should be disabled.
 * Rework the patching_cpu() kthread stop condition to avoid:
   https://lwn.net/Articles/628628/
---
 drivers/misc/lkdtm/core.c  |   1 +
 drivers/misc/lkdtm/lkdtm.h |   1 +
 drivers/misc/lkdtm/perms.c | 135 +
 3 files changed, 137 insertions(+)

diff --git a/drivers/misc/lkdtm/core.c b/drivers/misc/lkdtm/core.c
index b2aff4d87c014..857d218840eb8 100644
--- a/drivers/misc/lkdtm/core.c
+++ b/drivers/misc/lkdtm/core.c
@@ -146,6 +146,7 @@ static const struct crashtype crashtypes[] = {
CRASHTYPE(WRITE_RO),
CRASHTYPE(WRITE_RO_AFTER_INIT),
CRASHTYPE(WRITE_KERN),
+   CRASHTYPE(HIJACK_PATCH),
CRASHTYPE(REFCOUNT_INC_OVERFLOW),
CRASHTYPE(REFCOUNT_ADD_OVERFLOW),
CRASHTYPE(REFCOUNT_INC_NOT_ZERO_OVERFLOW),
diff --git a/drivers/misc/lkdtm/lkdtm.h b/drivers/misc/lkdtm/lkdtm.h
index 5ae48c64df24d..c8de54d189c27 100644
--- a/drivers/misc/lkdtm/lkdtm.h
+++ b/drivers/misc/lkdtm/lkdtm.h
@@ -61,6 +61,7 @@ void lkdtm_EXEC_USERSPACE(void);
 void lkdtm_EXEC_NULL(void);
 void lkdtm_ACCESS_USERSPACE(void);
 void lkdtm_ACCESS_NULL(void);
+void lkdtm_HIJACK_PATCH(void);
 
 /* refcount.c */
 void lkdtm_REFCOUNT_INC_OVERFLOW(void);
diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 2dede2ef658f3..c6f96ebffccfd 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -9,6 +9,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 /* Whether or not to fill the target memory area with do_nothing(). */
@@ -222,6 +223,140 @@ void lkdtm_ACCESS_NULL(void)
pr_err("FAIL: survived bad write\n");
 }
 
+#if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
+   defined(CONFIG_PPC))
+/*
+ * This is just a dummy location to patch-over.
+ */
+static void patching_target(void)
+{
+   return;
+}
+
+#include 
+struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+
+static inline int lkdtm_do_patch(u32 data)
+{
+   return patch_instruction(patch_site, ppc_inst(data));
+}
+
+static inline u32 lkdtm_read_patch_site(void)
+{
+   struct ppc_inst inst = READ_ONCE(*patch_site);
+   return ppc_inst_val(ppc_inst_read(&inst));
+}
+
+/* Returns True if the write succeeds */
+static inline bool lkdtm_try_write(u32 data, u32 *addr)
+{
+   __put_kernel_nofault(addr, &data, u32, err);
+   return true;
+
+err:
+   return false;
+}
+
+static int lkdtm_patching_cpu(void *data)
+{
+   int err = 0;
+   u32 val = 0xdeadbeef;
+
+   pr_info("starting patching_cpu=%d\n", raw_smp_processor_id());
+
+   do {
+   err = lkdtm_do_patch(val);
+   } while (lkdtm_read_patch_site() == val && !err && 
!kthread_should_stop());
+
+   if (err)
+   pr_warn("XFAIL: patch_instruction returned error: %d\n", err);
+
+   while (!kthread_should_stop()) {
+   set_current_state(TASK_INTERRUPTIBLE);
+   schedule

[PATCH v4 11/11] powerpc: Use patch_instruction_unlocked() in loops

2021-04-29 Thread Christopher M. Riedl
Now that patching requires a lock to prevent concurrent access to
patching_mm, every call to patch_instruction() acquires and releases a
spinlock. There are several places where patch_instruction() is called
in a loop. Convert these to acquire the lock once before the loop, call
patch_instruction_unlocked() in the loop body, and then release the lock
again after the loop terminates - as in:

for (i = 0; i < n; ++i)
patch_instruction(...); <-- lock/unlock every iteration

changes to:

flags = lock_patching(); <-- lock once

for (i = 0; i < n; ++i)
patch_instruction_unlocked(...);

unlock_patching(flags); <-- unlock once

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/kernel/epapr_paravirt.c |   9 ++-
 arch/powerpc/kernel/optprobes.c  |  22 --
 arch/powerpc/lib/feature-fixups.c| 114 +++
 arch/powerpc/xmon/xmon.c |  22 --
 4 files changed, 120 insertions(+), 47 deletions(-)

diff --git a/arch/powerpc/kernel/epapr_paravirt.c 
b/arch/powerpc/kernel/epapr_paravirt.c
index 2ed14d4a47f59..b639e71cf9dec 100644
--- a/arch/powerpc/kernel/epapr_paravirt.c
+++ b/arch/powerpc/kernel/epapr_paravirt.c
@@ -28,6 +28,7 @@ static int __init early_init_dt_scan_epapr(unsigned long node,
const u32 *insts;
int len;
int i;
+   unsigned long flags;
 
insts = of_get_flat_dt_prop(node, "hcall-instructions", &len);
if (!insts)
@@ -36,14 +37,18 @@ static int __init early_init_dt_scan_epapr(unsigned long 
node,
if (len % 4 || len > (4 * 4))
return -1;
 
+   flags = lock_patching();
+
for (i = 0; i < (len / 4); i++) {
struct ppc_inst inst = ppc_inst(be32_to_cpu(insts[i]));
-   patch_instruction((struct ppc_inst *)(epapr_hypercall_start + 
i), inst);
+   patch_instruction_unlocked((struct ppc_inst 
*)(epapr_hypercall_start + i), inst);
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
-   patch_instruction((struct ppc_inst *)(epapr_ev_idle_start + i), 
inst);
+   patch_instruction_unlocked((struct ppc_inst 
*)(epapr_ev_idle_start + i), inst);
 #endif
}
 
+   unlock_patching(flags);
+
 #if !defined(CONFIG_64BIT) || defined(CONFIG_PPC_BOOK3E_64)
if (of_get_flat_dt_prop(node, "has-idle", NULL))
epapr_has_idle = true;
diff --git a/arch/powerpc/kernel/optprobes.c b/arch/powerpc/kernel/optprobes.c
index cdf87086fa33a..deaeb6e8d1a00 100644
--- a/arch/powerpc/kernel/optprobes.c
+++ b/arch/powerpc/kernel/optprobes.c
@@ -200,7 +200,7 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
struct ppc_inst branch_op_callback, branch_emulate_step, temp;
kprobe_opcode_t *op_callback_addr, *emulate_step_addr, *buff;
long b_offset;
-   unsigned long nip, size;
+   unsigned long nip, size, flags;
int rc, i;
 
kprobe_ppc_optinsn_slots.insn_size = MAX_OPTINSN_SIZE;
@@ -237,13 +237,20 @@ int arch_prepare_optimized_kprobe(struct optimized_kprobe 
*op, struct kprobe *p)
/* We can optimize this via patch_instruction_window later */
size = (TMPL_END_IDX * sizeof(kprobe_opcode_t)) / sizeof(int);
pr_devel("Copying template to %p, size %lu\n", buff, size);
+
+   flags = lock_patching();
+
for (i = 0; i < size; i++) {
-   rc = patch_instruction((struct ppc_inst *)(buff + i),
-  ppc_inst(*(optprobe_template_entry + 
i)));
-   if (rc < 0)
+   rc = patch_instruction_unlocked((struct ppc_inst *)(buff + i),
+   
ppc_inst(*(optprobe_template_entry + i)));
+   if (rc < 0) {
+   unlock_patching(flags);
goto error;
+   }
}
 
+   unlock_patching(flags);
+
/*
 * Fixup the template with instructions to:
 * 1. load the address of the actual probepoint
@@ -322,6 +329,9 @@ void arch_optimize_kprobes(struct list_head *oplist)
struct ppc_inst instr;
struct optimized_kprobe *op;
struct optimized_kprobe *tmp;
+   unsigned long flags;
+
+   flags = lock_patching();
 
list_for_each_entry_safe(op, tmp, oplist, list) {
/*
@@ -333,9 +343,11 @@ void arch_optimize_kprobes(struct list_head *oplist)
create_branch(&instr,
  (struct ppc_inst *)op->kp.addr,
  (unsigned long)op->optinsn.insn, 0);
-   patch_instruction((struct ppc_inst *)op->kp.addr, instr);
+   patch_instruction_unlocked((struct ppc_inst *)op->kp.addr, 
instr);
list_del_init(&op->list);
}
+
+   unlock_pat

[PATCH v4 00/11] Use per-CPU temporary mappings for patching

2021-04-29 Thread Christopher M. Riedl
When compiled with CONFIG_STRICT_KERNEL_RWX, the kernel must create
temporary mappings when patching itself. These mappings temporarily
override the strict RWX text protections to permit a write. Currently,
powerpc allocates a per-CPU VM area for patching. Patching occurs as
follows:

1. Map page in per-CPU VM area w/ PAGE_KERNEL protection
2. Patch text
3. Remove the temporary mapping

While the VM area is per-CPU, the mapping is actually inserted into the
kernel page tables. Presumably, this could allow another CPU to access
the normally write-protected text - either malicously or accidentally -
via this same mapping if the address of the VM area is known. Ideally,
the mapping should be kept local to the CPU doing the patching [0].

x86 introduced "temporary mm" structs which allow the creation of
mappings local to a particular CPU [1]. This series intends to bring the
notion of a temporary mm to powerpc and harden powerpc by using such a
mapping for patching a kernel with strict RWX permissions.
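
The core patching sequence with a temporary mm then looks roughly like the
following pseudo-C (PTE setup, TLB flushing, and hardware breakpoint
handling are elided here; see the later patches for the real code):

local_irq_save(flags);

map_patch(addr, &patch_mapping);          /* insert PTE into patching_mm */
use_temporary_mm(&patch_mapping.temp_mm); /* switch to patching_mm */

__patch_instruction(addr, instr, patch_addr);

unuse_temporary_mm(&patch_mapping.temp_mm);
/* ... tear down the PTE and flush the TLB for patching_addr ... */

local_irq_restore(flags);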

The first four patches implement an LKDTM test "proof-of-concept" which
exploits the potential vulnerability (ie. the temporary mapping during
patching is exposed in the kernel page tables and accessible by other
CPUs) using a simple brute-force approach. This test is implemented for
both powerpc and x86_64. The test passes on powerpc with this new
series, fails on upstream powerpc, passes on upstream x86_64, and fails
on an older (ancient) x86_64 tree without the x86_64 temporary mm
patches. The remaining patches add support for and use a temporary mm
for code patching on powerpc.

Tested boot, ftrace, and repeated LKDTM "hijack":
- QEMU+KVM (host: POWER9 Blackbird): Radix MMU w/ KUAP
- QEMU+KVM (host: POWER9 Blackbird): Hash MMU w/o KUAP
- QEMU+KVM (host: POWER9 Blackbird): Hash MMU w/ KUAP

Tested repeated LKDTM "hijack":
- QEMU+KVM (host: AMD desktop): x86_64 upstream
- QEMU+KVM (host: AMD desktop): x86_64 w/o percpu temp mm to
  verify the LKDTM "hijack" fails

Tested boot and ftrace:
- QEMU+TCG: ppc44x (bamboo)
- QEMU+TCG: g5 (mac99)

I also tested with various extra config options enabled as suggested in
section 12) in Documentation/process/submit-checklist.rst.

v4: * It's time to revisit this series again since @jpn and @mpe fixed
  our known STRICT_*_RWX bugs on powerpc/64s.
* Rebase on linuxppc/next:
  commit ee1bc694fbaec ("powerpc/kvm: Fix build error when 
PPC_MEM_KEYS/PPC_PSERIES=n")
* Completely rework how map_patch() works on book3s64 Hash MMU
* Split the LKDTM x86_64 and powerpc bits into separate patches
* Annotate commit messages with changes from v3 instead of
  listing them here completely out-of context...

v3: * Rebase on linuxppc/next: commit 9123e3a74ec7 ("Linux 5.9-rc1")
* Move temporary mm implementation into code-patching.c where it
  belongs
* Implement LKDTM hijacker test on x86_64 (on IBM time oof)
* Do not use address zero for the patching address in the
  temporary mm (thanks @dja for pointing this out!)
* Wrap the LKDTM test w/ CONFIG_SMP as suggested by Christophe
  Leroy
* Comments to clarify PTE pre-allocation and patching addr
  selection

v2: * Rebase on linuxppc/next:
  commit 105fb38124a4 ("powerpc/8xx: Modify ptep_get()")
* Always dirty pte when mapping patch
* Use `ppc_inst_len` instead of `sizeof` on instructions
* Declare LKDTM patching addr accessor in header where it belongs   

v1: * Rebase on linuxppc/next (4336b9337824)
* Save and restore second hw watchpoint
* Use new ppc_inst_* functions for patching check and in LKDTM test

rfc-v2: * Many fixes and improvements mostly based on extensive feedback
  and testing by Christophe Leroy (thanks!).
* Make patching_mm and patching_addr static and move
  '__ro_after_init' to after the variable name (more common in
  other parts of the kernel)
* Use 'asm/debug.h' header instead of 'asm/hw_breakpoint.h' to
  fix PPC64e compile
* Add comment explaining why we use BUG_ON() during the init
  call to setup for patching later
* Move ptep into patch_mapping to avoid walking page tables a
  second time when unmapping the temporary mapping
* Use KUAP under non-radix, also manually dirty the PTE for patch
  mapping on non-BOOK3S_64 platforms
* Properly return any error from __patch_instruction
* Do not use 'memcmp' where a simple comparison is appropriate
* Simplify expression for patch address by removing pointer maths
* Add LKDTM test

[0]: https://github.com/linuxppc/issues/issues/224
[1]: 
htt

[PATCH v4 10/11] powerpc: Protect patching_mm with a lock

2021-04-29 Thread Christopher M. Riedl
Powerpc allows for multiple CPUs to patch concurrently. When patching
with STRICT_KERNEL_RWX a single patching_mm is allocated for use by all
CPUs for the few times that patching occurs. Use a spinlock to protect
the patching_mm from concurrent use.

Modify patch_instruction() to acquire the lock, perform the patch op,
and then release the lock.

Also introduce {lock,unlock}_patching() along with
patch_instruction_unlocked() to avoid per-iteration lock overhead when
patch_instruction() is called in a loop. A follow-up patch converts some
uses of patch_instruction() to use patch_instruction_unlocked() instead.
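
Conceptually the new interface reduces to the sketch below (the real
implementation in the diff also has to handle the early-boot case where
patching_mm is not yet set up):

unsigned long lock_patching(void)
{
        unsigned long flags;

        spin_lock_irqsave(&patching_lock, flags);
        return flags;
}

void unlock_patching(unsigned long flags)
{
        spin_unlock_irqrestore(&patching_lock, flags);
}

int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr)
{
        unsigned long flags = lock_patching();
        int err = patch_instruction_unlocked(addr, instr);

        unlock_patching(flags);
        return err;
}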

Signed-off-by: Christopher M. Riedl 

---

v4:  * New to series.
---
 arch/powerpc/include/asm/code-patching.h |  4 ++
 arch/powerpc/lib/code-patching.c | 85 +---
 2 files changed, 79 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index e51c81e4a9bda..2efa11b68cd8f 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -28,8 +28,12 @@ int create_branch(struct ppc_inst *instr, const struct 
ppc_inst *addr,
 int create_cond_branch(struct ppc_inst *instr, const struct ppc_inst *addr,
   unsigned long target, int flags);
 int patch_branch(struct ppc_inst *addr, unsigned long target, int flags);
+int patch_branch_unlocked(struct ppc_inst *addr, unsigned long target, int 
flags);
 int patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+int patch_instruction_unlocked(struct ppc_inst *addr, struct ppc_inst instr);
 int raw_patch_instruction(struct ppc_inst *addr, struct ppc_inst instr);
+unsigned long lock_patching(void);
+void unlock_patching(unsigned long flags);
 
 static inline unsigned long patch_site_addr(s32 *site)
 {
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 7e15abc09ec04..0a496bb52bbf4 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -52,13 +52,17 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 
 #ifdef CONFIG_STRICT_KERNEL_RWX
 
+static DEFINE_SPINLOCK(patching_lock);
+
 struct temp_mm {
struct mm_struct *temp;
struct mm_struct *prev;
struct arch_hw_breakpoint brk[HBP_NUM_MAX];
+   spinlock_t *lock; /* protect access to the temporary mm */
 };
 
-static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm)
+static inline void init_temp_mm(struct temp_mm *temp_mm, struct mm_struct *mm,
+   spinlock_t *lock)
 {
/* Do not preload SLB entries from the thread_info struct */
if (IS_ENABLED(CONFIG_PPC_BOOK3S_64) && !radix_enabled())
@@ -66,12 +70,14 @@ static inline void init_temp_mm(struct temp_mm *temp_mm, 
struct mm_struct *mm)
 
temp_mm->temp = mm;
temp_mm->prev = NULL;
+   temp_mm->lock = lock;
memset(&temp_mm->brk, 0, sizeof(temp_mm->brk));
 }
 
 static inline void use_temporary_mm(struct temp_mm *temp_mm)
 {
lockdep_assert_irqs_disabled();
+   lockdep_assert_held(temp_mm->lock);
 
temp_mm->prev = current->active_mm;
switch_mm_irqs_off(temp_mm->prev, temp_mm->temp, current);
@@ -93,11 +99,13 @@ static inline void use_temporary_mm(struct temp_mm *temp_mm)
 static inline void unuse_temporary_mm(struct temp_mm *temp_mm)
 {
lockdep_assert_irqs_disabled();
+   lockdep_assert_held(temp_mm->lock);
 
switch_mm_irqs_off(temp_mm->temp, temp_mm->prev, current);
 
/*
-* On book3s64 the active_cpus counter increments in
+* The temporary mm can only be in use on a single CPU at a time due to
+* the temp_mm->lock. On book3s64 the active_cpus counter increments in
 * switch_mm_irqs_off(). With the Hash MMU this counter affects if TLB
 * flushes are local. We have to manually decrement that counter here
 * along with removing our current CPU from the mm's cpumask so that in
@@ -230,7 +238,7 @@ static int map_patch(const void *addr, struct patch_mapping 
*patch_mapping)
pte = pte_mkdirty(pte);
set_pte_at(patching_mm, patching_addr, patch_mapping->ptep, pte);
 
-   init_temp_mm(&patch_mapping->temp_mm, patching_mm);
+   init_temp_mm(&patch_mapping->temp_mm, patching_mm, &patching_lock);
use_temporary_mm(&patch_mapping->temp_mm);
 
/*
@@ -258,7 +266,6 @@ static int do_patch_instruction(struct ppc_inst *addr, 
struct ppc_inst instr)
 {
int err;
struct ppc_inst *patch_addr = NULL;
-   unsigned long flags;
struct patch_mapping patch_mapping;
 
/*
@@ -269,11 +276,12 @@ static int do_patch_instruction(struct ppc_inst *addr, 
struct ppc_inst instr)
if (!patching_mm)
return raw_patch_instruction(ad

[PATCH v4 04/11] lkdtm/x86_64: Add test to hijack a patch mapping

2021-04-29 Thread Christopher M. Riedl
A previous commit implemented an LKDTM test on powerpc to exploit the
temporary mapping established when patching code with STRICT_KERNEL_RWX
enabled. Extend the test to work on x86_64 as well.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 29 ++---
 1 file changed, 26 insertions(+), 3 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index c6f96ebffccfd..55c3bec6d3b72 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -224,7 +224,7 @@ void lkdtm_ACCESS_NULL(void)
 }
 
 #if (IS_BUILTIN(CONFIG_LKDTM) && defined(CONFIG_STRICT_KERNEL_RWX) && \
-   defined(CONFIG_PPC))
+   (defined(CONFIG_PPC) || defined(CONFIG_X86_64)))
 /*
  * This is just a dummy location to patch-over.
  */
@@ -233,28 +233,51 @@ static void patching_target(void)
return;
 }
 
+#ifdef CONFIG_PPC
 #include 
 struct ppc_inst * const patch_site = (struct ppc_inst *)&patching_target;
+#endif
+
+#ifdef CONFIG_X86_64
+#include 
+u32 * const patch_site = (u32 *)&patching_target;
+#endif
 
 static inline int lkdtm_do_patch(u32 data)
 {
+#ifdef CONFIG_PPC
return patch_instruction(patch_site, ppc_inst(data));
+#endif
+#ifdef CONFIG_X86_64
+   text_poke(patch_site, &data, sizeof(u32));
+   return 0;
+#endif
 }
 
 static inline u32 lkdtm_read_patch_site(void)
 {
+#ifdef CONFIG_PPC
struct ppc_inst inst = READ_ONCE(*patch_site);
return ppc_inst_val(ppc_inst_read(&inst));
+#endif
+#ifdef CONFIG_X86_64
+   return READ_ONCE(*patch_site);
+#endif
 }
 
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
+#ifdef CONFIG_PPC
__put_kernel_nofault(addr, &data, u32, err);
return true;
 
 err:
return false;
+#endif
+#ifdef CONFIG_X86_64
+   return !__put_user(data, addr);
+#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
@@ -347,8 +370,8 @@ void lkdtm_HIJACK_PATCH(void)
 
 void lkdtm_HIJACK_PATCH(void)
 {
-   if (!IS_ENABLED(CONFIG_PPC))
-   pr_err("XFAIL: this test only runs on powerpc\n");
+   if (!IS_ENABLED(CONFIG_PPC) && !IS_ENABLED(CONFIG_X86_64))
+   pr_err("XFAIL: this test only runs on powerpc and x86_64\n");
if (!IS_ENABLED(CONFIG_STRICT_KERNEL_RWX))
pr_err("XFAIL: this test requires CONFIG_STRICT_KERNEL_RWX\n");
if (!IS_BUILTIN(CONFIG_LKDTM))
-- 
2.26.1



[PATCH v4 01/11] powerpc: Add LKDTM accessor for patching addr

2021-04-29 Thread Christopher M. Riedl
When live patching with STRICT_KERNEL_RWX a mapping is installed at a
"patching address" with temporary write permissions. Provide a
LKDTM-only accessor function for this address in preparation for a LKDTM
test which attempts to "hijack" this mapping by writing to it from
another CPU.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/code-patching.h | 4 
 arch/powerpc/lib/code-patching.c | 7 +++
 2 files changed, 11 insertions(+)

diff --git a/arch/powerpc/include/asm/code-patching.h 
b/arch/powerpc/include/asm/code-patching.h
index f1d029bf906e5..e51c81e4a9bda 100644
--- a/arch/powerpc/include/asm/code-patching.h
+++ b/arch/powerpc/include/asm/code-patching.h
@@ -188,4 +188,8 @@ static inline unsigned long ppc_kallsyms_lookup_name(const 
char *name)
 ___PPC_RA(__REG_R1) | PPC_LR_STKOFF)
 #endif /* CONFIG_PPC64 */
 
+#if IS_BUILTIN(CONFIG_LKDTM) && IS_ENABLED(CONFIG_STRICT_KERNEL_RWX)
+unsigned long read_cpu_patching_addr(unsigned int cpu);
+#endif
+
 #endif /* _ASM_POWERPC_CODE_PATCHING_H */
diff --git a/arch/powerpc/lib/code-patching.c b/arch/powerpc/lib/code-patching.c
index 870b30d9be2f8..2b1b3e9043ade 100644
--- a/arch/powerpc/lib/code-patching.c
+++ b/arch/powerpc/lib/code-patching.c
@@ -48,6 +48,13 @@ int raw_patch_instruction(struct ppc_inst *addr, struct 
ppc_inst instr)
 #ifdef CONFIG_STRICT_KERNEL_RWX
 static DEFINE_PER_CPU(struct vm_struct *, text_poke_area);
 
+#if IS_BUILTIN(CONFIG_LKDTM)
+unsigned long read_cpu_patching_addr(unsigned int cpu)
+{
+   return (unsigned long)(per_cpu(text_poke_area, cpu))->addr;
+}
+#endif
+
 static int text_area_cpu_up(unsigned int cpu)
 {
struct vm_struct *area;
-- 
2.26.1



[PATCH v4 09/11] lkdtm/powerpc: Fix code patching hijack test

2021-04-29 Thread Christopher M. Riedl
Code patching on powerpc with STRICT_KERNEL_RWX now uses a userspace
address in a temporary mm. Use __put_user() to avoid write failures
due to KUAP when attempting a "hijack" of the patching address.

Signed-off-by: Christopher M. Riedl 
---
 drivers/misc/lkdtm/perms.c | 9 -
 1 file changed, 9 deletions(-)

diff --git a/drivers/misc/lkdtm/perms.c b/drivers/misc/lkdtm/perms.c
index 55c3bec6d3b72..af9bf285fe326 100644
--- a/drivers/misc/lkdtm/perms.c
+++ b/drivers/misc/lkdtm/perms.c
@@ -268,16 +268,7 @@ static inline u32 lkdtm_read_patch_site(void)
 /* Returns True if the write succeeds */
 static inline bool lkdtm_try_write(u32 data, u32 *addr)
 {
-#ifdef CONFIG_PPC
-   __put_kernel_nofault(addr, &data, u32, err);
-   return true;
-
-err:
-   return false;
-#endif
-#ifdef CONFIG_X86_64
return !__put_user(data, addr);
-#endif
 }
 
 static int lkdtm_patching_cpu(void *data)
-- 
2.26.1



[PATCH v7 10/10] powerpc/signal: Use __get_user() to copy sigset_t

2021-02-26 Thread Christopher M. Riedl
Usually sigset_t is exactly 8B which is a "trivial" size and does not
warrant using __copy_from_user(). Use __get_user() directly in
anticipation of future work to remove the trivial size optimizations
from __copy_from_user().

The ppc32 implementation of get_sigset_t() previously called
copy_from_user() which, unlike __copy_from_user(), calls access_ok().
Replacing this w/ __get_user() (no access_ok()) is fine here since both
callsites in signal_32.c are preceded by an earlier access_ok().

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal.h| 7 +++
 arch/powerpc/kernel/signal_32.c | 2 +-
 arch/powerpc/kernel/signal_64.c | 4 ++--
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index d8dd76b1dc94..1393876f3814 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -19,6 +19,13 @@ extern int handle_signal32(struct ksignal *ksig, sigset_t 
*oldset,
 extern int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
  struct task_struct *tsk);
 
+static inline int __get_user_sigset(sigset_t *dst, const sigset_t __user *src)
+{
+   BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));
+
+   return __get_user(dst->sig[0], (u64 __user *)&src->sig[0]);
+}
+
 #ifdef CONFIG_VSX
 extern unsigned long copy_vsx_to_user(void __user *to,
  struct task_struct *task);
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 75ee918a120a..c505b444a613 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -144,7 +144,7 @@ static inline int restore_general_regs(struct pt_regs *regs,
 
 static inline int get_sigset_t(sigset_t *set, const sigset_t __user *uset)
 {
-   return copy_from_user(set, uset, sizeof(*uset));
+   return __get_user_sigset(set, uset);
 }
 
 #define to_user_ptr(p) ((unsigned long)(p))
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 00c907022707..debe88055f38 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -708,7 +708,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * We kill the task with a SIGSEGV in this situation.
 */
 
-   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
+   if (__get_user_sigset(&set, &new_ctx->uc_sigmask))
do_exit(SIGSEGV);
set_current_blocked(&set);
 
@@ -747,7 +747,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!access_ok(uc, sizeof(*uc)))
goto badframe;
 
-   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
+   if (__get_user_sigset(&set, &uc->uc_sigmask))
goto badframe;
set_current_blocked(&set);
 
-- 
2.26.1



[PATCH v7 06/10] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2021-02-26 Thread Christopher M. Riedl
Previously setup_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_sigcontext() to assume that a userspace write access window
is open by replacing all uaccess functions with their 'unsafe' versions.
Modify the callers to first open, call unsafe_setup_sigcontext() and
then close the uaccess window.
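
The resulting caller pattern is roughly the following sketch (the real
call sites in handle_rt_signal64() and the swapcontext path are updated in
later patches):

if (!user_write_access_begin(&frame->uc.uc_mcontext,
                             sizeof(frame->uc.uc_mcontext)))
        return -EFAULT;

unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig, NULL,
                        (unsigned long)ksig->ka.sa.sa_handler, 1,
                        badframe_block);

user_write_access_end();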

Signed-off-by: Christopher M. Riedl 
---
v7: * Don't use unsafe_op_wrap() since Christophe indicates this
  macro may go away in the future.
---
 arch/powerpc/kernel/signal_64.c | 72 -
 1 file changed, 45 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index bd8d210c9115..78ae4bb4e590 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -101,9 +101,14 @@ static void prepare_setup_sigcontext(struct task_struct 
*tsk)
  * Set up the sigcontext for the signal frame.
  */
 
-static long setup_sigcontext(struct sigcontext __user *sc,
-   struct task_struct *tsk, int signr, sigset_t *set,
-   unsigned long handler, int ctx_has_vsx_region)
+#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler, 
ctx_has_vsx_region, label)\
+do {   
\
+   if (__unsafe_setup_sigcontext(sc, tsk, signr, set, handler, 
ctx_has_vsx_region))\
+   goto label; 
\
+} while (0)
+static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
+   struct task_struct *tsk, int signr, 
sigset_t *set,
+   unsigned long handler, int 
ctx_has_vsx_region)
 {
/* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
 * process never used altivec yet (MSR_VEC is zero in pt_regs of
@@ -118,20 +123,19 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
-   long err = 0;
/* Force usr to alway see softe as 1 (interrupts enabled) */
unsigned long softe = 0x1;
 
BUG_ON(tsk != current);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __put_user(v_regs, &sc->v_regs);
+   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
 
/* save altivec registers */
if (tsk->thread.used_vr) {
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
-   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
- 33 * sizeof(vector128));
+   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
+   33 * sizeof(vector128), efault_out);
/* set MSR_VEC in the MSR value in the frame to indicate that 
sc->v_reg)
 * contains valid data.
 */
@@ -140,12 +144,12 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
+   unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], 
efault_out);
 #else /* CONFIG_ALTIVEC */
-   err |= __put_user(0, &sc->v_regs);
+   unsafe_put_user(0, &sc->v_regs, efault_out);
 #endif /* CONFIG_ALTIVEC */
/* copy fpr regs and fpscr */
-   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
+   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
 
/*
 * Clear the MSR VSX bit to indicate there is no valid state attached
@@ -160,24 +164,27 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
v_regs += ELF_NVRREG;
-   err |= copy_vsx_to_user(v_regs, tsk);
+   unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);
/* set MSR_VSX in the MSR value in the frame to
 * indicate that sc->vs_reg) contains valid data.
 */
msr |= MSR_VSX;
}
 #endif /* CONFIG_VSX */
-   err |= __put_user(&sc->gp_regs, &sc->regs);
+   unsafe_put_user(&sc->gp_regs, &sc->regs, efault_out);
WARN_ON(!FULL_REGS(regs));
-   err |= __copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE);
-   err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
-   err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);
-   err |= __put_user(signr, &sc->signal);
-   err |= __put_user(handler, &sc->handler);
+   unsafe_copy_to_user(&sc->gp_regs, reg

[PATCH v7 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2021-02-26 Thread Christopher M. Riedl
Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open by replacing all uaccess functions with their 'unsafe'
versions. Modify the callers to first open, call
unsafe_restore_sigcontext(), and then close the uaccess window.
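
The caller side then follows the same shape as the write path, roughly
(sketch; the actual rt_sigreturn() conversion is a later patch):

if (!user_read_access_begin(&uc->uc_mcontext, sizeof(uc->uc_mcontext)))
        goto badframe;

unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
                          badframe_block);

user_read_access_end();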

Signed-off-by: Christopher M. Riedl 
---
v7: * Don't use unsafe_op_wrap() since Christophe indicates this
  macro may go away in the future.
---
 arch/powerpc/kernel/signal_64.c | 68 -
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 78ae4bb4e590..23a44ec3ac01 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -327,14 +327,16 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
- struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, label) do {   \
+   if (__unsafe_restore_sigcontext(tsk, set, sig, sc)) \
+   goto label; \
+} while (0)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
sigset_t *set,
+   int sig, struct sigcontext 
__user *sc)
 {
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs;
 #endif
-   unsigned long err = 0;
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
@@ -349,27 +351,27 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
save_r13 = regs->gpr[13];
 
/* copy the GPRs */
-   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr), 
efault_out);
+   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
/* get MSR separately, transfer the LE bit if doing signal return */
-   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
if (sig)
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
/* Don't allow userspace to set SOFTE */
set_trap_norestart(regs);
-   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
if (!sig)
regs->gpr[13] = save_r13;
if (set != NULL)
-   err |=  __get_user(set->sig[0], &sc->oldmask);
+   unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
/*
 * Force reload of FP/VEC.
@@ -379,29 +381,27 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __get_user(v_regs, &sc->v_regs);
-   if (err)
-   return err;
+   unsafe_get_user(v_regs, &sc->v_regs, efault_out);
if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
return -EFAULT;
/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-   err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-  

[PATCH v7 08/10] powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

2021-02-26 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

There is no 'unsafe' version of copy_siginfo_to_user, so move it
slightly to allow for a "longer" uaccess block.

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 56 -
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 23a44ec3ac01..788854734b9a 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -854,45 +854,52 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
*set,
unsigned long msr = regs->msr;
 
frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
-   if (!access_ok(frame, sizeof(*frame)))
-   goto badframe;
 
-   err |= __put_user(&frame->info, &frame->pinfo);
-   err |= __put_user(&frame->uc, &frame->puc);
-   err |= copy_siginfo_to_user(&frame->info, &ksig->info);
-   if (err)
+   /* This only applies when calling unsafe_setup_sigcontext() and must be
+* called before opening the uaccess window.
+*/
+   if (!MSR_TM_ACTIVE(msr))
+   prepare_setup_sigcontext(tsk);
+
+   if (!user_write_access_begin(frame, sizeof(*frame)))
goto badframe;
 
+   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
+   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
+
/* Create the ucontext.  */
-   err |= __put_user(0, &frame->uc.uc_flags);
-   err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
+   unsafe_put_user(0, &frame->uc.uc_flags, badframe_block);
+   unsafe_save_altstack(&frame->uc.uc_stack, regs->gpr[1], badframe_block);
 
if (MSR_TM_ACTIVE(msr)) {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* The ucontext_t passed to userland points to the second
 * ucontext_t (for transactional state) with its uc_link ptr.
 */
-   err |= __put_user(&frame->uc_transact, &frame->uc.uc_link);
+   unsafe_put_user(&frame->uc_transact, &frame->uc.uc_link, 
badframe_block);
+
+   user_write_access_end();
+
err |= setup_tm_sigcontexts(&frame->uc.uc_mcontext,
&frame->uc_transact.uc_mcontext,
tsk, ksig->sig, NULL,
(unsigned 
long)ksig->ka.sa.sa_handler,
msr);
+
+   if (!user_write_access_begin(&frame->uc.uc_sigmask,
+sizeof(frame->uc.uc_sigmask)))
+   goto badframe;
+
 #endif
} else {
-   err |= __put_user(0, &frame->uc.uc_link);
-   prepare_setup_sigcontext(tsk);
-   if (!user_write_access_begin(&frame->uc.uc_mcontext,
-sizeof(frame->uc.uc_mcontext)))
-   return -EFAULT;
-   err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
-   ksig->sig, NULL,
-   (unsigned 
long)ksig->ka.sa.sa_handler, 1);
-   user_write_access_end();
+   unsafe_put_user(0, &frame->uc.uc_link, badframe_block);
+   unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
+   NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
+   1, badframe_block);
}
-   err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-   if (err)
-   goto badframe;
+
+   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
badframe_block);
+   user_write_access_end();
 
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
@@ -907,6 +914,11 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
regs->nip = (unsigned long) &frame->tramp[0];
}
 
+
+   /* Save the siginfo outside of the unsafe block. */
+   if (copy_siginfo_to_user(&frame->info, &ksig->info))
+   goto badframe;
+
/* Allocate a dummy caller frame for the signal handler. */
newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
@@ -946,6 +958,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 
return 0;
 
+badframe_block:
+   user_write_access_end();
 badframe:
signal_fault(current, regs, "handle_rt_signal64", frame);
 
-- 
2.26.1



[PATCH v7 00/10] Improve signal performance on PPC64 with KUAP

2021-02-26 Thread Christopher M. Riedl
As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

Existing work on x86, and by Christophe Leroy for PPC32, instead opens up
user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix and now also hash MMU systems [1].
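
The basic transformation applied throughout this series is sketched below
(frame->x and frame->y are stand-ins for the real sigframe fields):

/* Before: every access opens and closes KUAP (an mtspr pair each) */
err |= __put_user(a, &frame->x);
err |= __put_user(b, &frame->y);

/* After: one open/close pair covers a whole block of accesses */
if (!user_write_access_begin(frame, sizeof(*frame)))
        goto badframe;
unsafe_put_user(a, &frame->x, badframe_block);
unsafe_put_user(b, &frame->y, badframe_block);
user_write_access_end();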

Hash MMU KUAP support along with uaccess flush has landed in linuxppc/next
since the last revision. This series also provides a large benefit on hash
with KUAP. However, in the hash implementation of KUAP the user AMR is
always restored during system_call_exception() which cannot be avoided.
Fewer user access switches naturally also result in less uaccess flushing.

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto they still allow for
avoiding the repeated uaccess switches.

The third patch moves functions called by setup_sigcontext() into a new
prepare_setup_sigcontext() to simplify converting setup_sigcontext()
into an 'unsafe' version which assumes an open uaccess window later.

The fourth and fifths patches clean-up some of the Transactional Memory
ifdef stuff to simplify using uaccess blocks later.

The next two patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last three patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from running a minimal
kernel+initramfs QEMU/KVM guest on a POWER9 Blackbird:

signal1_threads -t1 -s10

|  | hash   | radix  |
|  | -- | -- |
| linuxppc/next| 117898 | 135884 |
| linuxppc/next w/o KUAP+KUEP  | 225502 | 227509 |
| unsafe-signal64  | 195351 | 230922 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

v7: * Address feedback from Christophe Leroy

v6: * Rebase on latest linuxppc/next and address feedback comments from
  Daniel Axtens and friends (also pick up some Reviewed-by tags)
* Simplify __get_user_sigset(), fix sparse warnings, and use it
  in ppc32 signal handling
* Remove ctx_has_vsx_region arg to prepare_setup_sigcontext()
* Remove local buffer in copy_{fpr,vsx}_from_user()
* Rework the TM ifdefery-removal and remove one of the ifdef
  pairs altogether

v5: * Use sizeof(buf) in copy_{vsx,fpr}_from_user() (Thanks David Laight)
* Rebase on latest linuxppc/next

v4: * Fix issues identified by Christophe Leroy (Thanks for review)
* Use __get_user() directly to copy the 8B sigset_t

v3: * Rebase on latest linuxppc/next
* Reword confusing commit messages
* Add missing comma in macro in signal.h which broke compiles without
  CONFIG_ALTIVEC
* Validate hash KUAP signal performance improvements

v2: * Rebase on latest linuxppc/next + Christophe Leroy's PPC32
  signal series
* Simplify/remove TM ifdefery similar to PPC32 series and clean
  up the uaccess begin/end calls
* Isolate non-inline functions so they are not called when
  uaccess window is open

Christopher M. Riedl (8):
  powerpc/uaccess: Add unsafe_copy_from_user()
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc/signal64: Remove non-inline calls from setup_sigcontext()
  powerpc: Reference parameter in MSR_TM_ACTIVE() macro
  powerpc/signal64: Remove TM ifdefery in middle of if/else block
  powerpc/signal64: Replace setup_sigcontext() w/
unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
unsafe_restore_sigcontext()
  powerpc/signal: Use __get_user() to copy sigset_t

Daniel Axtens (2):
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/reg.h |   2 +-
 arch/powerpc/include/asm/uaccess.h |  21 ++
 arch/powerpc/kernel/process.c  |   3 +-
 arch/powerpc/kernel/signal.h   |  33 +++
 arch/powerpc/kernel/signal_32.c|   2 +-
 arch/powerpc/kernel/signal_64.c| 316 +
 6 files changed, 246 insertions(+), 131 deletions(-)

-- 
2.26.1



[PATCH v7 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block

2021-02-26 Thread Christopher M. Riedl
Both rt_sigreturn() and handle_rt_signal64() contain TM-related ifdefs
which break up an if/else block. Provide stubs for the ifdef-guarded TM
functions and remove the need for an ifdef in rt_sigreturn().

Rework the remaining TM ifdef in handle_rt_signal64() similar to
commit f1cf4f93de2f ("powerpc/signal32: Remove ifdefery in middle of if/else").

Unlike in the commit for ppc32, the ifdef can't be removed entirely
since uc_transact in sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/process.c   |   3 +-
 arch/powerpc/kernel/signal_64.c | 102 
 2 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 924d023dad0a..08c3fbe45921 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1117,9 +1117,10 @@ void restore_tm_state(struct pt_regs *regs)
regs->msr |= msr_diff;
 }
 
-#else
+#else /* !CONFIG_PPC_TRANSACTIONAL_MEM */
 #define tm_recheckpoint_new_task(new)
 #define __switch_to_tm(prev, new)
+void tm_reclaim_current(uint8_t cause) {}
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
 static inline void save_sprs(struct thread_struct *t)
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 6ca546192cbf..bd8d210c9115 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -594,6 +594,12 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
 
return err;
 }
+#else /* !CONFIG_PPC_TRANSACTIONAL_MEM */
+static long restore_tm_sigcontexts(struct task_struct *tsk, struct sigcontext 
__user *sc,
+  struct sigcontext __user *tm_sc)
+{
+   return -EINVAL;
+}
 #endif
 
 /*
@@ -710,9 +716,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct ucontext __user *uc = (struct ucontext __user *)regs->gpr[1];
sigset_t set;
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
unsigned long msr;
-#endif
 
/* Always make any pending restarted system calls return -EINTR */
current->restart_block.fn = do_no_restart_syscall;
@@ -724,48 +728,50 @@ SYSCALL_DEFINE0(rt_sigreturn)
goto badframe;
set_current_blocked(&set);
 
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   /*
-* If there is a transactional state then throw it away.
-* The purpose of a sigreturn is to destroy all traces of the
-* signal frame, this includes any transactional state created
-* within in. We only check for suspended as we can never be
-* active in the kernel, we are active, there is nothing better to
-* do than go ahead and Bad Thing later.
-* The cause is not important as there will never be a
-* recheckpoint so it's not user visible.
-*/
-   if (MSR_TM_SUSPENDED(mfmsr()))
-   tm_reclaim_current(0);
+   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM)) {
+   /*
+* If there is a transactional state then throw it away.
+* The purpose of a sigreturn is to destroy all traces of the
+* signal frame, this includes any transactional state created
+* within in. We only check for suspended as we can never be
+* active in the kernel, we are active, there is nothing better 
to
+* do than go ahead and Bad Thing later.
+* The cause is not important as there will never be a
+* recheckpoint so it's not user visible.
+*/
+   if (MSR_TM_SUSPENDED(mfmsr()))
+   tm_reclaim_current(0);
 
-   /*
-* Disable MSR[TS] bit also, so, if there is an exception in the
-* code below (as a page fault in copy_ckvsx_to_user()), it does
-* not recheckpoint this task if there was a context switch inside
-* the exception.
-*
-* A major page fault can indirectly call schedule(). A reschedule
-* process in the middle of an exception can have a side effect
-* (Changing the CPU MSR[TS] state), since schedule() is called
-* with the CPU MSR[TS] disable and returns with MSR[TS]=Suspended
-* (switch_to() calls tm_recheckpoint() for the 'new' process). In
-* this case, the process continues to be the same in the CPU, but
-* the CPU state just changed.
-*
-* This can cause a TM Bad Thing, since the MSR in the stack will
-* have the MSR[TS]=0, and this is what will be used to RFID.
-*
-* Clearing MSR[TS] state here will avoid a recheckpoint if there
-* is any process reschedule in kernel space. The MSR[TS] state
-* does not need to be saved also, since it will be replaced with
-* the MSR[TS] that came from user con

[PATCH v7 03/10] powerpc/signal64: Remove non-inline calls from setup_sigcontext()

2021-02-26 Thread Christopher M. Riedl
The majority of setup_sigcontext() can be refactored to execute in an
"unsafe" context assuming an open uaccess window except for some
non-inline function calls. Move these out into a separate
prepare_setup_sigcontext() function which must be called before
opening up a uaccess window. Non-inline function calls should be
avoided during a uaccess window for a few reasons:

- KUAP should be enabled for as much kernel code as possible.
  Opening a uaccess window disables KUAP which means any code
  executed during this time contributes to a potential attack
  surface.

- Non-inline functions default to traceable which means they are
  instrumented for ftrace. This adds more code which could run
  with KUAP disabled.

- Powerpc does not currently support the objtool UACCESS checks.
  All code running with uaccess must be audited manually which
  means: less code -> less work -> fewer problems (in theory).

A follow-up commit converts setup_sigcontext() to be "unsafe".
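
After this change the callers follow the pattern in this rough sketch
(the uaccess window itself only appears once setup_sigcontext() becomes
"unsafe" in the follow-up commit):

/* Flush live register state while KUAP is still closed ... */
prepare_setup_sigcontext(tsk);

/* ... then setup_sigcontext() performs only the user accesses */
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig, NULL,
                        (unsigned long)ksig->ka.sa.sa_handler, 1);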

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index f9e4a1ac440f..6ca546192cbf 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
sigcontext __user *sc)
 }
 #endif
 
+static void prepare_setup_sigcontext(struct task_struct *tsk)
+{
+#ifdef CONFIG_ALTIVEC
+   /* save altivec registers */
+   if (tsk->thread.used_vr)
+   flush_altivec_to_thread(tsk);
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
+#endif /* CONFIG_ALTIVEC */
+
+   flush_fp_to_thread(tsk);
+
+#ifdef CONFIG_VSX
+   if (tsk->thread.used_vsr)
+   flush_vsx_to_thread(tsk);
+#endif /* CONFIG_VSX */
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs = sigcontext_vmx_regs(sc);
-   unsigned long vrsave;
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
@@ -112,7 +129,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 
/* save altivec registers */
if (tsk->thread.used_vr) {
-   flush_altivec_to_thread(tsk);
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
  33 * sizeof(vector128));
@@ -124,17 +140,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   vrsave = 0;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC)) {
-   vrsave = mfspr(SPRN_VRSAVE);
-   tsk->thread.vrsave = vrsave;
-   }
-
-   err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
 #else /* CONFIG_ALTIVEC */
err |= __put_user(0, &sc->v_regs);
 #endif /* CONFIG_ALTIVEC */
-   flush_fp_to_thread(tsk);
/* copy fpr regs and fpscr */
err |= copy_fpr_to_user(&sc->fp_regs, tsk);
 
@@ -150,7 +159,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 * VMX data.
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
-   flush_vsx_to_thread(tsk);
v_regs += ELF_NVRREG;
err |= copy_vsx_to_user(v_regs, tsk);
/* set MSR_VSX in the MSR value in the frame to
@@ -655,6 +663,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
ctx_has_vsx_region = 1;
 
if (old_ctx != NULL) {
+   prepare_setup_sigcontext(current);
if (!access_ok(old_ctx, ctx_size)
|| setup_sigcontext(&old_ctx->uc_mcontext, current, 0, 
NULL, 0,
ctx_has_vsx_region)
@@ -842,6 +851,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 #endif
{
err |= __put_user(0, &frame->uc.uc_link);
+   prepare_setup_sigcontext(tsk);
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
1);
-- 
2.26.1



[PATCH v7 04/10] powerpc: Reference parameter in MSR_TM_ACTIVE() macro

2021-02-26 Thread Christopher M. Riedl
Unlike the other MSR_TM_* macros, MSR_TM_ACTIVE does not reference or
use its parameter unless CONFIG_PPC_TRANSACTIONAL_MEM is defined. This
causes an 'unused variable' compile warning unless the variable is also
guarded with CONFIG_PPC_TRANSACTIONAL_MEM.

Reference but do nothing with the argument in the macro to avoid a
potential compile warning.
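
For example, with CONFIG_PPC_TRANSACTIONAL_MEM=n the old definition turns
the sketch below into "declare msr, never use it", which the compiler can
flag; evaluating the argument via ((void)(x), 0) avoids that:

struct pt_regs *regs = current_pt_regs();
unsigned long msr = regs->msr;  /* only consumed by the macro below */

if (MSR_TM_ACTIVE(msr)) {
        /* transactional-state handling ... */
}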

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Daniel Axtens 
---
 arch/powerpc/include/asm/reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index da103e92c112..1be20bc8dce2 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -124,7 +124,7 @@
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
 #else
-#define MSR_TM_ACTIVE(x) 0
+#define MSR_TM_ACTIVE(x) ((void)(x), 0)
 #endif
 
 #if defined(CONFIG_PPC_BOOK3S_64)
-- 
2.26.1



[PATCH v7 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-26 Thread Christopher M. Riedl
Reuse the "safe" implementation from signal.c but call unsafe_get_user()
directly in a loop to avoid the intermediate copy into a local buffer.
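
A minimal usage sketch, assuming the read access window is managed by the
caller and efault_out is its error label (the real user is the "unsafe"
sigcontext restore later in this series):

if (!user_read_access_begin(sc, sizeof(*sc)))
        return -EFAULT;

/* copy fpr regs and fpscr without bouncing through a local buffer */
unsafe_copy_fpr_from_user(tsk, &sc->fp_regs, efault_out);

user_read_access_end();
return 0;

efault_out:
user_read_access_end();
return -EFAULT;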

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Daniel Axtens 
---
 arch/powerpc/kernel/signal.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index 2559a681536e..d8dd76b1dc94 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -53,6 +53,26 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, 
void __user *from);
&buf[i], label);\
 } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *buf = (u64 __user *)from;   \
+   int i;  \
+   \
+   for (i = 0; i < ELF_NFPREG - 1; i++)\
+   unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
+   unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
+} while (0)
+
+#define unsafe_copy_vsx_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *buf = (u64 __user *)from;   \
+   int i;  \
+   \
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
+   unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
+   &buf[i], label);\
+} while (0)
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define unsafe_copy_ckfpr_to_user(to, task, label) do {\
struct task_struct *__t = task; \
@@ -80,6 +100,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
*task, void __user *from);
unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
ELF_NFPREG * sizeof(double), label)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   \
+   unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\
+   ELF_NFPREG * sizeof(double), label)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
@@ -115,6 +139,8 @@ copy_ckfpr_from_user(struct task_struct *task, void __user 
*from)
 #else
 #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
-- 
2.26.1



[PATCH v7 09/10] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2021-02-26 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 788854734b9a..00c907022707 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -822,11 +822,11 @@ SYSCALL_DEFINE0(rt_sigreturn)
 */
current->thread.regs->msr &= ~MSR_TS_MASK;
if (!user_read_access_begin(&uc->uc_mcontext, 
sizeof(uc->uc_mcontext)))
-   return -EFAULT;
-   if (__unsafe_restore_sigcontext(current, NULL, 1, 
&uc->uc_mcontext)) {
-   user_read_access_end();
goto badframe;
-   }
+
+   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+ badframe_block);
+
user_read_access_end();
}
 
@@ -836,6 +836,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
set_thread_flag(TIF_RESTOREALL);
return 0;
 
+badframe_block:
+   user_read_access_end();
 badframe:
signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.26.1



[PATCH v7 01/10] powerpc/uaccess: Add unsafe_copy_from_user()

2021-02-26 Thread Christopher M. Riedl
Use the same approach as unsafe_copy_to_user() but instead call
unsafe_get_user() in a loop.
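
As a worked example (illustrative only), unsafe_copy_from_user(dst, src, 15, e) on
ppc64 decomposes into four word-sized accesses rather than a byte-at-a-time loop:

	unsafe_get_user(*(long *)(dst + 0), (long __user *)(src + 0), e);	/* bytes 0-7   */
	unsafe_get_user(*(u32 *)(dst + 8),  (u32 __user *)(src + 8),  e);	/* bytes 8-11  */
	unsafe_get_user(*(u16 *)(dst + 12), (u16 __user *)(src + 12), e);	/* bytes 12-13 */
	unsafe_get_user(*(u8 *)(dst + 14),  (u8 __user *)(src + 14),  e);	/* byte  14    */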

Signed-off-by: Christopher M. Riedl 
---
v7: * Change implementation to call unsafe_get_user() and remove
  dja's 'Reviewed-by' tag
---
 arch/powerpc/include/asm/uaccess.h | 21 +
 1 file changed, 21 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 78e2a3990eab..ef5978b73a8d 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -487,6 +487,27 @@ user_write_access_begin(const void __user *ptr, size_t len)
 #define unsafe_put_user(x, p, e) \
__unsafe_put_user_goto((__typeof__(*(p)))(x), (p), sizeof(*(p)), e)
 
+#define unsafe_copy_from_user(d, s, l, e) \
+do {									\
+	u8 *_dst = (u8 *)(d);						\
+	const u8 __user *_src = (const u8 __user *)(s);			\
+	size_t _len = (l);						\
+	int _i;								\
+									\
+	for (_i = 0; _i < (_len & ~(sizeof(long) - 1)); _i += sizeof(long))	\
+		unsafe_get_user(*(long *)(_dst + _i), (long __user *)(_src + _i), e);	\
+	if (IS_ENABLED(CONFIG_PPC64) && (_len & 4)) {			\
+		unsafe_get_user(*(u32 *)(_dst + _i), (u32 __user *)(_src + _i), e);	\
+		_i += 4;						\
+	}								\
+	if (_len & 2) {							\
+		unsafe_get_user(*(u16 *)(_dst + _i), (u16 __user *)(_src + _i), e);	\
+		_i += 2;						\
+	}								\
+	if (_len & 1)							\
+		unsafe_get_user(*(u8 *)(_dst + _i), (u8 __user *)(_src + _i), e);	\
+} while (0)
+
 #define unsafe_copy_to_user(d, s, l, e) \
 do {   \
u8 __user *_dst = (u8 __user *)(d); \
-- 
2.26.1



Re: [PATCH v6 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2021-02-24 Thread Christopher M. Riedl
On Tue Feb 23, 2021 at 11:36 AM CST, Christophe Leroy wrote:
>
>
> Le 21/02/2021 à 02:23, Christopher M. Riedl a écrit :
> > Previously restore_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> > 
> > Rewrite restore_sigcontext() to assume that a userspace read access
> > window is open by replacing all uaccess functions with their 'unsafe'
> > versions. Modify the callers to first open, call
> > unsafe_restore_sigcontext(), and then close the uaccess window.
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal_64.c | 68 -
> >   1 file changed, 41 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 3faaa736ed62..76b525261f61 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext 
> > __user *sc,
> >   /*
> >* Restore the sigcontext from the signal frame.
> >*/
> > -
> > -static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int 
> > sig,
> > - struct sigcontext __user *sc)
> > +#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
> > +   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
>
> unsafe_op_wrap() was not initially meant to be used outside of uaccess.h
>
> In the beginning, it has been copied from include/linux/uaccess.h and was
> used
> for unsafe_put_user(), unsafe_get_user() and unsafe_copy_to_user().
> After other changes, only
> unsafe_get_user() is still using it and I'm going to drop
> unsafe_op_wrap() soon.
>
> I'd prefer if you can do the same as unsafe_save_general_regs() and
> others in signal_32.c

Sounds good, will change this in the next version (and also the wrapper
around unsafe_setup_sigcontext()).

>
> > +static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
> > sigset_t *set,
> > +   int sig, struct sigcontext 
> > __user *sc)
> >   {
> >   #ifdef CONFIG_ALTIVEC
> > elf_vrreg_t __user *v_regs;
> >   #endif
> > -   unsigned long err = 0;
> > unsigned long save_r13 = 0;
> > unsigned long msr;
> > struct pt_regs *regs = tsk->thread.regs;
> > @@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct 
> > *tsk, sigset_t *set, int sig,
> > save_r13 = regs->gpr[13];
> >   
> > /* copy the GPRs */
> > -   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
> > -   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
> > +   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
> > + efault_out);
>
> I think it would be better to keep the above on a single line for
> readability.
> Nowadays we tolerate 100 chars lines for cases like this one.

Ok, changed this (and the line you mention further below) in the next
version.

>
> > +   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
> > /* get MSR separately, transfer the LE bit if doing signal return */
> > -   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
> > +   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
> > if (sig)
> > regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
> > -   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
> > -   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
> > -   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
> > -   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
> > -   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
> > +   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
> > +   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
> > +   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
> > +   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
> > +   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
> > /* Don't allow userspace to set SOFTE */
> > set_trap_norestart(regs);
> > -   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
> > -   err |= __get_user(regs->dsis

Re: [PATCH v6 06/10] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2021-02-24 Thread Christopher M. Riedl
On Tue Feb 23, 2021 at 11:12 AM CST, Christophe Leroy wrote:
>
>
> Le 21/02/2021 à 02:23, Christopher M. Riedl a écrit :
> > Previously setup_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> > 
> > Rewrite setup_sigcontext() to assume that a userspace write access window
> > is open by replacing all uaccess functions with their 'unsafe' versions.
> > Modify the callers to first open, call unsafe_setup_sigcontext() and
> > then close the uaccess window.
>
> Do you plan to also convert setup_tm_sigcontexts() ?
> It would allow to then remove copy_fpr_to_user() and
> copy_ckfpr_to_user() and maybe other functions too.

I don't intend to convert the TM functions as part of this series.
Partially because I've been "threatened" with TM ownership for touching
the code :) and also because TM enhancements are a pretty low priority I
think.

>
> Christophe
>
> > 
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >   arch/powerpc/kernel/signal_64.c | 71 -
> >   1 file changed, 44 insertions(+), 27 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index bd8d210c9115..3faaa736ed62 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -101,9 +101,13 @@ static void prepare_setup_sigcontext(struct 
> > task_struct *tsk)
> >* Set up the sigcontext for the signal frame.
> >*/
> >   
> > -static long setup_sigcontext(struct sigcontext __user *sc,
> > -   struct task_struct *tsk, int signr, sigset_t *set,
> > -   unsigned long handler, int ctx_has_vsx_region)
> > +#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler,  
> > \
> > +   ctx_has_vsx_region, e)  \
> > +   unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set,   \
> > +   handler, ctx_has_vsx_region), e)
> > +static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
> > +   struct task_struct *tsk, int signr, 
> > sigset_t *set,
> > +   unsigned long handler, int 
> > ctx_has_vsx_region)
> >   {
> > /* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
> >  * process never used altivec yet (MSR_VEC is zero in pt_regs of
> > @@ -118,20 +122,19 @@ static long setup_sigcontext(struct sigcontext __user 
> > *sc,
> >   #endif
> > struct pt_regs *regs = tsk->thread.regs;
> > unsigned long msr = regs->msr;
> > -   long err = 0;
> > /* Force usr to alway see softe as 1 (interrupts enabled) */
> > unsigned long softe = 0x1;
> >   
> > BUG_ON(tsk != current);
> >   
> >   #ifdef CONFIG_ALTIVEC
> > -   err |= __put_user(v_regs, &sc->v_regs);
> > +   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
> >   
> > /* save altivec registers */
> > if (tsk->thread.used_vr) {
> > /* Copy 33 vec registers (vr0..31 and vscr) to the stack */
> > -   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
> > - 33 * sizeof(vector128));
> > +   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
> > +   33 * sizeof(vector128), efault_out);
> > /* set MSR_VEC in the MSR value in the frame to indicate that 
> > sc->v_reg)
> >  * contains valid data.
> >  */
> > @@ -140,12 +143,12 @@ static long setup_sigcontext(struct sigcontext __user 
> > *sc,
> > /* We always copy to/from vrsave, it's 0 if we don't have or don't
> >  * use altivec.
> >  */
> > -   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
> > +   unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], 
> > efault_out);
> >   #else /* CONFIG_ALTIVEC */
> > -   err |= __put_user(0, &sc->v_regs);
> > +   unsafe_put_user(0, &sc->v_regs, efault_out);
> >   #endif /* CONFIG_ALTIVEC */
> > /* copy fpr regs and fpscr */
> > -   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
> > +   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
> >   
> > /*
> >  * Clear the MSR VSX bit to indicate there is no valid state attached
> >

Re: [PATCH v6 01/10] powerpc/uaccess: Add unsafe_copy_from_user

2021-02-24 Thread Christopher M. Riedl
On Tue Feb 23, 2021 at 11:15 AM CST, Christophe Leroy wrote:
>
>
> Le 21/02/2021 à 02:23, Christopher M. Riedl a écrit :
> > Just wrap __copy_tofrom_user() for the usual 'unsafe' pattern which
> > accepts a label to goto on error.
> > 
> > Signed-off-by: Christopher M. Riedl 
> > Reviewed-by: Daniel Axtens 
> > ---
> >   arch/powerpc/include/asm/uaccess.h | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/powerpc/include/asm/uaccess.h 
> > b/arch/powerpc/include/asm/uaccess.h
> > index 78e2a3990eab..33b2de642120 100644
> > --- a/arch/powerpc/include/asm/uaccess.h
> > +++ b/arch/powerpc/include/asm/uaccess.h
> > @@ -487,6 +487,9 @@ user_write_access_begin(const void __user *ptr, size_t 
> > len)
> >   #define unsafe_put_user(x, p, e) \
> > __unsafe_put_user_goto((__typeof__(*(p)))(x), (p), sizeof(*(p)), e)
> >   
> > +#define unsafe_copy_from_user(d, s, l, e) \
> > +   unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
> > +
>
> Could we perform same as unsafe_copy_to_user() instead of calling an
> external function which is
> banned in principle inside uaccess blocks ?

Yup, with your patch to move the barrier_nospec() into the allowance
helpers this makes sense now. I just tried it and performance does not
change significantly w/ either radix or hash translation. I will include
this change in the next spin - thanks!

>
>
> >   #define unsafe_copy_to_user(d, s, l, e) \
> >   do {  
> > \
> > u8 __user *_dst = (u8 __user *)(d); \
> > 



[PATCH v6 06/10] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2021-02-20 Thread Christopher M. Riedl
Previously setup_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite setup_sigcontext() to assume that a userspace write access window
is open by replacing all uaccess functions with their 'unsafe' versions.
Modify the callers to first open, call unsafe_setup_sigcontext() and
then close the uaccess window.
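
For reference, the unsafe_setup_sigcontext() wrapper below relies on
unsafe_op_wrap(), which is roughly the following (as copied into powerpc's
uaccess.h at the time of this series) - it simply turns a non-zero return into a
goto:

#define unsafe_op_wrap(op, label)	do { if (unlikely(op)) goto label; } while (0)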

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 71 -
 1 file changed, 44 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index bd8d210c9115..3faaa736ed62 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -101,9 +101,13 @@ static void prepare_setup_sigcontext(struct task_struct 
*tsk)
  * Set up the sigcontext for the signal frame.
  */
 
-static long setup_sigcontext(struct sigcontext __user *sc,
-   struct task_struct *tsk, int signr, sigset_t *set,
-   unsigned long handler, int ctx_has_vsx_region)
+#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler,  \
+   ctx_has_vsx_region, e)  \
+   unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set,   \
+   handler, ctx_has_vsx_region), e)
+static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
+   struct task_struct *tsk, int signr, 
sigset_t *set,
+   unsigned long handler, int 
ctx_has_vsx_region)
 {
/* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
 * process never used altivec yet (MSR_VEC is zero in pt_regs of
@@ -118,20 +122,19 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
-   long err = 0;
/* Force usr to alway see softe as 1 (interrupts enabled) */
unsigned long softe = 0x1;
 
BUG_ON(tsk != current);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __put_user(v_regs, &sc->v_regs);
+   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
 
/* save altivec registers */
if (tsk->thread.used_vr) {
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
-   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
- 33 * sizeof(vector128));
+   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
+   33 * sizeof(vector128), efault_out);
/* set MSR_VEC in the MSR value in the frame to indicate that 
sc->v_reg)
 * contains valid data.
 */
@@ -140,12 +143,12 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
+   unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], 
efault_out);
 #else /* CONFIG_ALTIVEC */
-   err |= __put_user(0, &sc->v_regs);
+   unsafe_put_user(0, &sc->v_regs, efault_out);
 #endif /* CONFIG_ALTIVEC */
/* copy fpr regs and fpscr */
-   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
+   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
 
/*
 * Clear the MSR VSX bit to indicate there is no valid state attached
@@ -160,24 +163,27 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
v_regs += ELF_NVRREG;
-   err |= copy_vsx_to_user(v_regs, tsk);
+   unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);
/* set MSR_VSX in the MSR value in the frame to
 * indicate that sc->vs_reg) contains valid data.
 */
msr |= MSR_VSX;
}
 #endif /* CONFIG_VSX */
-   err |= __put_user(&sc->gp_regs, &sc->regs);
+   unsafe_put_user(&sc->gp_regs, &sc->regs, efault_out);
WARN_ON(!FULL_REGS(regs));
-   err |= __copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE);
-   err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
-   err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);
-   err |= __put_user(signr, &sc->signal);
-   err |= __put_user(handler, &sc->handler);
+   unsafe_copy_to_user(&sc->gp_regs, regs, GP_REGS_SIZE, efault_out);
+   unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
+   unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);
+   unsafe_put_user(signr, &sc->

[PATCH v6 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block

2021-02-20 Thread Christopher M. Riedl
Both rt_sigreturn() and handle_rt_signal64() contain TM-related ifdefs
which break up an if/else block. Provide stubs for the ifdef-guarded TM
functions and remove the need for an ifdef in rt_sigreturn().

Rework the remaining TM ifdef in handle_rt_signal64() similar to
commit f1cf4f93de2f ("powerpc/signal32: Remove ifdefery in middle of if/else").

Unlike in the commit for ppc32, the ifdef can't be removed entirely
since uc_transact in sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/process.c   |   3 +-
 arch/powerpc/kernel/signal_64.c | 102 
 2 files changed, 54 insertions(+), 51 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 924d023dad0a..08c3fbe45921 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -1117,9 +1117,10 @@ void restore_tm_state(struct pt_regs *regs)
regs->msr |= msr_diff;
 }
 
-#else
+#else /* !CONFIG_PPC_TRANSACTIONAL_MEM */
 #define tm_recheckpoint_new_task(new)
 #define __switch_to_tm(prev, new)
+void tm_reclaim_current(uint8_t cause) {}
 #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
 
 static inline void save_sprs(struct thread_struct *t)
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 6ca546192cbf..bd8d210c9115 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -594,6 +594,12 @@ static long restore_tm_sigcontexts(struct task_struct *tsk,
 
return err;
 }
+#else /* !CONFIG_PPC_TRANSACTIONAL_MEM */
+static long restore_tm_sigcontexts(struct task_struct *tsk, struct sigcontext 
__user *sc,
+  struct sigcontext __user *tm_sc)
+{
+   return -EINVAL;
+}
 #endif
 
 /*
@@ -710,9 +716,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
struct pt_regs *regs = current_pt_regs();
struct ucontext __user *uc = (struct ucontext __user *)regs->gpr[1];
sigset_t set;
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
unsigned long msr;
-#endif
 
/* Always make any pending restarted system calls return -EINTR */
current->restart_block.fn = do_no_restart_syscall;
@@ -724,48 +728,50 @@ SYSCALL_DEFINE0(rt_sigreturn)
goto badframe;
set_current_blocked(&set);
 
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-   /*
-* If there is a transactional state then throw it away.
-* The purpose of a sigreturn is to destroy all traces of the
-* signal frame, this includes any transactional state created
-* within in. We only check for suspended as we can never be
-* active in the kernel, we are active, there is nothing better to
-* do than go ahead and Bad Thing later.
-* The cause is not important as there will never be a
-* recheckpoint so it's not user visible.
-*/
-   if (MSR_TM_SUSPENDED(mfmsr()))
-   tm_reclaim_current(0);
+   if (IS_ENABLED(CONFIG_PPC_TRANSACTIONAL_MEM)) {
+   /*
+* If there is a transactional state then throw it away.
+* The purpose of a sigreturn is to destroy all traces of the
+* signal frame, this includes any transactional state created
+* within in. We only check for suspended as we can never be
+* active in the kernel, we are active, there is nothing better 
to
+* do than go ahead and Bad Thing later.
+* The cause is not important as there will never be a
+* recheckpoint so it's not user visible.
+*/
+   if (MSR_TM_SUSPENDED(mfmsr()))
+   tm_reclaim_current(0);
 
-   /*
-* Disable MSR[TS] bit also, so, if there is an exception in the
-* code below (as a page fault in copy_ckvsx_to_user()), it does
-* not recheckpoint this task if there was a context switch inside
-* the exception.
-*
-* A major page fault can indirectly call schedule(). A reschedule
-* process in the middle of an exception can have a side effect
-* (Changing the CPU MSR[TS] state), since schedule() is called
-* with the CPU MSR[TS] disable and returns with MSR[TS]=Suspended
-* (switch_to() calls tm_recheckpoint() for the 'new' process). In
-* this case, the process continues to be the same in the CPU, but
-* the CPU state just changed.
-*
-* This can cause a TM Bad Thing, since the MSR in the stack will
-* have the MSR[TS]=0, and this is what will be used to RFID.
-*
-* Clearing MSR[TS] state here will avoid a recheckpoint if there
-* is any process reschedule in kernel space. The MSR[TS] state
-* does not need to be saved also, since it will be replaced with
-* the MSR[TS] that came from user con

[PATCH v6 00/10] Improve signal performance on PPC64 with KUAP

2021-02-20 Thread Christopher M. Riedl
As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

There is existing work done on x86 and by Christophe Leroy for PPC32 to
instead open up user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix and now also hash MMU systems [1].
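
For illustration, the change in access pattern looks roughly like this (field names
taken from setup_sigcontext(); sketch only):

	/* before: every access opens and closes user access (an mtspr each way) */
	err |= __put_user(msr, &sc->gp_regs[PT_MSR]);
	err |= __put_user(softe, &sc->gp_regs[PT_SOFTE]);

	/* after: one open/close around a whole block of 'unsafe' accesses */
	if (!user_write_access_begin(sc, sizeof(*sc)))
		return -EFAULT;
	unsafe_put_user(msr, &sc->gp_regs[PT_MSR], efault_out);
	unsafe_put_user(softe, &sc->gp_regs[PT_SOFTE], efault_out);
	user_write_access_end();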

Hash MMU KUAP support along with uaccess flush has landed in linuxppc/next
since the last revision. This series also provides a large benefit on hash
with KUAP. However, in the hash implementation of KUAP the user AMR is
always restored during system_call_exception() which cannot be avoided.
Fewer user access switches naturally also result in less uaccess flushing.

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto they still allow for
avoiding the repeated uaccess switches.

The third patch moves functions called by setup_sigcontext() into a new
prepare_setup_sigcontext() to simplify converting setup_sigcontext()
into an 'unsafe' version which assumes an open uaccess window later.

The fourth and fifth patches clean up some of the Transactional Memory
ifdef stuff to simplify using uaccess blocks later.

The next two patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last three patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from running a minimal
kernel+initramfs QEMU/KVM guest on a POWER9 Blackbird:

signal1_threads -t1 -s10

|  | hash   | radix  |
|  | -- | -- |
| linuxppc/next| 117288 | 13 |
| linuxppc/next w/o KUAP+KUEP  | 224759 | 227886 |
| unsafe-signal64  | 197378 | 233528 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

v6: * Rebase on latest linuxppc/next and address feedback comments from
  Daniel Axtens and friends (also pick up some Reviewed-by tags)
* Simplify __get_user_sigset(), fix sparse warnings, and use it
  in ppc32 signal handling
* Remove ctx_has_vsx_region arg to prepare_setup_sigcontext()
* Remove local buffer in copy_{fpr,vsx}_from_user()
* Rework the TM ifdefery-removal and remove one of the ifdef
  pairs altogether

v5: * Use sizeof(buf) in copy_{vsx,fpr}_from_user() (Thanks David Laight)
* Rebase on latest linuxppc/next

v4: * Fix issues identified by Christophe Leroy (Thanks for review)
* Use __get_user() directly to copy the 8B sigset_t

v3: * Rebase on latest linuxppc/next
* Reword confusing commit messages
* Add missing comma in macro in signal.h which broke compiles without
  CONFIG_ALTIVEC
* Validate hash KUAP signal performance improvements

v2: * Rebase on latest linuxppc/next + Christophe Leroy's PPC32
  signal series
* Simplify/remove TM ifdefery similar to PPC32 series and clean
  up the uaccess begin/end calls
* Isolate non-inline functions so they are not called when
  uaccess window is open

Christopher M. Riedl (8):
  powerpc/uaccess: Add unsafe_copy_from_user
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc/signal64: Remove non-inline calls from setup_sigcontext()
  powerpc: Reference parameter in MSR_TM_ACTIVE() macro
  powerpc/signal64: Remove TM ifdefery in middle of if/else block
  powerpc/signal64: Replace setup_sigcontext() w/
unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
unsafe_restore_sigcontext()
  powerpc/signal: Use __get_user() to copy sigset_t

Daniel Axtens (2):
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/reg.h |   2 +-
 arch/powerpc/include/asm/uaccess.h |   3 +
 arch/powerpc/kernel/process.c  |   3 +-
 arch/powerpc/kernel/signal.h   |  33 +++
 arch/powerpc/kernel/signal_32.c|   2 +-
 arch/powerpc/kernel/signal_64.c| 315 +
 6 files changed, 227 insertions(+), 131 deletions(-)

-- 
2.26.1



[PATCH v6 08/10] powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches

2021-02-20 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

There is no 'unsafe' version of copy_siginfo_to_user, so move it
slightly to allow for a "longer" uaccess block.
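
In other words (a sketch of the resulting order; the actual change is in the diff
below), the non-inline copy_siginfo_to_user() only runs once the window is closed:

	unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), badframe_block);
	user_write_access_end();

	/* not an 'unsafe' helper - it does its own uaccess handling */
	if (copy_siginfo_to_user(&frame->info, &ksig->info))
		goto badframe;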

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 56 -
 1 file changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 76b525261f61..4bf73731533f 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -853,45 +853,52 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t 
*set,
unsigned long msr = regs->msr;
 
frame = get_sigframe(ksig, tsk, sizeof(*frame), 0);
-   if (!access_ok(frame, sizeof(*frame)))
-   goto badframe;
 
-   err |= __put_user(&frame->info, &frame->pinfo);
-   err |= __put_user(&frame->uc, &frame->puc);
-   err |= copy_siginfo_to_user(&frame->info, &ksig->info);
-   if (err)
+   /* This only applies when calling unsafe_setup_sigcontext() and must be
+* called before opening the uaccess window.
+*/
+   if (!MSR_TM_ACTIVE(msr))
+   prepare_setup_sigcontext(tsk);
+
+   if (!user_write_access_begin(frame, sizeof(*frame)))
goto badframe;
 
+   unsafe_put_user(&frame->info, &frame->pinfo, badframe_block);
+   unsafe_put_user(&frame->uc, &frame->puc, badframe_block);
+
/* Create the ucontext.  */
-   err |= __put_user(0, &frame->uc.uc_flags);
-   err |= __save_altstack(&frame->uc.uc_stack, regs->gpr[1]);
+   unsafe_put_user(0, &frame->uc.uc_flags, badframe_block);
+   unsafe_save_altstack(&frame->uc.uc_stack, regs->gpr[1], badframe_block);
 
if (MSR_TM_ACTIVE(msr)) {
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
/* The ucontext_t passed to userland points to the second
 * ucontext_t (for transactional state) with its uc_link ptr.
 */
-   err |= __put_user(&frame->uc_transact, &frame->uc.uc_link);
+   unsafe_put_user(&frame->uc_transact, &frame->uc.uc_link, 
badframe_block);
+
+   user_write_access_end();
+
err |= setup_tm_sigcontexts(&frame->uc.uc_mcontext,
&frame->uc_transact.uc_mcontext,
tsk, ksig->sig, NULL,
(unsigned 
long)ksig->ka.sa.sa_handler,
msr);
+
+   if (!user_write_access_begin(&frame->uc.uc_sigmask,
+sizeof(frame->uc.uc_sigmask)))
+   goto badframe;
+
 #endif
} else {
-   err |= __put_user(0, &frame->uc.uc_link);
-   prepare_setup_sigcontext(tsk);
-   if (!user_write_access_begin(&frame->uc.uc_mcontext,
-sizeof(frame->uc.uc_mcontext)))
-   return -EFAULT;
-   err |= __unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk,
-   ksig->sig, NULL,
-   (unsigned 
long)ksig->ka.sa.sa_handler, 1);
-   user_write_access_end();
+   unsafe_put_user(0, &frame->uc.uc_link, badframe_block);
+   unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
+   NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
+   1, badframe_block);
}
-   err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));
-   if (err)
-   goto badframe;
+
+   unsafe_copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set), 
badframe_block);
+   user_write_access_end();
 
/* Make sure signal handler doesn't get spurious FP exceptions */
tsk->thread.fp_state.fpscr = 0;
@@ -906,6 +913,11 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
regs->nip = (unsigned long) &frame->tramp[0];
}
 
+
+   /* Save the siginfo outside of the unsafe block. */
+   if (copy_siginfo_to_user(&frame->info, &ksig->info))
+   goto badframe;
+
/* Allocate a dummy caller frame for the signal handler. */
newsp = ((unsigned long)frame) - __SIGNAL_FRAMESIZE;
err |= put_user(regs->gpr[1], (unsigned long __user *)newsp);
@@ -945,6 +957,8 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 
return 0;
 
+badframe_block:
+   user_write_access_end();
 badframe:
signal_fault(current, regs, "handle_rt_signal64", frame);
 
-- 
2.26.1



[PATCH v6 10/10] powerpc/signal: Use __get_user() to copy sigset_t

2021-02-20 Thread Christopher M. Riedl
Usually sigset_t is exactly 8B which is a "trivial" size and does not
warrant using __copy_from_user(). Use __get_user() directly in
anticipation of future work to remove the trivial size optimizations
from __copy_from_user().

The ppc32 implementation of get_sigset_t() previously called
copy_from_user() which, unlike __copy_from_user(), calls access_ok().
Replacing this w/ __get_user() (no access_ok()) is fine here since both
callsites in signal_32.c are preceded by an earlier access_ok().
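
For context, a simplified sketch of the layout this relies on (adapted from the
powerpc signal uapi header, so treat the exact spelling as an assumption): _NSIG is
64 on powerpc, which makes sigset_t 8 bytes on both 32-bit and 64-bit.

typedef struct {
	unsigned long sig[_NSIG_WORDS];	/* 1 x 64-bit word on ppc64, 2 x 32-bit words on ppc32 */
} sigset_t;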

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal.h| 7 +++
 arch/powerpc/kernel/signal_32.c | 2 +-
 arch/powerpc/kernel/signal_64.c | 4 ++--
 3 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index d8dd76b1dc94..1393876f3814 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -19,6 +19,13 @@ extern int handle_signal32(struct ksignal *ksig, sigset_t 
*oldset,
 extern int handle_rt_signal32(struct ksignal *ksig, sigset_t *oldset,
  struct task_struct *tsk);
 
+static inline int __get_user_sigset(sigset_t *dst, const sigset_t __user *src)
+{
+   BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));
+
+   return __get_user(dst->sig[0], (u64 __user *)&src->sig[0]);
+}
+
 #ifdef CONFIG_VSX
 extern unsigned long copy_vsx_to_user(void __user *to,
  struct task_struct *task);
diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index 75ee918a120a..c505b444a613 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -144,7 +144,7 @@ static inline int restore_general_regs(struct pt_regs *regs,
 
 static inline int get_sigset_t(sigset_t *set, const sigset_t __user *uset)
 {
-   return copy_from_user(set, uset, sizeof(*uset));
+   return __get_user_sigset(set, uset);
 }
 
 #define to_user_ptr(p) ((unsigned long)(p))
diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 3dd89f99e26f..dc5bac727a62 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -707,7 +707,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * We kill the task with a SIGSEGV in this situation.
 */
 
-   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
+   if (__get_user_sigset(&set, &new_ctx->uc_sigmask))
do_exit(SIGSEGV);
set_current_blocked(&set);
 
@@ -746,7 +746,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!access_ok(uc, sizeof(*uc)))
goto badframe;
 
-   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
+   if (__get_user_sigset(&set, &uc->uc_sigmask))
goto badframe;
set_current_blocked(&set);
 
-- 
2.26.1



[PATCH v6 09/10] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2021-02-20 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 4bf73731533f..3dd89f99e26f 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -821,11 +821,11 @@ SYSCALL_DEFINE0(rt_sigreturn)
 */
current->thread.regs->msr &= ~MSR_TS_MASK;
if (!user_read_access_begin(&uc->uc_mcontext, 
sizeof(uc->uc_mcontext)))
-   return -EFAULT;
-   if (__unsafe_restore_sigcontext(current, NULL, 1, 
&uc->uc_mcontext)) {
-   user_read_access_end();
goto badframe;
-   }
+
+   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+ badframe_block);
+
user_read_access_end();
}
 
@@ -835,6 +835,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
set_thread_flag(TIF_RESTOREALL);
return 0;
 
+badframe_block:
+   user_read_access_end();
 badframe:
signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.26.1



[PATCH v6 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2021-02-20 Thread Christopher M. Riedl
Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open by replacing all uaccess functions with their 'unsafe'
versions. Modify the callers to first open, call
unsafe_restore_sigcontext(), and then close the uaccess window.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 68 -
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 3faaa736ed62..76b525261f61 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
- struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
+   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
sigset_t *set,
+   int sig, struct sigcontext 
__user *sc)
 {
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs;
 #endif
-   unsigned long err = 0;
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
@@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
save_r13 = regs->gpr[13];
 
/* copy the GPRs */
-   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
+ efault_out);
+   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
/* get MSR separately, transfer the LE bit if doing signal return */
-   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
if (sig)
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
/* Don't allow userspace to set SOFTE */
set_trap_norestart(regs);
-   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
if (!sig)
regs->gpr[13] = save_r13;
if (set != NULL)
-   err |=  __get_user(set->sig[0], &sc->oldmask);
+   unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
/*
 * Force reload of FP/VEC.
@@ -378,29 +379,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __get_user(v_regs, &sc->v_regs);
-   if (err)
-   return err;
+   unsafe_get_user(v_regs, &sc->v_regs, efault_out);
if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
return -EFAULT;
/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-   err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-   33 * sizeof(vector128));
+   unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
+ 33 * sizeof(vector128), ef

[PATCH v6 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-20 Thread Christopher M. Riedl
Reuse the "safe" implementation from signal.c but call unsafe_get_user()
directly in a loop to avoid the intermediate copy into a local buffer.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Daniel Axtens 
---
 arch/powerpc/kernel/signal.h | 26 ++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
index 2559a681536e..d8dd76b1dc94 100644
--- a/arch/powerpc/kernel/signal.h
+++ b/arch/powerpc/kernel/signal.h
@@ -53,6 +53,26 @@ unsigned long copy_ckfpr_from_user(struct task_struct *task, 
void __user *from);
&buf[i], label);\
 } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *buf = (u64 __user *)from;   \
+   int i;  \
+   \
+   for (i = 0; i < ELF_NFPREG - 1; i++)\
+   unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
+   unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
+} while (0)
+
+#define unsafe_copy_vsx_from_user(task, from, label)   do {\
+   struct task_struct *__t = task; \
+   u64 __user *buf = (u64 __user *)from;   \
+   int i;  \
+   \
+   for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
+   unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
+   &buf[i], label);\
+} while (0)
+
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define unsafe_copy_ckfpr_to_user(to, task, label) do {\
struct task_struct *__t = task; \
@@ -80,6 +100,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
*task, void __user *from);
unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
ELF_NFPREG * sizeof(double), label)
 
+#define unsafe_copy_fpr_from_user(task, from, label)   \
+   unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\
+   ELF_NFPREG * sizeof(double), label)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
@@ -115,6 +139,8 @@ copy_ckfpr_from_user(struct task_struct *task, void __user 
*from)
 #else
 #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
 
+#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
+
 static inline unsigned long
 copy_fpr_to_user(void __user *to, struct task_struct *task)
 {
-- 
2.26.1



[PATCH v6 03/10] powerpc/signal64: Remove non-inline calls from setup_sigcontext()

2021-02-20 Thread Christopher M. Riedl
The majority of setup_sigcontext() can be refactored to execute in an
"unsafe" context assuming an open uaccess window except for some
non-inline function calls. Move these out into a separate
prepare_setup_sigcontext() function which must be called first and
before opening up a uaccess window. Non-inline function calls should be
avoided during a uaccess window for a few reasons:

- KUAP should be enabled for as much kernel code as possible.
  Opening a uaccess window disables KUAP which means any code
  executed during this time contributes to a potential attack
  surface.

- Non-inline functions default to traceable which means they are
  instrumented for ftrace. This adds more code which could run
  with KUAP disabled.

- Powerpc does not currently support the objtool UACCESS checks.
  All code running with uaccess must be audited manually which
  means: less code -> less work -> fewer problems (in theory).

A follow-up commit converts setup_sigcontext() to be "unsafe".
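
Once that follow-up conversion lands, the calling pattern becomes roughly (sketch
only):

	prepare_setup_sigcontext(tsk);	/* non-inline flush_*_to_thread() calls, KUAP still enabled */
	if (!user_write_access_begin(&frame->uc.uc_mcontext, sizeof(frame->uc.uc_mcontext)))
		goto badframe;
	/* only inline 'unsafe' accessors run while the window is open */
	unsafe_setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig, NULL,
				(unsigned long)ksig->ka.sa.sa_handler, 1, badframe_block);
	user_write_access_end();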

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index f9e4a1ac440f..6ca546192cbf 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
sigcontext __user *sc)
 }
 #endif
 
+static void prepare_setup_sigcontext(struct task_struct *tsk)
+{
+#ifdef CONFIG_ALTIVEC
+   /* save altivec registers */
+   if (tsk->thread.used_vr)
+   flush_altivec_to_thread(tsk);
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
+#endif /* CONFIG_ALTIVEC */
+
+   flush_fp_to_thread(tsk);
+
+#ifdef CONFIG_VSX
+   if (tsk->thread.used_vsr)
+   flush_vsx_to_thread(tsk);
+#endif /* CONFIG_VSX */
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs = sigcontext_vmx_regs(sc);
-   unsigned long vrsave;
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
@@ -112,7 +129,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 
/* save altivec registers */
if (tsk->thread.used_vr) {
-   flush_altivec_to_thread(tsk);
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
  33 * sizeof(vector128));
@@ -124,17 +140,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   vrsave = 0;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC)) {
-   vrsave = mfspr(SPRN_VRSAVE);
-   tsk->thread.vrsave = vrsave;
-   }
-
-   err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
 #else /* CONFIG_ALTIVEC */
err |= __put_user(0, &sc->v_regs);
 #endif /* CONFIG_ALTIVEC */
-   flush_fp_to_thread(tsk);
/* copy fpr regs and fpscr */
err |= copy_fpr_to_user(&sc->fp_regs, tsk);
 
@@ -150,7 +159,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 * VMX data.
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
-   flush_vsx_to_thread(tsk);
v_regs += ELF_NVRREG;
err |= copy_vsx_to_user(v_regs, tsk);
/* set MSR_VSX in the MSR value in the frame to
@@ -655,6 +663,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
ctx_has_vsx_region = 1;
 
if (old_ctx != NULL) {
+   prepare_setup_sigcontext(current);
if (!access_ok(old_ctx, ctx_size)
|| setup_sigcontext(&old_ctx->uc_mcontext, current, 0, 
NULL, 0,
ctx_has_vsx_region)
@@ -842,6 +851,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 #endif
{
err |= __put_user(0, &frame->uc.uc_link);
+   prepare_setup_sigcontext(tsk);
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
1);
-- 
2.26.1



[PATCH v6 04/10] powerpc: Reference parameter in MSR_TM_ACTIVE() macro

2021-02-20 Thread Christopher M. Riedl
Unlike the other MSR_TM_* macros, MSR_TM_ACTIVE does not reference or
use its parameter unless CONFIG_PPC_TRANSACTIONAL_MEM is defined. This
causes an 'unused variable' compile warning unless the variable is also
guarded with CONFIG_PPC_TRANSACTIONAL_MEM.

Reference but do nothing with the argument in the macro to avoid a
potential compile warning.
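
A minimal illustration of the warning being avoided (hypothetical caller built with
CONFIG_PPC_TRANSACTIONAL_MEM=n):

	unsigned long msr = regs->msr;	/* previously flagged by -Wunused-variable when TM=n */

	if (MSR_TM_ACTIVE(msr)) {	/* now expands to ((void)(msr), 0); branch is compiled out */
		/* TM-only path */
	}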

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Daniel Axtens 
---
 arch/powerpc/include/asm/reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index da103e92c112..1be20bc8dce2 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -124,7 +124,7 @@
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
 #else
-#define MSR_TM_ACTIVE(x) 0
+#define MSR_TM_ACTIVE(x) ((void)(x), 0)
 #endif
 
 #if defined(CONFIG_PPC_BOOK3S_64)
-- 
2.26.1



[PATCH v6 01/10] powerpc/uaccess: Add unsafe_copy_from_user

2021-02-20 Thread Christopher M. Riedl
Just wrap __copy_tofrom_user() for the usual 'unsafe' pattern which
accepts a label to goto on error.

Signed-off-by: Christopher M. Riedl 
Reviewed-by: Daniel Axtens 
---
 arch/powerpc/include/asm/uaccess.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 78e2a3990eab..33b2de642120 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -487,6 +487,9 @@ user_write_access_begin(const void __user *ptr, size_t len)
 #define unsafe_put_user(x, p, e) \
__unsafe_put_user_goto((__typeof__(*(p)))(x), (p), sizeof(*(p)), e)
 
+#define unsafe_copy_from_user(d, s, l, e) \
+   unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
+
 #define unsafe_copy_to_user(d, s, l, e) \
 do {   \
u8 __user *_dst = (u8 __user *)(d); \
-- 
2.26.1



Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-17 Thread Christopher M. Riedl
On Fri Feb 12, 2021 at 3:21 PM CST, Daniel Axtens wrote:
> "Christopher M. Riedl"  writes:
>
> > Usually sigset_t is exactly 8B which is a "trivial" size and does not
> > warrant using __copy_from_user(). Use __get_user() directly in
> > anticipation of future work to remove the trivial size optimizations
> > from __copy_from_user(). Calling __get_user() also results in a small
> > boost to signal handling throughput here.
>
> Modulo the comments from Christophe, this looks good to me. It seems to
> do what it says on the tin.
>
> Reviewed-by: Daniel Axtens 
>
> Do you know if this patch is responsible for the slightly increase in
> radix performance you observed in your cover letter? The rest of the
> series shouldn't really make things faster than the no-KUAP case...

No, this patch just results in a really small improvement overall.

I looked over this again and I think the reason for the speedup is that
my implementation of unsafe_copy_from_user() in this series calls
__copy_tofrom_user() directly, avoiding a barrier_nospec(). This speeds
up all the unsafe_copy_from_user() calls in this version.

Skipping the barrier_nospec() like that is incorrect, but Christophe
recently sent a patch which moves barrier_nospec() into the uaccess
allowance helpers. It looks like mpe has already accepted that patch:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2021-February/223959.html

That means that my implementation of unsafe_copy_from_user() is _now_
correct _and_ faster. We do not need to call barrier_nospec() since the
precondition for the 'unsafe' routine is that we called one of the
helpers to open up a uaccess window earlier.
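
Roughly (my reading of the linked patch - treat the details as an assumption), the
barrier now happens once when the window is opened rather than on every access:

static __always_inline void allow_read_from_user(const void __user *from, unsigned long size)
{
	barrier_nospec();	/* moved here, so 'unsafe' copies need no per-call barrier */
	allow_user_access(NULL, from, size, KUAP_READ);
}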

>
> Kind regards,
> Daniel
>
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 14 --
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 817b64e1e409..42fdc4a7ff72 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct task_struct 
> > *tsk, int ctx_has_vsx_re
> >  #endif /* CONFIG_VSX */
> >  }
> >  
> > +static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
> > +{
> > +   if (sizeof(sigset_t) <= 8)
> > +   return __get_user(dst->sig[0], &src->sig[0]);
> > +   else
> > +   return __copy_from_user(dst, src, sizeof(sigset_t));
> > +}
> > +
> >  /*
> >   * Set up the sigcontext for the signal frame.
> >   */
> > @@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
> > old_ctx,
> >  * We kill the task with a SIGSEGV in this situation.
> >  */
> >  
> > -   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &new_ctx->uc_sigmask))
> > do_exit(SIGSEGV);
> > +
> > set_current_blocked(&set);
> >  
> > if (!user_read_access_begin(new_ctx, ctx_size))
> > @@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (!access_ok(uc, sizeof(*uc)))
> > goto badframe;
> >  
> > -   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &uc->uc_sigmask))
> > goto badframe;
> > +
> > set_current_blocked(&set);
> >  
> >  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > -- 
> > 2.26.1



Re: [PATCH v5 05/10] powerpc/signal64: Remove TM ifdefery in middle of if/else block

2021-02-17 Thread Christopher M. Riedl
On Thu Feb 11, 2021 at 11:21 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Rework the messy ifdef breaking up the if-else for TM similar to
> > commit f1cf4f93de2f ("powerpc/signal32: Remove ifdefery in middle of 
> > if/else").
>
> I'm not sure what 'the messy ifdef' and 'the if-else for TM' is (yet):
> perhaps you could start the commit message with a tiny bit of
> background.

Yup good point - I will reword this for the next spin.

>
> > Unlike that commit for ppc32, the ifdef can't be removed entirely since
> > uc_transact in sigframe depends on CONFIG_PPC_TRANSACTIONAL_MEM.
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 16 +++-
> >  1 file changed, 7 insertions(+), 9 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index b211a8ea4f6e..8e1d804ce552 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -710,9 +710,7 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > struct pt_regs *regs = current_pt_regs();
> > struct ucontext __user *uc = (struct ucontext __user *)regs->gpr[1];
> > sigset_t set;
> > -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > unsigned long msr;
> > -#endif
> >  
> > /* Always make any pending restarted system calls return -EINTR */
> > current->restart_block.fn = do_no_restart_syscall;
> > @@ -765,7 +763,10 @@ SYSCALL_DEFINE0(rt_sigreturn)
> >  
> > if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> > goto badframe;
> > +#endif
> > +
> > if (MSR_TM_ACTIVE(msr)) {
> > +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > /* We recheckpoint on return. */
> > struct ucontext __user *uc_transact;
> >  
> > @@ -778,9 +779,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
> >&uc_transact->uc_mcontext))
> > goto badframe;
> > -   } else
> >  #endif
> > -   {
> > +   } else {
> > /*
> >  * Fall through, for non-TM restore
> >  *
>
> I think you can get rid of all the ifdefs in _this function_ by
> providing a couple of stubs:
>
> diff --git a/arch/powerpc/kernel/process.c
> b/arch/powerpc/kernel/process.c
> index a66f435dabbf..19059a4b798f 100644
> --- a/arch/powerpc/kernel/process.c
> +++ b/arch/powerpc/kernel/process.c
> @@ -1120,6 +1120,7 @@ void restore_tm_state(struct pt_regs *regs)
> #else
> #define tm_recheckpoint_new_task(new)
> #define __switch_to_tm(prev, new)
> +void tm_reclaim_current(uint8_t cause) {}
> #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
>  
> static inline void save_sprs(struct thread_struct *t)
> diff --git a/arch/powerpc/kernel/signal_64.c
> b/arch/powerpc/kernel/signal_64.c
> index 8e1d804ce552..ed58ca019ad9 100644
> --- a/arch/powerpc/kernel/signal_64.c
> +++ b/arch/powerpc/kernel/signal_64.c
> @@ -594,6 +594,13 @@ static long restore_tm_sigcontexts(struct
> task_struct *tsk,
>  
> return err;
> }
> +#else
> +static long restore_tm_sigcontexts(struct task_struct *tsk,
> + struct sigcontext __user *sc,
> + struct sigcontext __user *tm_sc)
> +{
> + return -EINVAL;
> +}
> #endif
>  
> /*
> @@ -722,7 +729,6 @@ SYSCALL_DEFINE0(rt_sigreturn)
> goto badframe;
> set_current_blocked(&set);
>  
> -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> /*
> * If there is a transactional state then throw it away.
> * The purpose of a sigreturn is to destroy all traces of the
> @@ -763,10 +769,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
>  
> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> goto badframe;
> -#endif
>  
> if (MSR_TM_ACTIVE(msr)) {
> -#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> /* We recheckpoint on return. */
> struct ucontext __user *uc_transact;
>  
> @@ -779,7 +783,6 @@ SYSCALL_DEFINE0(rt_sigreturn)
> if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
> &uc_transact->uc_mcontext))
> goto badframe;
> -#endif
> } else {
> /*
> * Fall through, for non-TM restore
>
> My only concern here was whether it was valid to access
> if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
> if CONFIG_PPC_TRANSACTIONAL_MEM was not defined, but I didn't think of
> any obvious reason why it wouldn't be...

Hmm, we don't really need it for the non-TM case and it is another extra
uaccess. I will take your suggestion to remove

Re: [PATCH v5 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2021-02-17 Thread Christopher M. Riedl
On Fri Feb 12, 2021 at 3:17 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Previously restore_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> >
> > Rewrite restore_sigcontext() to assume that a userspace read access
> > window is open. Replace all uaccess functions with their 'unsafe'
> > versions which avoid the repeated uaccess switches.
> >
>
> Much of the same comments apply here as to the last patch:
> - the commit message might be improved by adding that you are also
> changing the calling functions to open the uaccess window before
> calling into the new unsafe functions
>
> - I have checked that the safe to unsafe conversions look right.
>
> - I think you're opening too wide a window in user_read_access_begin,
> it seems to me that it could be reduced to just the
> uc_mcontext. (Again, not that it makes a difference with the current
> HW.)

Ok, I'll fix these in the next version as well.

>
> Kind regards,
> Daniel
>
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 68 -
> >  1 file changed, 41 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 4248e4489ff1..d668f8af18fe 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext 
> > __user *sc,
> >  /*
> >   * Restore the sigcontext from the signal frame.
> >   */
> > -
> > -static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int 
> > sig,
> > - struct sigcontext __user *sc)
> > +#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
> > +   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
> > +static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
> > sigset_t *set,
> > +   int sig, struct sigcontext 
> > __user *sc)
> >  {
> >  #ifdef CONFIG_ALTIVEC
> > elf_vrreg_t __user *v_regs;
> >  #endif
> > -   unsigned long err = 0;
> > unsigned long save_r13 = 0;
> > unsigned long msr;
> > struct pt_regs *regs = tsk->thread.regs;
> > @@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct 
> > *tsk, sigset_t *set, int sig,
> > save_r13 = regs->gpr[13];
> >  
> > /* copy the GPRs */
> > -   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
> > -   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
> > +   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
> > + efault_out);
> > +   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
> > /* get MSR separately, transfer the LE bit if doing signal return */
> > -   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
> > +   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
> > if (sig)
> > regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
> > -   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
> > -   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
> > -   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
> > -   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
> > -   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
> > +   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
> > +   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
> > +   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
> > +   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
> > +   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
> > /* Don't allow userspace to set SOFTE */
> > set_trap_norestart(regs);
> > -   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
> > -   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
> > -   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
> > +   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
> > +   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
> > +   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);

Re: [PATCH v5 06/10] powerpc/signal64: Replace setup_sigcontext() w/ unsafe_setup_sigcontext()

2021-02-17 Thread Christopher M. Riedl
On Thu Feb 11, 2021 at 11:41 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> > Previously setup_sigcontext() performed a costly KUAP switch on every
> > uaccess operation. These repeated uaccess switches cause a significant
> > drop in signal handling performance.
> >
> > Rewrite setup_sigcontext() to assume that a userspace write access window
> > is open. Replace all uaccess functions with their 'unsafe' versions
> > which avoid the repeated uaccess switches.
>
> Just to clarify the commit message a bit: you're also changing the
> callers of the old safe versions to first open the window, then call the
> unsafe variants, then close the window again.

Noted!

>
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 70 -
> >  1 file changed, 43 insertions(+), 27 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index 8e1d804ce552..4248e4489ff1 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -101,9 +101,13 @@ static void prepare_setup_sigcontext(struct 
> > task_struct *tsk, int ctx_has_vsx_re
> >   * Set up the sigcontext for the signal frame.
> >   */
> >  
> > -static long setup_sigcontext(struct sigcontext __user *sc,
> > -   struct task_struct *tsk, int signr, sigset_t *set,
> > -   unsigned long handler, int ctx_has_vsx_region)
> > +#define unsafe_setup_sigcontext(sc, tsk, signr, set, handler,  
> > \
> > +   ctx_has_vsx_region, e)  \
> > +   unsafe_op_wrap(__unsafe_setup_sigcontext(sc, tsk, signr, set,   \
> > +   handler, ctx_has_vsx_region), e)
> > +static long notrace __unsafe_setup_sigcontext(struct sigcontext __user *sc,
> > +   struct task_struct *tsk, int signr, 
> > sigset_t *set,
> > +   unsigned long handler, int 
> > ctx_has_vsx_region)
> >  {
> > /* When CONFIG_ALTIVEC is set, we _always_ setup v_regs even if the
> >  * process never used altivec yet (MSR_VEC is zero in pt_regs of
> > @@ -118,20 +122,19 @@ static long setup_sigcontext(struct sigcontext __user 
> > *sc,
> >  #endif
> > struct pt_regs *regs = tsk->thread.regs;
> > unsigned long msr = regs->msr;
> > -   long err = 0;
> > /* Force usr to alway see softe as 1 (interrupts enabled) */
> > unsigned long softe = 0x1;
> >  
> > BUG_ON(tsk != current);
> >  
> >  #ifdef CONFIG_ALTIVEC
> > -   err |= __put_user(v_regs, &sc->v_regs);
> > +   unsafe_put_user(v_regs, &sc->v_regs, efault_out);
> >  
> > /* save altivec registers */
> > if (tsk->thread.used_vr) {
> > /* Copy 33 vec registers (vr0..31 and vscr) to the stack */
> > -   err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
> > - 33 * sizeof(vector128));
> > +   unsafe_copy_to_user(v_regs, &tsk->thread.vr_state,
> > +   33 * sizeof(vector128), efault_out);
> > /* set MSR_VEC in the MSR value in the frame to indicate that 
> > sc->v_reg)
> >  * contains valid data.
> >  */
> > @@ -140,12 +143,12 @@ static long setup_sigcontext(struct sigcontext __user 
> > *sc,
> > /* We always copy to/from vrsave, it's 0 if we don't have or don't
> >  * use altivec.
> >  */
> > -   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
> > +   unsafe_put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33], 
> > efault_out);
> >  #else /* CONFIG_ALTIVEC */
> > -   err |= __put_user(0, &sc->v_regs);
> > +   unsafe_put_user(0, &sc->v_regs, efault_out);
> >  #endif /* CONFIG_ALTIVEC */
> > /* copy fpr regs and fpscr */
> > -   err |= copy_fpr_to_user(&sc->fp_regs, tsk);
> > +   unsafe_copy_fpr_to_user(&sc->fp_regs, tsk, efault_out);
> >  
> > /*
> >  * Clear the MSR VSX bit to indicate there is no valid state attached
> > @@ -160,24 +163,27 @@ static long setup_sigcontext(struct sigcontext __user 
> > *sc,
> >  */
> > if (tsk->thread.used_vsr && ctx_has_vsx_region) {
> > v_regs += ELF_NVRREG;
> > -   err |= copy_vsx_to_user(v_regs, tsk);
> > +   unsafe_copy_vsx_to_user(v_regs, tsk, efault_out);

Re: [PATCH v5 03/10] powerpc/signal64: Move non-inline functions out of setup_sigcontext()

2021-02-10 Thread Christopher M. Riedl
On Wed Feb 10, 2021 at 3:06 PM CST, Daniel Axtens wrote:
> "Christopher M. Riedl"  writes:
>
> > On Sun Feb 7, 2021 at 10:44 PM CST, Daniel Axtens wrote:
> >> Hi Chris,
> >>
> >> These two paragraphs are a little confusing and they seem slightly
> >> repetitive. But I get the general idea. Two specific comments below:
> >
> > Umm... yeah only one of those was supposed to be sent. I will reword
> > this for the next spin and address the comment below about how it is
> > not entirely clear that the inline functions are being moved out.
> >
> >>
> >> > There are non-inline functions which get called in setup_sigcontext() to
> >> > save register state to the thread struct. Move these functions into a
> >> > separate prepare_setup_sigcontext() function so that
> >> > setup_sigcontext() can be refactored later into an "unsafe" version
> >> > which assumes an open uaccess window. Non-inline functions should be
> >> > avoided when uaccess is open.
> >>
> >> Why do we want to avoid non-inline functions? We came up with:
> >>
> >> - we want KUAP protection for as much of the kernel as possible: each
> >> extra bit of code run with the window open is another piece of attack
> >> surface.
> >>
> >> - non-inline functions default to traceable, which means we could end
> >> up ftracing while uaccess is enabled. That's a pretty big hole in the
> >> defences that KUAP provides.
> >>
> >> I think we've also had problems with the window being opened or closed
> >> unexpectedly by various bits of code? So the less code runs in uaccess
> >> context the less likely that is to occur.
> >
> > That is my understanding as well.
> >
> >>  
> >> > The majority of setup_sigcontext() can be refactored to execute in an
> >> > "unsafe" context (uaccess window is opened) except for some non-inline
> >> > functions. Move these out into a separate prepare_setup_sigcontext()
> >> > function which must be called first and before opening up a uaccess
> >> > window. A follow-up commit converts setup_sigcontext() to be "unsafe".
> >>
> >> This was a bit confusing until we realise that you're moving the _calls_
> >> to the non-inline functions out, not the non-inline functions
> >> themselves.
> >>
> >> > Signed-off-by: Christopher M. Riedl 
> >> > ---
> >> >  arch/powerpc/kernel/signal_64.c | 32 +---
> >> >  1 file changed, 21 insertions(+), 11 deletions(-)
> >> >
> >> > diff --git a/arch/powerpc/kernel/signal_64.c 
> >> > b/arch/powerpc/kernel/signal_64.c
> >> > index f9e4a1ac440f..b211a8ea4f6e 100644
> >> > --- a/arch/powerpc/kernel/signal_64.c
> >> > +++ b/arch/powerpc/kernel/signal_64.c
> >> > @@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
> >> > sigcontext __user *sc)
> >> >  }
> >> >  #endif
> >> >  
> >> > +static void prepare_setup_sigcontext(struct task_struct *tsk, int 
> >> > ctx_has_vsx_region)
> >>
> >> ctx_has_vsx_region should probably be a bool? Although setup_sigcontext
> >> also has it as an int so I guess that's arguable, and maybe it's better
> >> to stick with this for consistency.
> >
> > I've been told not to introduce unrelated changes in my patches before
> > so chose to keep this as an int for consistency.
>
> Seems reasonable.
>
> >
> >>
> >> > +{
> >> > +#ifdef CONFIG_ALTIVEC
> >> > +/* save altivec registers */
> >> > +if (tsk->thread.used_vr)
> >> > +flush_altivec_to_thread(tsk);
> >> > +if (cpu_has_feature(CPU_FTR_ALTIVEC))
> >> > +tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
> >> > +#endif /* CONFIG_ALTIVEC */
> >> > +
> >> > +flush_fp_to_thread(tsk);
> >> > +
> >> > +#ifdef CONFIG_VSX
> >> > +if (tsk->thread.used_vsr && ctx_has_vsx_region)
> >> > +flush_vsx_to_thread(tsk);
> >> > +#endif /* CONFIG_VSX */
> >>
> >> Alternatively, given that this is the only use of ctx_has_vsx_region,
> >> mpe suggested that perhaps we could drop it entirely and always
> >> flush_vsx if used_vsr. The function is only ever called with either
> >> `current` or with ctx_has_vsx_region set to 1, so in either case I think
> >> that's safe?

Re: [PATCH v5 03/10] powerpc/signal64: Move non-inline functions out of setup_sigcontext()

2021-02-09 Thread Christopher M. Riedl
On Sun Feb 7, 2021 at 10:44 PM CST, Daniel Axtens wrote:
> Hi Chris,
>
> These two paragraphs are a little confusing and they seem slightly
> repetitive. But I get the general idea. Two specific comments below:

Umm... yeah only one of those was supposed to be sent. I will reword
this for the next spin and address the comment below about how it is
not entirely clear that the inline functions are being moved out.

>
> > There are non-inline functions which get called in setup_sigcontext() to
> > save register state to the thread struct. Move these functions into a
> > separate prepare_setup_sigcontext() function so that
> > setup_sigcontext() can be refactored later into an "unsafe" version
> > which assumes an open uaccess window. Non-inline functions should be
> > avoided when uaccess is open.
>
> Why do we want to avoid non-inline functions? We came up with:
>
> - we want KUAP protection for as much of the kernel as possible: each
> extra bit of code run with the window open is another piece of attack
> surface.
>
> - non-inline functions default to traceable, which means we could end
> up ftracing while uaccess is enabled. That's a pretty big hole in the
> defences that KUAP provides.
>
> I think we've also had problems with the window being opened or closed
> unexpectedly by various bits of code? So the less code runs in uaccess
> context the less likely that is to occur.

That is my understanding as well.

>  
> > The majority of setup_sigcontext() can be refactored to execute in an
> > "unsafe" context (uaccess window is opened) except for some non-inline
> > functions. Move these out into a separate prepare_setup_sigcontext()
> > function which must be called first and before opening up a uaccess
> > window. A follow-up commit converts setup_sigcontext() to be "unsafe".
>
> This was a bit confusing until we realise that you're moving the _calls_
> to the non-inline functions out, not the non-inline functions
> themselves.
>
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 32 +---
> >  1 file changed, 21 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c 
> > b/arch/powerpc/kernel/signal_64.c
> > index f9e4a1ac440f..b211a8ea4f6e 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
> > sigcontext __user *sc)
> >  }
> >  #endif
> >  
> > +static void prepare_setup_sigcontext(struct task_struct *tsk, int 
> > ctx_has_vsx_region)
>
> ctx_has_vsx_region should probably be a bool? Although setup_sigcontext
> also has it as an int so I guess that's arguable, and maybe it's better
> to stick with this for consistency.

I've been told not to introduce unrelated changes in my patches before
so chose to keep this as an int for consistency.

>
> > +{
> > +#ifdef CONFIG_ALTIVEC
> > +   /* save altivec registers */
> > +   if (tsk->thread.used_vr)
> > +   flush_altivec_to_thread(tsk);
> > +   if (cpu_has_feature(CPU_FTR_ALTIVEC))
> > +   tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
> > +#endif /* CONFIG_ALTIVEC */
> > +
> > +   flush_fp_to_thread(tsk);
> > +
> > +#ifdef CONFIG_VSX
> > +   if (tsk->thread.used_vsr && ctx_has_vsx_region)
> > +   flush_vsx_to_thread(tsk);
> > +#endif /* CONFIG_VSX */
>
> Alternatively, given that this is the only use of ctx_has_vsx_region,
> mpe suggested that perhaps we could drop it entirely and always
> flush_vsx if used_vsr. The function is only ever called with either
> `current` or with ctx_has_vsx_region set to 1, so in either case I think
> that's safe? I'm not sure if it would have performance implications.

I think that could work as long as we can guarantee that the context
passed to swapcontext will always be sufficiently sized if used_vsr,
which I think *has* to be the case?

>
> Should we move this and the altivec ifdef to IS_ENABLED(CONFIG_VSX) etc?
> I'm not sure if that runs into any problems with things like 'used_vsr'
> only being defined if CONFIG_VSX is set, but I thought I'd ask.

That's why I didn't use IS_ENABLED(CONFIG_...) here - all of these field
declarations (used_vr, vrsave, used_vsr) are guarded by #ifdefs :/

>
>
> > +}
> > +
> >  /*
> >   * Set up the sigcontext for the signal frame.
> >   */
> > @@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user 
> > 

Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-09 Thread Christopher M. Riedl
On Tue Feb 9, 2021 at 3:45 PM CST, Christophe Leroy wrote:
> "Christopher M. Riedl"  wrote:
>
> > Usually sigset_t is exactly 8B which is a "trivial" size and does not
> > warrant using __copy_from_user(). Use __get_user() directly in
> > anticipation of future work to remove the trivial size optimizations
> > from __copy_from_user(). Calling __get_user() also results in a small
> > boost to signal handling throughput here.
> >
> > Signed-off-by: Christopher M. Riedl 
> > ---
> >  arch/powerpc/kernel/signal_64.c | 14 --
> >  1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/signal_64.c  
> > b/arch/powerpc/kernel/signal_64.c
> > index 817b64e1e409..42fdc4a7ff72 100644
> > --- a/arch/powerpc/kernel/signal_64.c
> > +++ b/arch/powerpc/kernel/signal_64.c
> > @@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct  
> > task_struct *tsk, int ctx_has_vsx_re
> >  #endif /* CONFIG_VSX */
> >  }
> >
> > +static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
>
> Should be called __get_user_sigset() as it is a helper for __get_user()

Ok makes sense.

>
> > +{
> > +   if (sizeof(sigset_t) <= 8)
>
> We should always use __get_user(), see below.
>
> > +   return __get_user(dst->sig[0], &src->sig[0]);
>
> I think the above will not work on ppc32, it will only copy 4 bytes.
> You must cast the source to u64*

Well this is signal_64.c :) Looks like ppc32 needs the same thing so
I'll just move this into signal.h and use it for both. 

The only exception would be the COMPAT case in signal_32.c which ends up
calling the common get_compat_sigset(). Updating that is probably
outside the scope of this series.

>
> > +   else
> > +   return __copy_from_user(dst, src, sizeof(sigset_t));
>
> I see no point in keeping this alternative. Today sigset_ t is fixed.
> If you fear one day someone might change it to something different
> than a u64, just add a BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));

Ah yes that is much better - thanks for the suggestion.
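
Something like the below is what I have in mind for the shared helper in
signal.h - just a rough sketch for now, folding in the u64 cast so the read
stays correct on ppc32 along with your BUILD_BUG_ON suggestion:

static inline int __get_user_sigset(sigset_t *dst, const sigset_t __user *src)
{
	/* sig[0] is only 4B on ppc32, so read the full 8B through a u64 cast */
	BUILD_BUG_ON(sizeof(sigset_t) != sizeof(u64));

	return __get_user(*(u64 *)&dst->sig[0], (u64 __user *)&src->sig[0]);
}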

>
> > +}
> > +
> >  /*
> >   * Set up the sigcontext for the signal frame.
> >   */
> > @@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext  
> > __user *, old_ctx,
> >  * We kill the task with a SIGSEGV in this situation.
> >  */
> >
> > -   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &new_ctx->uc_sigmask))
> > do_exit(SIGSEGV);
> > +
>
> This white space is not part of the change, keep patches to the
> minimum, avoid cosmetic

Just a (bad?) habit on my part that I missed - I'll remove this one and
the one further below.

>
> > set_current_blocked(&set);
> >
> > if (!user_read_access_begin(new_ctx, ctx_size))
> > @@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
> > if (!access_ok(uc, sizeof(*uc)))
> > goto badframe;
> >
> > -   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
> > +   if (get_user_sigset(&set, &uc->uc_sigmask))
> > goto badframe;
> > +
>
> Same
>
> > set_current_blocked(&set);
> >
> >  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
> > --
> > 2.26.1



Re: [PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2021-02-08 Thread Christopher M. Riedl
On Sun Feb 7, 2021 at 4:12 AM CST, Christophe Leroy wrote:
>
>
> On 06/02/2021 at 18:39, Christopher M. Riedl wrote:
> > On Sat Feb 6, 2021 at 10:32 AM CST, Christophe Leroy wrote:
> >>
> >>
> >> On 20/10/2020 at 04:01, Christopher M. Riedl wrote:
> >>> On Fri Oct 16, 2020 at 10:48 AM CDT, Christophe Leroy wrote:
> >>>>
> >>>>
> >>>> On 15/10/2020 at 17:01, Christopher M. Riedl wrote:
> >>>>> Reuse the "safe" implementation from signal.c except for calling
> >>>>> unsafe_copy_from_user() to copy into a local buffer. Unlike the
> >>>>> unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
> >>>>> cannot use unsafe_get_user() directly to bypass the local buffer since
> >>>>> doing so significantly reduces signal handling performance.
> >>>>
> >>>> Why can't the functions use unsafe_get_user(), why does it significantly
> >>>> reduces signal handling
> >>>> performance ? How much significant ? I would expect that not going
> >>>> through an intermediate memory
> >>>> area would be more efficient
> >>>>
> >>>
> >>> Here is a comparison, 'unsafe-signal64-regs' avoids the intermediate 
> >>> buffer:
> >>>
> >>>   |  | hash   | radix  |
> >>>   |  | -- | -- |
> >>>   | linuxppc/next| 289014 | 158408 |
> >>>   | unsafe-signal64  | 298506 | 253053 |
> >>>   | unsafe-signal64-regs | 254898 | 220831 |
> >>>
> >>> I have not figured out the 'why' yet. As you mentioned in your series,
> >>> technically calling __copy_tofrom_user() is overkill for these
> >>> operations. The only obvious difference between unsafe_put_user() and
> >>> unsafe_get_user() is that we don't have asm-goto for the 'get' variant.
> >>> Instead we wrap with unsafe_op_wrap() which inserts a conditional and
> >>> then goto to the label.
> >>>
> >>> Implementations:
> >>>
> >>>   #define unsafe_copy_fpr_from_user(task, from, label)   do {\
> >>>  struct task_struct *__t = task; \
> >>>  u64 __user *buf = (u64 __user *)from;   \
> >>>  int i;  \
> >>>  \
> >>>  for (i = 0; i < ELF_NFPREG - 1; i++)\
> >>>  unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
> >>>  unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
> >>>   } while (0)
> >>>
> >>>   #define unsafe_copy_vsx_from_user(task, from, label)   do {\
> >>>  struct task_struct *__t = task; \
> >>>  u64 __user *buf = (u64 __user *)from;   \
> >>>  int i;  \
> >>>  \
> >>>  for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> >>>  
> >>> unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
> >>>  &buf[i], label);\
> >>>   } while (0)
> >>>
> >>
> >> Do you have CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled in
> >> your config ?
> > 
> > I don't have these set in my config (ppc64le_defconfig). I think I
> > figured this out - the reason for the lower signal throughput is the
> > barrier_nospec() in __get_user_nocheck(). When looping we incur that
> > cost on every iteration. Commenting it out results in signal performance
> > of ~316K w/ hash on the unsafe-signal64-regs branch. Obviously the
> > barrier is there for a reason but it is quite costly.
>
> Interesting.
>
> Can you try with the patch I just sent out
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/c72f014730823b413528e90ab6c4d3bcb79f8497.1612692067.git.christophe.le...@csgroup.eu/

Yeah that patch solves the problem. Using unsafe_get_user() in a loop is
actually faster on radix than using the intermediary buffer step. A
summary of results below (unsafe-signal64-v6 uses unsafe_get_user() and
avoids the local buffer):

|  | hash   | radix  |
|  | -- | -- |
| unsafe-signal64-v5   | 194533 | 230089 |
| unsafe-signal64-v6   | 176739 | 202840 |
| unsafe-signal64-v5+barrier patch | 203037 | 234936 |
| unsafe-signal64-v6+barrier patch | 205484 | 241030 |

I am still expecting some comments/feedback on my v5 before sending out
v6. Should I include your patch in my series as well?

>
> Thanks
> Christophe



Re: [PATCH 2/8] powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()

2021-02-06 Thread Christopher M. Riedl
On Sat Feb 6, 2021 at 10:32 AM CST, Christophe Leroy wrote:
>
>
> On 20/10/2020 at 04:01, Christopher M. Riedl wrote:
> > On Fri Oct 16, 2020 at 10:48 AM CDT, Christophe Leroy wrote:
> >>
> >>
> >> On 15/10/2020 at 17:01, Christopher M. Riedl wrote:
> >>> Reuse the "safe" implementation from signal.c except for calling
> >>> unsafe_copy_from_user() to copy into a local buffer. Unlike the
> >>> unsafe_copy_{vsx,fpr}_to_user() functions the "copy from" functions
> >>> cannot use unsafe_get_user() directly to bypass the local buffer since
> >>> doing so significantly reduces signal handling performance.
> >>
> >> Why can't the functions use unsafe_get_user(), why does it significantly
> >> reduces signal handling
> >> performance ? How much significant ? I would expect that not going
> >> through an intermediate memory
> >> area would be more efficient
> >>
> > 
> > Here is a comparison, 'unsafe-signal64-regs' avoids the intermediate buffer:
> > 
> > |  | hash   | radix  |
> > |  | -- | -- |
> > | linuxppc/next| 289014 | 158408 |
> > | unsafe-signal64  | 298506 | 253053 |
> > | unsafe-signal64-regs | 254898 | 220831 |
> > 
> > I have not figured out the 'why' yet. As you mentioned in your series,
> > technically calling __copy_tofrom_user() is overkill for these
> > operations. The only obvious difference between unsafe_put_user() and
> > unsafe_get_user() is that we don't have asm-goto for the 'get' variant.
> > Instead we wrap with unsafe_op_wrap() which inserts a conditional and
> > then goto to the label.
> > 
> > Implementations:
> > 
> > #define unsafe_copy_fpr_from_user(task, from, label)   do {\
> >struct task_struct *__t = task; \
> >u64 __user *buf = (u64 __user *)from;   \
> >int i;  \
> >\
> >for (i = 0; i < ELF_NFPREG - 1; i++)\
> >unsafe_get_user(__t->thread.TS_FPR(i), &buf[i], label); \
> >unsafe_get_user(__t->thread.fp_state.fpscr, &buf[i], label);\
> > } while (0)
> > 
> > #define unsafe_copy_vsx_from_user(task, from, label)   do {\
> >struct task_struct *__t = task; \
> >u64 __user *buf = (u64 __user *)from;   \
> >int i;  \
> >\
> >for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> >
> > unsafe_get_user(__t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET], \
> >&buf[i], label);\
> > } while (0)
> > 
>
> Do you have CONFIG_PROVE_LOCKING or CONFIG_DEBUG_ATOMIC_SLEEP enabled in
> your config ?

I don't have these set in my config (ppc64le_defconfig). I think I
figured this out - the reason for the lower signal throughput is the
barrier_nospec() in __get_user_nocheck(). When looping we incur that
cost on every iteration. Commenting it out results in signal performance
of ~316K w/ hash on the unsafe-signal64-regs branch. Obviously the
barrier is there for a reason but it is quite costly.

This also explains why the copy_{fpr,vsx}_to_user() direction does not
suffer from the slowdown because there is no need for barrier_nospec().
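
For reference, the unsafe_op_wrap() mentioned above boils down to roughly
this (paraphrasing the generic uaccess helper, nothing this series touches):

#define unsafe_op_wrap(op, label)		\
do {						\
	if (unlikely(op))			\
		goto label;			\
} while (0)

So a per-register loop through unsafe_get_user() pays that conditional plus
the barrier_nospec() in __get_user_nocheck() on every iteration, whereas a
single __copy_tofrom_user() of the whole block pays the barrier only once.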
>
> If yes, could you try together with the patch from Alexey
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20210204121612.32721-1-...@ozlabs.ru/
> ?
>
> Thanks
> Christophe



[PATCH v2] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-05 Thread Christopher M. Riedl
The idle entry/exit code saves/restores GPRs in the stack "red zone"
(Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
used for the first GPR is incorrect and overwrites the back chain - the
Protected Zone actually starts below the current SP. In practice this is
probably not an issue, but it's still incorrect so fix it.

Also expand the comments to explain why using the stack "red zone"
instead of creating a new stackframe is appropriate here.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/idle_book3s.S | 138 --
 1 file changed, 73 insertions(+), 65 deletions(-)

diff --git a/arch/powerpc/kernel/idle_book3s.S 
b/arch/powerpc/kernel/idle_book3s.S
index 22f249b6f58d..f9e6d83e6720 100644
--- a/arch/powerpc/kernel/idle_book3s.S
+++ b/arch/powerpc/kernel/idle_book3s.S
@@ -52,28 +52,32 @@ _GLOBAL(isa300_idle_stop_mayloss)
std r1,PACAR1(r13)
mflrr4
mfcrr5
-   /* use stack red zone rather than a new frame for saving regs */
-   std r2,-8*0(r1)
-   std r14,-8*1(r1)
-   std r15,-8*2(r1)
-   std r16,-8*3(r1)
-   std r17,-8*4(r1)
-   std r18,-8*5(r1)
-   std r19,-8*6(r1)
-   std r20,-8*7(r1)
-   std r21,-8*8(r1)
-   std r22,-8*9(r1)
-   std r23,-8*10(r1)
-   std r24,-8*11(r1)
-   std r25,-8*12(r1)
-   std r26,-8*13(r1)
-   std r27,-8*14(r1)
-   std r28,-8*15(r1)
-   std r29,-8*16(r1)
-   std r30,-8*17(r1)
-   std r31,-8*18(r1)
-   std r4,-8*19(r1)
-   std r5,-8*20(r1)
+   /*
+* Use the stack red zone rather than a new frame for saving regs since
+* in the case of no GPR loss the wakeup code branches directly back to
+* the caller without deallocating the stack frame first.
+*/
+   std r2,-8*1(r1)
+   std r14,-8*2(r1)
+   std r15,-8*3(r1)
+   std r16,-8*4(r1)
+   std r17,-8*5(r1)
+   std r18,-8*6(r1)
+   std r19,-8*7(r1)
+   std r20,-8*8(r1)
+   std r21,-8*9(r1)
+   std r22,-8*10(r1)
+   std r23,-8*11(r1)
+   std r24,-8*12(r1)
+   std r25,-8*13(r1)
+   std r26,-8*14(r1)
+   std r27,-8*15(r1)
+   std r28,-8*16(r1)
+   std r29,-8*17(r1)
+   std r30,-8*18(r1)
+   std r31,-8*19(r1)
+   std r4,-8*20(r1)
+   std r5,-8*21(r1)
/* 168 bytes */
PPC_STOP
b   .   /* catch bugs */
@@ -89,8 +93,8 @@ _GLOBAL(isa300_idle_stop_mayloss)
  */
 _GLOBAL(idle_return_gpr_loss)
ld  r1,PACAR1(r13)
-   ld  r4,-8*19(r1)
-   ld  r5,-8*20(r1)
+   ld  r4,-8*20(r1)
+   ld  r5,-8*21(r1)
mtlrr4
mtcrr5
/*
@@ -98,25 +102,25 @@ _GLOBAL(idle_return_gpr_loss)
 * from PACATOC. This could be avoided for that less common case
 * if KVM saved its r2.
 */
-   ld  r2,-8*0(r1)
-   ld  r14,-8*1(r1)
-   ld  r15,-8*2(r1)
-   ld  r16,-8*3(r1)
-   ld  r17,-8*4(r1)
-   ld  r18,-8*5(r1)
-   ld  r19,-8*6(r1)
-   ld  r20,-8*7(r1)
-   ld  r21,-8*8(r1)
-   ld  r22,-8*9(r1)
-   ld  r23,-8*10(r1)
-   ld  r24,-8*11(r1)
-   ld  r25,-8*12(r1)
-   ld  r26,-8*13(r1)
-   ld  r27,-8*14(r1)
-   ld  r28,-8*15(r1)
-   ld  r29,-8*16(r1)
-   ld  r30,-8*17(r1)
-   ld  r31,-8*18(r1)
+   ld  r2,-8*1(r1)
+   ld  r14,-8*2(r1)
+   ld  r15,-8*3(r1)
+   ld  r16,-8*4(r1)
+   ld  r17,-8*5(r1)
+   ld  r18,-8*6(r1)
+   ld  r19,-8*7(r1)
+   ld  r20,-8*8(r1)
+   ld  r21,-8*9(r1)
+   ld  r22,-8*10(r1)
+   ld  r23,-8*11(r1)
+   ld  r24,-8*12(r1)
+   ld  r25,-8*13(r1)
+   ld  r26,-8*14(r1)
+   ld  r27,-8*15(r1)
+   ld  r28,-8*16(r1)
+   ld  r29,-8*17(r1)
+   ld  r30,-8*18(r1)
+   ld  r31,-8*19(r1)
blr
 
 /*
@@ -154,28 +158,32 @@ _GLOBAL(isa206_idle_insn_mayloss)
std r1,PACAR1(r13)
mflrr4
mfcrr5
-   /* use stack red zone rather than a new frame for saving regs */
-   std r2,-8*0(r1)
-   std r14,-8*1(r1)
-   std r15,-8*2(r1)
-   std r16,-8*3(r1)
-   std r17,-8*4(r1)
-   std r18,-8*5(r1)
-   std r19,-8*6(r1)
-   std r20,-8*7(r1)
-   std r21,-8*8(r1)
-   std r22,-8*9(r1)
-   std r23,-8*10(r1)
-   std r24,-8*11(r1)
-   std r25,-8*12(r1)
-   std r26,-8*13(r1)
-   std r27,-8*14(r1)
-   std r28,-8*15(r1)
-   std r29,-8*16(r1)
-   std r30,-8*17(r1)
-   std r31,-8*18(r1)
-   std r4,-8*19(r1)

Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-04 Thread Christopher M. Riedl
On Wed Feb 3, 2021 at 12:43 PM CST, Christopher M. Riedl wrote:
> Usually sigset_t is exactly 8B which is a "trivial" size and does not
> warrant using __copy_from_user(). Use __get_user() directly in
> anticipation of future work to remove the trivial size optimizations
> from __copy_from_user(). Calling __get_user() also results in a small
> boost to signal handling throughput here.
>
> Signed-off-by: Christopher M. Riedl 

This patch triggered sparse warnings about 'different address spaces'.
This minor fixup cleans that up:

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 42fdc4a7ff72..1dfda6403e14 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,7 +97,7 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, 
int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }

-static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+static inline int get_user_sigset(sigset_t *dst, const sigset_t __user *src)
 {
if (sizeof(sigset_t) <= 8)
return __get_user(dst->sig[0], &src->sig[0]);


Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-03 Thread Christopher M. Riedl
On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
> > "Christopher M. Riedl"  writes:
> >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
> >> used for the first GPR is incorrect and overwrites the back chain - the
> >> Protected Zone actually starts below the current SP. In practice this is
> >> probably not an issue, but it's still incorrect so fix it.
> > 
> > Nice catch.
> > 
> > Corrupting the back chain means you can't backtrace from there, which
> > could be confusing for debugging one day.
>
> Yeah, we seem to have got away without noticing because the CPU will
> wake up and return out of here before it tries to unwind the stack,
> but if you tried to walk it by hand if the CPU got stuck in idle or
> something, then we'd get confused.
>
> > It does make me wonder why we don't just create a stack frame and use
> > the normal macros? It would use a bit more stack space, but we shouldn't
> > be short of stack space when going idle.
> > 
> > Nick, was there a particular reason for using the red zone?
>
> I don't recall a particular reason, I think a normal stack frame is
> probably a good idea.

I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
stack frame :)

I admit I was a bit confused when I saw the similar but much smaller
STACK_FRAME_OVERHEAD, which is also used in _some_ cases to save/restore
a few registers.

>
> Thanks,
> Nick



[PATCH v5 00/10] Improve signal performance on PPC64 with KUAP

2021-02-03 Thread Christopher M. Riedl
As reported by Anton, there is a large penalty to signal handling
performance on radix systems using KUAP. The signal handling code
performs many user access operations, each of which needs to switch the
KUAP permissions bit to open and then close user access. This involves a
costly 'mtspr' operation [0].

There is existing work done on x86 and by Christophe Leroy for PPC32 to
instead open up user access in "blocks" using user_*_access_{begin,end}.
We can do the same in PPC64 to bring performance back up on KUAP-enabled
radix and now also hash MMU systems [1].

Hash MMU KUAP support along with uaccess flush has landed in linuxppc/next
since the last revision. This series also provides a large benefit on hash
with KUAP. However, in the hash implementation of KUAP the user AMR is
always restored during system_call_exception() which cannot be avoided.
Fewer user access switches naturally also result in less uaccess flushing.

The first two patches add some needed 'unsafe' versions of copy-from
functions. While these do not make use of asm-goto they still allow for
avoiding the repeated uaccess switches.

The third patch moves functions called by setup_sigcontext() into a new
prepare_setup_sigcontext() to simplify converting setup_sigcontext()
into an 'unsafe' version which assumes an open uaccess window later.

The fourth and fifths patches clean-up some of the Transactional Memory
ifdef stuff to simplify using uaccess blocks later.

The next two patches rewrite some of the signal64 helper functions to
be 'unsafe'. Finally, the last three patches update the main signal
handling functions to make use of the new 'unsafe' helpers and eliminate
some additional uaccess switching.
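
For illustration, the general shape of the conversion is roughly the sketch
below (a simplified fragment, not the literal patches - real error handling
and the full register set are in the diffs):

	/* Before: each helper flips the KUAP state around its own access */
	err |= __put_user(regs->nip, &sc->gp_regs[PT_NIP]);
	err |= __put_user(regs->link, &sc->gp_regs[PT_LNK]);

	/* After: open the user access window once for the whole block */
	if (!user_write_access_begin(sc, sizeof(*sc)))
		return -EFAULT;
	unsafe_put_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
	unsafe_put_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
	user_write_access_end();
	return 0;

efault_out:
	user_write_access_end();
	return -EFAULT;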

I used the will-it-scale signal1 benchmark to measure and compare
performance [2]. The below results are from running a minimal
kernel+initramfs QEMU/KVM guest on a POWER9 Blackbird:

signal1_threads -t1 -s10

|  | hash   | radix  |
|  | -- | -- |
| linuxppc/next| 117667 | 135752 |
| linuxppc/next w/o KUAP+KUEP  | 225273 | 227567 |
| unsafe-signal64  | 193402 | 230983 |

[0]: https://github.com/linuxppc/issues/issues/277
[1]: https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=196278
[2]: https://github.com/antonblanchard/will-it-scale/blob/master/tests/signal1.c

v5: * Use sizeof(buf) in copy_{vsx,fpr}_from_user() (Thanks David Laight)
* Rebase on latest linuxppc/next

v4: * Fix issues identified by Christophe Leroy (Thanks for review)
* Use __get_user() directly to copy the 8B sigset_t

v3: * Rebase on latest linuxppc/next
* Reword confusing commit messages
* Add missing comma in macro in signal.h which broke compiles without
  CONFIG_ALTIVEC
* Validate hash KUAP signal performance improvements

v2: * Rebase on latest linuxppc/next + Christophe Leroy's PPC32
  signal series
* Simplify/remove TM ifdefery similar to PPC32 series and clean
  up the uaccess begin/end calls
* Isolate non-inline functions so they are not called when
  uaccess window is open

Christopher M. Riedl (8):
  powerpc/uaccess: Add unsafe_copy_from_user
  powerpc/signal: Add unsafe_copy_{vsx,fpr}_from_user()
  powerpc/signal64: Move non-inline functions out of setup_sigcontext()
  powerpc: Reference param in MSR_TM_ACTIVE() macro
  powerpc/signal64: Remove TM ifdefery in middle of if/else block
  powerpc/signal64: Replace setup_sigcontext() w/
unsafe_setup_sigcontext()
  powerpc/signal64: Replace restore_sigcontext() w/
unsafe_restore_sigcontext()
  powerpc/signal64: Use __get_user() to copy sigset_t

Daniel Axtens (2):
  powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess
switches
  powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

 arch/powerpc/include/asm/reg.h |   2 +-
 arch/powerpc/include/asm/uaccess.h |   3 +
 arch/powerpc/kernel/signal.h   |  30 
 arch/powerpc/kernel/signal_64.c| 251 ++---
 4 files changed, 193 insertions(+), 93 deletions(-)

-- 
2.26.1



[PATCH v5 07/10] powerpc/signal64: Replace restore_sigcontext() w/ unsafe_restore_sigcontext()

2021-02-03 Thread Christopher M. Riedl
Previously restore_sigcontext() performed a costly KUAP switch on every
uaccess operation. These repeated uaccess switches cause a significant
drop in signal handling performance.

Rewrite restore_sigcontext() to assume that a userspace read access
window is open. Replace all uaccess functions with their 'unsafe'
versions which avoid the repeated uaccess switches.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 68 -
 1 file changed, 41 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 4248e4489ff1..d668f8af18fe 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -326,14 +326,14 @@ static long setup_tm_sigcontexts(struct sigcontext __user 
*sc,
 /*
  * Restore the sigcontext from the signal frame.
  */
-
-static long restore_sigcontext(struct task_struct *tsk, sigset_t *set, int sig,
- struct sigcontext __user *sc)
+#define unsafe_restore_sigcontext(tsk, set, sig, sc, e) \
+   unsafe_op_wrap(__unsafe_restore_sigcontext(tsk, set, sig, sc), e)
+static long notrace __unsafe_restore_sigcontext(struct task_struct *tsk, 
sigset_t *set,
+   int sig, struct sigcontext 
__user *sc)
 {
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs;
 #endif
-   unsigned long err = 0;
unsigned long save_r13 = 0;
unsigned long msr;
struct pt_regs *regs = tsk->thread.regs;
@@ -348,27 +348,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
save_r13 = regs->gpr[13];
 
/* copy the GPRs */
-   err |= __copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr));
-   err |= __get_user(regs->nip, &sc->gp_regs[PT_NIP]);
+   unsafe_copy_from_user(regs->gpr, sc->gp_regs, sizeof(regs->gpr),
+ efault_out);
+   unsafe_get_user(regs->nip, &sc->gp_regs[PT_NIP], efault_out);
/* get MSR separately, transfer the LE bit if doing signal return */
-   err |= __get_user(msr, &sc->gp_regs[PT_MSR]);
+   unsafe_get_user(msr, &sc->gp_regs[PT_MSR], efault_out);
if (sig)
regs->msr = (regs->msr & ~MSR_LE) | (msr & MSR_LE);
-   err |= __get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3]);
-   err |= __get_user(regs->ctr, &sc->gp_regs[PT_CTR]);
-   err |= __get_user(regs->link, &sc->gp_regs[PT_LNK]);
-   err |= __get_user(regs->xer, &sc->gp_regs[PT_XER]);
-   err |= __get_user(regs->ccr, &sc->gp_regs[PT_CCR]);
+   unsafe_get_user(regs->orig_gpr3, &sc->gp_regs[PT_ORIG_R3], efault_out);
+   unsafe_get_user(regs->ctr, &sc->gp_regs[PT_CTR], efault_out);
+   unsafe_get_user(regs->link, &sc->gp_regs[PT_LNK], efault_out);
+   unsafe_get_user(regs->xer, &sc->gp_regs[PT_XER], efault_out);
+   unsafe_get_user(regs->ccr, &sc->gp_regs[PT_CCR], efault_out);
/* Don't allow userspace to set SOFTE */
set_trap_norestart(regs);
-   err |= __get_user(regs->dar, &sc->gp_regs[PT_DAR]);
-   err |= __get_user(regs->dsisr, &sc->gp_regs[PT_DSISR]);
-   err |= __get_user(regs->result, &sc->gp_regs[PT_RESULT]);
+   unsafe_get_user(regs->dar, &sc->gp_regs[PT_DAR], efault_out);
+   unsafe_get_user(regs->dsisr, &sc->gp_regs[PT_DSISR], efault_out);
+   unsafe_get_user(regs->result, &sc->gp_regs[PT_RESULT], efault_out);
 
if (!sig)
regs->gpr[13] = save_r13;
if (set != NULL)
-   err |=  __get_user(set->sig[0], &sc->oldmask);
+   unsafe_get_user(set->sig[0], &sc->oldmask, efault_out);
 
/*
 * Force reload of FP/VEC.
@@ -378,29 +379,28 @@ static long restore_sigcontext(struct task_struct *tsk, 
sigset_t *set, int sig,
regs->msr &= ~(MSR_FP | MSR_FE0 | MSR_FE1 | MSR_VEC | MSR_VSX);
 
 #ifdef CONFIG_ALTIVEC
-   err |= __get_user(v_regs, &sc->v_regs);
-   if (err)
-   return err;
+   unsafe_get_user(v_regs, &sc->v_regs, efault_out);
if (v_regs && !access_ok(v_regs, 34 * sizeof(vector128)))
return -EFAULT;
/* Copy 33 vec registers (vr0..31 and vscr) from the stack */
if (v_regs != NULL && (msr & MSR_VEC) != 0) {
-   err |= __copy_from_user(&tsk->thread.vr_state, v_regs,
-   33 * sizeof(vector128));
+   unsafe_copy_from_user(&tsk->thread.vr_state, v_regs,
+ 33 * sizeof(vector128), efault_out);
tsk->thread.used_vr = true;
} else if (

[PATCH v5 04/10] powerpc: Reference param in MSR_TM_ACTIVE() macro

2021-02-03 Thread Christopher M. Riedl
Unlike the other MSR_TM_* macros, MSR_TM_ACTIVE does not reference or
use its parameter unless CONFIG_PPC_TRANSACTIONAL_MEM is defined. This
causes an 'unused variable' compile warning unless the variable is also
guarded with CONFIG_PPC_TRANSACTIONAL_MEM.

Reference but do nothing with the argument in the macro to avoid a
potential compile warning.
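
For illustration only (not kernel code), the warning the change avoids looks
like this:

	#define MSR_TM_ACTIVE_OLD(x)	0		/* current !TM definition */
	#define MSR_TM_ACTIVE_NEW(x)	((void)(x), 0)	/* proposed definition */

	static int example(unsigned long regs_msr)
	{
		unsigned long msr = regs_msr;	/* "set but not used" warning
						 * with the _OLD form when TM
						 * is compiled out */
		return MSR_TM_ACTIVE_NEW(msr);	/* still 0, but references msr */
	}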

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/include/asm/reg.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index e40a921d78f9..c5a3e856191c 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -124,7 +124,7 @@
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 #define MSR_TM_ACTIVE(x) (((x) & MSR_TS_MASK) != 0) /* Transaction active? */
 #else
-#define MSR_TM_ACTIVE(x) 0
+#define MSR_TM_ACTIVE(x) ((void)(x), 0)
 #endif
 
 #if defined(CONFIG_PPC_BOOK3S_64)
-- 
2.26.1



[PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-03 Thread Christopher M. Riedl
Usually sigset_t is exactly 8B which is a "trivial" size and does not
warrant using __copy_from_user(). Use __get_user() directly in
anticipation of future work to remove the trivial size optimizations
from __copy_from_user(). Calling __get_user() also results in a small
boost to signal handling throughput here.

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 14 --
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 817b64e1e409..42fdc4a7ff72 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,6 +97,14 @@ static void prepare_setup_sigcontext(struct task_struct 
*tsk, int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }
 
+static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+{
+   if (sizeof(sigset_t) <= 8)
+   return __get_user(dst->sig[0], &src->sig[0]);
+   else
+   return __copy_from_user(dst, src, sizeof(sigset_t));
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -701,8 +709,9 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
 * We kill the task with a SIGSEGV in this situation.
 */
 
-   if (__copy_from_user(&set, &new_ctx->uc_sigmask, sizeof(set)))
+   if (get_user_sigset(&set, &new_ctx->uc_sigmask))
do_exit(SIGSEGV);
+
set_current_blocked(&set);
 
if (!user_read_access_begin(new_ctx, ctx_size))
@@ -740,8 +749,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
if (!access_ok(uc, sizeof(*uc)))
goto badframe;
 
-   if (__copy_from_user(&set, &uc->uc_sigmask, sizeof(set)))
+   if (get_user_sigset(&set, &uc->uc_sigmask))
goto badframe;
+
set_current_blocked(&set);
 
 #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-- 
2.26.1



[PATCH v5 03/10] powerpc/signal64: Move non-inline functions out of setup_sigcontext()

2021-02-03 Thread Christopher M. Riedl
There are non-inline functions which get called in setup_sigcontext() to
save register state to the thread struct. Move these functions into a
separate prepare_setup_sigcontext() function so that
setup_sigcontext() can be refactored later into an "unsafe" version
which assumes an open uaccess window. Non-inline functions should be
avoided when uaccess is open.

The majority of setup_sigcontext() can be refactored to execute in an
"unsafe" context (uaccess window is opened) except for some non-inline
functions. Move these out into a separate prepare_setup_sigcontext()
function which must be called first and before opening up a uaccess
window. A follow-up commit converts setup_sigcontext() to be "unsafe".

Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 32 +---
 1 file changed, 21 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index f9e4a1ac440f..b211a8ea4f6e 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -79,6 +79,24 @@ static elf_vrreg_t __user *sigcontext_vmx_regs(struct 
sigcontext __user *sc)
 }
 #endif
 
+static void prepare_setup_sigcontext(struct task_struct *tsk, int 
ctx_has_vsx_region)
+{
+#ifdef CONFIG_ALTIVEC
+   /* save altivec registers */
+   if (tsk->thread.used_vr)
+   flush_altivec_to_thread(tsk);
+   if (cpu_has_feature(CPU_FTR_ALTIVEC))
+   tsk->thread.vrsave = mfspr(SPRN_VRSAVE);
+#endif /* CONFIG_ALTIVEC */
+
+   flush_fp_to_thread(tsk);
+
+#ifdef CONFIG_VSX
+   if (tsk->thread.used_vsr && ctx_has_vsx_region)
+   flush_vsx_to_thread(tsk);
+#endif /* CONFIG_VSX */
+}
+
 /*
  * Set up the sigcontext for the signal frame.
  */
@@ -97,7 +115,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 */
 #ifdef CONFIG_ALTIVEC
elf_vrreg_t __user *v_regs = sigcontext_vmx_regs(sc);
-   unsigned long vrsave;
 #endif
struct pt_regs *regs = tsk->thread.regs;
unsigned long msr = regs->msr;
@@ -112,7 +129,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 
/* save altivec registers */
if (tsk->thread.used_vr) {
-   flush_altivec_to_thread(tsk);
/* Copy 33 vec registers (vr0..31 and vscr) to the stack */
err |= __copy_to_user(v_regs, &tsk->thread.vr_state,
  33 * sizeof(vector128));
@@ -124,17 +140,10 @@ static long setup_sigcontext(struct sigcontext __user *sc,
/* We always copy to/from vrsave, it's 0 if we don't have or don't
 * use altivec.
 */
-   vrsave = 0;
-   if (cpu_has_feature(CPU_FTR_ALTIVEC)) {
-   vrsave = mfspr(SPRN_VRSAVE);
-   tsk->thread.vrsave = vrsave;
-   }
-
-   err |= __put_user(vrsave, (u32 __user *)&v_regs[33]);
+   err |= __put_user(tsk->thread.vrsave, (u32 __user *)&v_regs[33]);
 #else /* CONFIG_ALTIVEC */
err |= __put_user(0, &sc->v_regs);
 #endif /* CONFIG_ALTIVEC */
-   flush_fp_to_thread(tsk);
/* copy fpr regs and fpscr */
err |= copy_fpr_to_user(&sc->fp_regs, tsk);
 
@@ -150,7 +159,6 @@ static long setup_sigcontext(struct sigcontext __user *sc,
 * VMX data.
 */
if (tsk->thread.used_vsr && ctx_has_vsx_region) {
-   flush_vsx_to_thread(tsk);
v_regs += ELF_NVRREG;
err |= copy_vsx_to_user(v_regs, tsk);
/* set MSR_VSX in the MSR value in the frame to
@@ -655,6 +663,7 @@ SYSCALL_DEFINE3(swapcontext, struct ucontext __user *, 
old_ctx,
ctx_has_vsx_region = 1;
 
if (old_ctx != NULL) {
+   prepare_setup_sigcontext(current, ctx_has_vsx_region);
if (!access_ok(old_ctx, ctx_size)
|| setup_sigcontext(&old_ctx->uc_mcontext, current, 0, 
NULL, 0,
ctx_has_vsx_region)
@@ -842,6 +851,7 @@ int handle_rt_signal64(struct ksignal *ksig, sigset_t *set,
 #endif
{
err |= __put_user(0, &frame->uc.uc_link);
+   prepare_setup_sigcontext(tsk, 1);
err |= setup_sigcontext(&frame->uc.uc_mcontext, tsk, ksig->sig,
NULL, (unsigned 
long)ksig->ka.sa.sa_handler,
1);
-- 
2.26.1



[PATCH v5 09/10] powerpc/signal64: Rewrite rt_sigreturn() to minimise uaccess switches

2021-02-03 Thread Christopher M. Riedl
From: Daniel Axtens 

Add uaccess blocks and use the 'unsafe' versions of functions doing user
access where possible to reduce the number of times uaccess has to be
opened/closed.

Signed-off-by: Daniel Axtens 
Co-developed-by: Christopher M. Riedl 
Signed-off-by: Christopher M. Riedl 
---
 arch/powerpc/kernel/signal_64.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index a471e97589a8..817b64e1e409 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -782,9 +782,13 @@ SYSCALL_DEFINE0(rt_sigreturn)
 * restore_tm_sigcontexts.
 */
regs->msr &= ~MSR_TS_MASK;
+#endif
 
-   if (__get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR]))
+   if (!user_read_access_begin(uc, sizeof(*uc)))
goto badframe;
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   unsafe_get_user(msr, &uc->uc_mcontext.gp_regs[PT_MSR], badframe_block);
 #endif
 
if (MSR_TM_ACTIVE(msr)) {
@@ -794,10 +798,12 @@ SYSCALL_DEFINE0(rt_sigreturn)
 
/* Trying to start TM on non TM system */
if (!cpu_has_feature(CPU_FTR_TM))
-   goto badframe;
+   goto badframe_block;
+
+   unsafe_get_user(uc_transact, &uc->uc_link, badframe_block);
+
+   user_read_access_end();
 
-   if (__get_user(uc_transact, &uc->uc_link))
-   goto badframe;
if (restore_tm_sigcontexts(current, &uc->uc_mcontext,
   &uc_transact->uc_mcontext))
goto badframe;
@@ -816,12 +822,9 @@ SYSCALL_DEFINE0(rt_sigreturn)
 * causing a TM bad thing.
 */
current->thread.regs->msr &= ~MSR_TS_MASK;
-   if (!user_read_access_begin(uc, sizeof(*uc)))
-   return -EFAULT;
-   if (__unsafe_restore_sigcontext(current, NULL, 1, 
&uc->uc_mcontext)) {
-   user_read_access_end();
-   goto badframe;
-   }
+   unsafe_restore_sigcontext(current, NULL, 1, &uc->uc_mcontext,
+ badframe_block);
+
user_read_access_end();
}
 
@@ -831,6 +834,8 @@ SYSCALL_DEFINE0(rt_sigreturn)
set_thread_flag(TIF_RESTOREALL);
return 0;
 
+badframe_block:
+   user_read_access_end();
 badframe:
signal_fault(current, regs, "rt_sigreturn", uc);
 
-- 
2.26.1


