Re: [PATCH] mm/memtest: Add ARCH_USE_MEMTEST

2021-02-04 Thread Max Filippov
On Thu, Feb 4, 2021 at 8:10 PM Anshuman Khandual
 wrote:
>
> early_memtest() does not get called from all architectures. Hence enabling
> CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
> option might not trigger the memory pattern tests as would be expected in
> normal circumstances. This situation is misleading.
>
> The change here prevents the above-mentioned problem by introducing a
> new config option, ARCH_USE_MEMTEST, which should be selected by platforms
> that call early_memtest(), in order to enable CONFIG_MEMTEST.
> Conversely, CONFIG_MEMTEST cannot be enabled on platforms where it would
> not be run anyway.
>
> Cc: Russell King 
> Cc: Catalin Marinas 
> Cc: Will Deacon 
> Cc: Thomas Bogendoerfer 
> Cc: Michael Ellerman 
> Cc: Benjamin Herrenschmidt 
> Cc: Paul Mackerras 
> Cc: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Chris Zankel 
> Cc: Max Filippov 
> Cc: linux-arm-ker...@lists.infradead.org
> Cc: linux-m...@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Cc: linux-xte...@linux-xtensa.org
> Cc: linux...@kvack.org
> Cc: linux-ker...@vger.kernel.org
> Signed-off-by: Anshuman Khandual 
> ---
> This patch applies on v5.11-rc6 and has been tested on the arm64 platform,
> but only build tested on all other platforms.
>
>  arch/arm/Kconfig | 1 +
>  arch/arm64/Kconfig   | 1 +
>  arch/mips/Kconfig| 1 +
>  arch/powerpc/Kconfig | 1 +
>  arch/x86/Kconfig | 1 +
>  arch/xtensa/Kconfig  | 1 +
>  lib/Kconfig.debug| 9 -
>  7 files changed, 14 insertions(+), 1 deletion(-)

Anshuman, entries in arch/*/Kconfig files are sorted in alphabetical order,
please keep them that way.

Reviewed-by: Max Filippov 

-- 
Thanks.
-- Max


Re: [PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Christophe Leroy

Please provide some description of the change.

And please clarify the patch subject, because as far as I can see, the return is already checked,
although the check seems wrong.


On 04/02/2021 at 19:16, Mayank Suman wrote:

Signed-off-by: Mayank Suman 
---
  arch/powerpc/kernel/eeh.c| 8 
  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
  2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c9120c..2dbe1558a71f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
char buf[20];
int ret;
  
-	ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-	if (!ret)
+	ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+	if (ret <= 0)
 		return -EFAULT;

Why return -EFAULT when the function has returned -EINVAL?
And why is it -EFAULT when ret is 0? EFAULT means an error accessing memory.

  
  	/*

@@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
  
  	memset(buf, 0, sizeof(buf));

ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
+   if (ret <= 0)
return -EFAULT;
  
  	ret = sscanf(buf, "%x:%x:%x.%x", , , , );

@@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
  
  	memset(buf, 0, sizeof(buf));

ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
+   if (ret <= 0)
return -EFAULT;
  
  	ret = sscanf(buf, "%x:%x:%x.%x", , , , );

diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 89e22c460ebf..36ed2b8f7375 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
return -ENXIO;
  
  	/* Copy over argument buffer */

-   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-   if (!ret)
+   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+   if (ret <= 0)
return -EFAULT;
  
  	/* Retrieve parameters */




Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Hey Nick, thanks for reviewing :)

On Fri, 2021-02-05 at 16:28 +1000, Nicholas Piggin wrote:
> Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> > Before guest entry, the TBU40 register is changed to reflect the guest timebase.
> > After exiting the guest, the register is reverted to its original value.
> > 
> > If one tries to get the timestamp from host between those changes, it
> > will present an incorrect value.
> > 
> > An example would be trying to add a tracepoint in
> > kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> > acquired could actually cause the host to crash.
> > 
> > Save the Timebase Offset to PACA and use it on sched_clock() to always
> > get the correct timestamp.
> 
> Ouch. Not sure how reasonable it is to half switch into guest registers 
> and expect to call into the wider kernel, fixing things up as we go. 
> What if mftb is used in other places?

IIUC, the CPU is not supposed to call anything as the host between guest
entry and guest exit, except for guest-related cases like
kvmppc_guest_entry_inject_int(); but in any case, if something calls mftb
it will still get the same value as before.

This is only supposed to change things that depend on sched_clock(), like
tracepoints, which can fire in those exception paths.


> Especially as it doesn't seem like there is a reason that function _has_
> to be called after the timebase is switched to guest, that's just how 
> the code is structured.

Correct, but if it is called, as in the ring-buffer routines used by
tracepoints, the difference between the last tb and the current (lower) tb
may cause the CPU to trap a PROGRAM exception, crashing the host.

> As a local hack to work out a bug okay. If you really need it upstream 
> could you put it under a debug config option?

You mean something that is automatically selected whenever those
configs are enabled? 

CONFIG_TRACEPOINT && CONFIG_KVM_BOOK3S_HANDLER && CONFIG_PPC_BOOK3S_64

Or something the user needs to select himself in menuconfig?
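
If it helps the discussion, here is a minimal sketch of what the gated
version could look like, assuming a hypothetical debug option (the name
CONFIG_KVM_BOOK3S_TB_DEBUG is invented for illustration; the body is the
same sched_clock() hunk as in the patch quoted below):

	notrace unsigned long long sched_clock(void)
	{
		u64 tb = get_tb() - boot_tb;

	#ifdef CONFIG_KVM_BOOK3S_TB_DEBUG	/* hypothetical debug option */
		/* subtract the guest timebase offset saved in the PACA */
		tb -= local_paca->kvm_hstate.tb_offset;
	#endif

		return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
	}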

> 
> Thanks,
> Nick
> 

Thank you!
Leonardo Bras

> > Signed-off-by: Leonardo Bras 
> > Suggested-by: Paul Mackerras 
> > ---
> > Changes since v1:
> > - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
> >   CONFIG_PPC_BOOK3S_64 are defined.
> > ---
> >  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
> >  arch/powerpc/kernel/asm-offsets.c | 1 +
> >  arch/powerpc/kernel/time.c| 8 +++-
> >  arch/powerpc/kvm/book3s_hv.c  | 2 ++
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
> >  5 files changed, 13 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> > b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > index 078f4648ea27..e2c12a10eed2 100644
> > --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> > +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> > @@ -131,6 +131,7 @@ struct kvmppc_host_state {
> >     u64 cfar;
> >     u64 ppr;
> >     u64 host_fscr;
> > +   u64 tb_offset;  /* Timebase offset: keeps correct timebase 
> > while on guest */
> >  #endif
> >  };
> >  
> > diff --git a/arch/powerpc/kernel/asm-offsets.c 
> > b/arch/powerpc/kernel/asm-offsets.c
> > index b12d7c049bfe..0beb8fdc6352 100644
> > --- a/arch/powerpc/kernel/asm-offsets.c
> > +++ b/arch/powerpc/kernel/asm-offsets.c
> > @@ -706,6 +706,7 @@ int main(void)
> >     HSTATE_FIELD(HSTATE_CFAR, cfar);
> >     HSTATE_FIELD(HSTATE_PPR, ppr);
> >     HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> > +   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
> >  #endif /* CONFIG_PPC_BOOK3S_64 */
> > 
> >  #else /* CONFIG_PPC_BOOK3S */
> > diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> > index 67feb3524460..f27f0163792b 100644
> > --- a/arch/powerpc/kernel/time.c
> > +++ b/arch/powerpc/kernel/time.c
> > @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
> >   */
> >  notrace unsigned long long sched_clock(void)
> >  {
> > -   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> > +   u64 tb = get_tb() - boot_tb;
> > +
> > +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> > +   tb -= local_paca->kvm_hstate.tb_offset;
> > +#endif
> > +
> > +   return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
> >  }
> >  

Re: [PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Nicholas Piggin
Excerpts from Leonardo Bras's message of February 5, 2021 4:06 pm:
> Before guest entry, the TBU40 register is changed to reflect the guest timebase.
> After exiting the guest, the register is reverted to its original value.
> 
> If one tries to get the timestamp from host between those changes, it
> will present an incorrect value.
> 
> An example would be trying to add a tracepoint in
> kvmppc_guest_entry_inject_int(), which depending on last tracepoint
> acquired could actually cause the host to crash.
> 
> Save the Timebase Offset to PACA and use it on sched_clock() to always
> get the correct timestamp.

Ouch. Not sure how reasonable it is to half switch into guest registers 
and expect to call into the wider kernel, fixing things up as we go. 
What if mftb is used in other places?

Especially as it doesn't seem like there is a reason that function _has_
to be called after the timebase is switched to guest, that's just how 
the code is structured.

As a local hack to work out a bug okay. If you really need it upstream 
could you put it under a debug config option?

Thanks,
Nick

> Signed-off-by: Leonardo Bras 
> Suggested-by: Paul Mackerras 
> ---
> Changes since v1:
> - Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
>   CONFIG_PPC_BOOK3S_64 are defined.
> ---
>  arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
>  arch/powerpc/kernel/asm-offsets.c | 1 +
>  arch/powerpc/kernel/time.c| 8 +++-
>  arch/powerpc/kvm/book3s_hv.c  | 2 ++
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
>  5 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
> b/arch/powerpc/include/asm/kvm_book3s_asm.h
> index 078f4648ea27..e2c12a10eed2 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> @@ -131,6 +131,7 @@ struct kvmppc_host_state {
>   u64 cfar;
>   u64 ppr;
>   u64 host_fscr;
> + u64 tb_offset;  /* Timebase offset: keeps correct timebase 
> while on guest */
>  #endif
>  };
>  
> diff --git a/arch/powerpc/kernel/asm-offsets.c 
> b/arch/powerpc/kernel/asm-offsets.c
> index b12d7c049bfe..0beb8fdc6352 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -706,6 +706,7 @@ int main(void)
>   HSTATE_FIELD(HSTATE_CFAR, cfar);
>   HSTATE_FIELD(HSTATE_PPR, ppr);
>   HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
> + HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
>  #endif /* CONFIG_PPC_BOOK3S_64 */
>  
>  #else /* CONFIG_PPC_BOOK3S */
> diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
> index 67feb3524460..f27f0163792b 100644
> --- a/arch/powerpc/kernel/time.c
> +++ b/arch/powerpc/kernel/time.c
> @@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
>   */
>  notrace unsigned long long sched_clock(void)
>  {
> - return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
> + u64 tb = get_tb() - boot_tb;
> +
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
> + tb -= local_paca->kvm_hstate.tb_offset;
> +#endif
> +
> + return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
>  }
>  
>  
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index b3731572295e..c08593c63353 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   if ((tb & 0xff) < (new_tb & 0xff))
>   mtspr(SPRN_TBU40, new_tb + 0x100);
>   vc->tb_offset_applied = vc->tb_offset;
> + local_paca->kvm_hstate.tb_offset = vc->tb_offset;
>   }
>  
>   if (vc->pcr)
> @@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
> *vcpu, u64 time_limit,
>   if ((tb & 0xff) < (new_tb & 0xff))
>   mtspr(SPRN_TBU40, new_tb + 0x100);
>   vc->tb_offset_applied = 0;
> + local_paca->kvm_hstate.tb_offset = 0;
>   }
>  
>   mtspr(SPRN_HDEC, 0x7fff);
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
> b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b73140607875..8f7a9f7f4ee6 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
>   cmpdi   r8,0
>   beq 37f
>   std r8, VCORE_TB_OFFSET_APPL(r5)
> + std r8, HSTATE_TB_OFFSET(r13)
> 	mftb	r6	/* current host timebase */
>   add r8,r8,r6
>   mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
> @@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
>   beq 17f
>   li  r0, 0
>   std r0, VCORE_TB_OFFSET_APPL(r5)
> + std r0, HSTATE_TB_OFFSET(r13)
> 	mftb	r6	/* current guest timebase */
> 	subf	r8,r8,r6

Re: [PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Mayank Suman
On 05/02/21 4:05 am, Oliver O'Halloran wrote:
> On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman  wrote:
>>
>> Signed-off-by: Mayank Suman 
> 
> commit messages aren't optional

Sorry. I will include the commit message in PATCH v2.

> 
>> ---
>>  arch/powerpc/kernel/eeh.c| 8 
>>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>>  2 files changed, 6 insertions(+), 6 deletions(-)
>>
>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index 813713c9120c..2dbe1558a71f 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file 
>> *filp,
>> char buf[20];
>> int ret;
>>
>> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, 
>> count);
>> -   if (!ret)
>> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
>> count);
> 
> We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
> is done in a few places to guarantee that the string is nul
> terminated, but without the preceding memset() that isn't actually
> guaranteed.

Yes, the buffer should be zeroed out first. I have included memset() in Patch 
v2.

> 
>> +   if (ret <= 0)
>> return -EFAULT;
> 
> EFAULT is supposed to be returned when the user supplies a buffer to
> write(2) which is outside their address space. I figured letting the
> sscanf() in the next step fail if the user writes a zero-length
> buffer and returning EINVAL made more sense. That said, the exact
> semantics around zero length writes are pretty handwavy so I guess
> this isn't wrong, but I don't think it's better either.
> 

simple_write_to_buffer() may return a negative value on failure.
So -EFAULT should be returned in the case of a negative return value.
The conditional (!ret) was not sufficient to catch negative return values.
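
Putting the pieces from this thread together, the corrected pattern
(buffer size, error value and ordering taken from the eeh.c hunks above,
plus the memset() Oliver asked for) reads roughly:

	char buf[20];
	ssize_t ret;

	/* zero first so the copied string is guaranteed NUL-terminated */
	memset(buf, 0, sizeof(buf));
	/* leave one byte of headroom for the terminator */
	ret = simple_write_to_buffer(buf, sizeof(buf) - 1, ppos, user_buf, count);
	if (ret <= 0)	/* catches both "nothing copied" and negative errors */
		return -EFAULT;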


[PATCH v2 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Before guest entry, the TBU40 register is changed to reflect the guest timebase.
After exiting the guest, the register is reverted to its original value.

If one tries to get the timestamp from host between those changes, it
will present an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always
get the correct timestamp.

Signed-off-by: Leonardo Bras 
Suggested-by: Paul Mackerras 
---
Changes since v1:
- Subtracts offset only when CONFIG_KVM_BOOK3S_HANDLER and
  CONFIG_PPC_BOOK3S_64 are defined.
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c | 1 +
 arch/powerpc/kernel/time.c| 8 +++-
 arch/powerpc/kvm/book3s_hv.c  | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
u64 cfar;
u64 ppr;
u64 host_fscr;
+   u64 tb_offset;  /* Timebase offset: keeps correct timebase while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..f27f0163792b 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,13 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+   u64 tb = get_tb() - boot_tb;
+
+#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_HANDLER)
+   tb -= local_paca->kvm_hstate.tb_offset;
+#endif
+
+   return mulhdu(tb, tb_to_ns_scale) << tb_to_ns_shift;
 }
 
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = vc->tb_offset;
+   local_paca->kvm_hstate.tb_offset = vc->tb_offset;
}
 
if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = 0;
+   local_paca->kvm_hstate.tb_offset = 0;
}
 
mtspr(SPRN_HDEC, 0x7fff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
cmpdi   r8,0
beq 37f
std r8, VCORE_TB_OFFSET_APPL(r5)
+   std r8, HSTATE_TB_OFFSET(r13)
	mftb	r6	/* current host timebase */
add r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
beq 17f
li  r0, 0
std r0, VCORE_TB_OFFSET_APPL(r5)
+   std r0, HSTATE_TB_OFFSET(r13)
	mftb	r6	/* current guest timebase */
	subf	r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
-- 
2.29.2



Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C

2021-02-04 Thread Christophe Leroy




On 05/02/2021 at 03:16, Nicholas Piggin wrote:

Excerpts from Michael Ellerman's message of February 5, 2021 10:22 am:

Nicholas Piggin  writes:

Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:

On 04/02/2021 at 04:27, Nicholas Piggin wrote:

Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:

On 25/02/2020 at 18:35, Nicholas Piggin wrote:

...

+
+   /*
+* We don't need to restore AMR on the way back to userspace for KUAP.
+* The value of AMR only matters while we're in the kernel.
+*/
+   kuap_restore_amr(regs);


Is it correct to restore the KUAP state here? Shouldn't we have it at a lower
level, in assembly?

Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare(),
or the end of it, in one way or another, and get the previous KUAP state
restored that way?


I'm not sure if there much more risk if it's here rather than the
instruction being in another place in the code.

There's a lot of user access around the kernel too if you want to find a
gadget to unlock KUAP then I suppose there is a pretty large attack
surface.


My understanding is that user access scope is strictly limited, for instance we 
enforce the
begin/end of user access to be in the same function, and we refrain from 
calling any other function
inside the user access window. x86 even has 'objtool' to enforce it at build 
time. So in theory
there is no way to get out of the function while user access is open.

Here with the interrupt exit function it is free beer. You have a place where 
you re-open user
access and return with a simple blr. So that's open bar. If someone manages to 
just call the
interrupt exit function, then user access remains open


Hmm okay maybe that's a good point.


I don't think it's a very attractive gadget, it's not just a plain blr,
it does a full stack frame tear down before the return. And there's no
LR reloads anywhere very close.

Obviously it depends on what the compiler decides to do, it's possible
it could be a usable gadget. But there are other places that are more
attractive I think, eg:

c061d768:   a6 03 3d 7d mtspr   29,r9
c061d76c:   2c 01 00 4c isync
c061d770:   00 00 00 60 nop
c061d774:   30 00 21 38 addir1,r1,48
c061d778:   20 00 80 4e blr


So I don't think we should redesign the code *purely* because we're
worried about interrupt_exit_kernel_prepare() being a useful gadget. If
we can come up with a way to restructure it that reads well and is
maintainable, and also reduces the chance of it being a good gadget then
sure.


Okay. That would be good if we can keep it in C, the pkeys + kuap combo
is fairly complicated and we might want to do something cleverer with it,
so that would make it even more difficult in asm.



Ok.

For ppc32, I prefer to keep it in assembly for the time being, and move everything from ASM to C at
once after porting syscalls and interrupts to C with wrappers.


Hope this is OK for you.

Christophe


Re: [PATCH v5 02/10] powerpc/signal: Add unsafe_copy_{vsx, fpr}_from_user()

2021-02-04 Thread Daniel Axtens
Hi Christopher,

I have checked that each implementation matches the corresponding
*_to_user implementation. We've had some debate about whether the
overarching implementation in the to/from pairs (especially where things
go via a bounce buffer) can be simplified - but that's probably not
really something that this patch set should do.

On that basis:
Reviewed-by: Daniel Axtens 

Kind regards,
Daniel

> Reuse the "safe" implementation from signal.c except for calling
> unsafe_copy_from_user() to copy into a local buffer.
>
> Signed-off-by: Christopher M. Riedl 
> ---
>  arch/powerpc/kernel/signal.h | 30 ++
>  1 file changed, 30 insertions(+)
>
> diff --git a/arch/powerpc/kernel/signal.h b/arch/powerpc/kernel/signal.h
> index 2559a681536e..7dfc536c78ef 100644
> --- a/arch/powerpc/kernel/signal.h
> +++ b/arch/powerpc/kernel/signal.h
> @@ -53,6 +53,30 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> *task, void __user *from);
>   [i], label);\
>  } while (0)
>  
> +#define unsafe_copy_fpr_from_user(task, from, label) do {\
> + struct task_struct *__t = task; \
> + u64 __user *__f = (u64 __user *)from;   \
> + u64 buf[ELF_NFPREG];\
> + int i;  \
> + \
> + unsafe_copy_from_user(buf, __f, sizeof(buf), label);\
> + for (i = 0; i < ELF_NFPREG - 1; i++)\
> + __t->thread.TS_FPR(i) = buf[i]; \
> + __t->thread.fp_state.fpscr = buf[i];\
> +} while (0)
> +
> +#define unsafe_copy_vsx_from_user(task, from, label) do {\
> + struct task_struct *__t = task; \
> + u64 __user *__f = (u64 __user *)from;   \
> + u64 buf[ELF_NVSRHALFREG];   \
> + int i;  \
> + \
> + unsafe_copy_from_user(buf, __f, sizeof(buf), label);\
> + for (i = 0; i < ELF_NVSRHALFREG ; i++)  \
> + __t->thread.fp_state.fpr[i][TS_VSRLOWOFFSET] = buf[i];  \
> +} while (0)
> +
> +
>  #ifdef CONFIG_PPC_TRANSACTIONAL_MEM
>  #define unsafe_copy_ckfpr_to_user(to, task, label)   do {\
>   struct task_struct *__t = task; \
> @@ -80,6 +104,10 @@ unsigned long copy_ckfpr_from_user(struct task_struct 
> *task, void __user *from);
>   unsafe_copy_to_user(to, (task)->thread.fp_state.fpr,\
>   ELF_NFPREG * sizeof(double), label)
>  
> +#define unsafe_copy_fpr_from_user(task, from, label) \
> + unsafe_copy_from_user((task)->thread.fp_state.fpr, from,\
> + ELF_NFPREG * sizeof(double), label)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> @@ -115,6 +143,8 @@ copy_ckfpr_from_user(struct task_struct *task, void 
> __user *from)
>  #else
>  #define unsafe_copy_fpr_to_user(to, task, label) do { } while (0)
>  
> +#define unsafe_copy_fpr_from_user(task, from, label) do { } while (0)
> +
>  static inline unsigned long
>  copy_fpr_to_user(void __user *to, struct task_struct *task)
>  {
> -- 
> 2.26.1


Re: [PATCH v5 01/10] powerpc/uaccess: Add unsafe_copy_from_user

2021-02-04 Thread Daniel Axtens
Hi Chris,

Pending anything that sparse reported (which I haven't checked), this
looks ok to me.

Reviewed-by: Daniel Axtens 

Kind regards,
Daniel

> Just wrap __copy_tofrom_user() for the usual 'unsafe' pattern which
> takes in a label to goto on error.
>
> Signed-off-by: Christopher M. Riedl 
> ---
>  arch/powerpc/include/asm/uaccess.h | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/include/asm/uaccess.h 
> b/arch/powerpc/include/asm/uaccess.h
> index 501c9a79038c..036e82eefac9 100644
> --- a/arch/powerpc/include/asm/uaccess.h
> +++ b/arch/powerpc/include/asm/uaccess.h
> @@ -542,6 +542,9 @@ user_write_access_begin(const void __user *ptr, size_t 
> len)
>  #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
>  #define unsafe_put_user(x, p, e) __put_user_goto(x, p, e)
>  
> +#define unsafe_copy_from_user(d, s, l, e) \
> + unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
> +
>  #define unsafe_copy_to_user(d, s, l, e) \
>  do { \
>   u8 __user *_dst = (u8 __user *)(d); \
> -- 
> 2.26.1


Re: [PATCH v5 10/10] powerpc/signal64: Use __get_user() to copy sigset_t

2021-02-04 Thread Christopher M. Riedl
On Wed Feb 3, 2021 at 12:43 PM CST, Christopher M. Riedl wrote:
> Usually sigset_t is exactly 8B which is a "trivial" size and does not
> warrant using __copy_from_user(). Use __get_user() directly in
> anticipation of future work to remove the trivial size optimizations
> from __copy_from_user(). Calling __get_user() also results in a small
> boost to signal handling throughput here.
>
> Signed-off-by: Christopher M. Riedl 

This patch triggered sparse warnings about 'different address spaces'.
This minor fixup cleans that up:

diff --git a/arch/powerpc/kernel/signal_64.c b/arch/powerpc/kernel/signal_64.c
index 42fdc4a7ff72..1dfda6403e14 100644
--- a/arch/powerpc/kernel/signal_64.c
+++ b/arch/powerpc/kernel/signal_64.c
@@ -97,7 +97,7 @@ static void prepare_setup_sigcontext(struct task_struct *tsk, 
int ctx_has_vsx_re
 #endif /* CONFIG_VSX */
 }

-static inline int get_user_sigset(sigset_t *dst, const sigset_t *src)
+static inline int get_user_sigset(sigset_t *dst, const sigset_t __user *src)
 {
if (sizeof(sigset_t) <= 8)
		return __get_user(dst->sig[0], &src->sig[0]);


[PATCH] mm/memtest: Add ARCH_USE_MEMTEST

2021-02-04 Thread Anshuman Khandual
early_memtest() does not get called from all architectures. Hence enabling
CONFIG_MEMTEST and providing a valid memtest=[1..N] kernel command line
option might not trigger the memory pattern tests as would be expected in
normal circumstances. This situation is misleading.

The change here prevents the above-mentioned problem by introducing a
new config option, ARCH_USE_MEMTEST, which should be selected by platforms
that call early_memtest(), in order to enable CONFIG_MEMTEST.
Conversely, CONFIG_MEMTEST cannot be enabled on platforms where it would
not be run anyway.
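
For context, the call a subscribing architecture makes is sketched below;
early_memtest() and the memblock helpers are the real interfaces from
include/linux/memblock.h, while the hook name arch_mem_init() is invented
for illustration:

	#include <linux/memblock.h>

	void __init arch_mem_init(void)		/* hypothetical arch hook */
	{
		/* ... memblock setup ... */

		/* run the memtest=N patterns over all usable DRAM */
		early_memtest(memblock_start_of_DRAM(), memblock_end_of_DRAM());
	}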

Cc: Russell King 
Cc: Catalin Marinas 
Cc: Will Deacon 
Cc: Thomas Bogendoerfer 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Chris Zankel 
Cc: Max Filippov 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-m...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-xte...@linux-xtensa.org
Cc: linux...@kvack.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Anshuman Khandual 
---
This patch applies on v5.11-rc6 and has been tested on the arm64 platform,
but only build tested on all other platforms.

 arch/arm/Kconfig | 1 +
 arch/arm64/Kconfig   | 1 +
 arch/mips/Kconfig| 1 +
 arch/powerpc/Kconfig | 1 +
 arch/x86/Kconfig | 1 +
 arch/xtensa/Kconfig  | 1 +
 lib/Kconfig.debug| 9 -
 7 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 138248999df7..a63b53c568df 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -32,6 +32,7 @@ config ARM
select ARCH_SUPPORTS_ATOMIC_RMW
select ARCH_USE_BUILTIN_BSWAP
select ARCH_USE_CMPXCHG_LOCKREF
+   select ARCH_USE_MEMTEST
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index c4acf8230f20..dfee5831d876 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -70,6 +70,7 @@ config ARM64
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+   select ARCH_USE_MEMTEST
select ARCH_SUPPORTS_DEBUG_PAGEALLOC
select ARCH_SUPPORTS_MEMORY_FAILURE
select ARCH_SUPPORTS_SHADOW_CALL_STACK if CC_HAVE_SHADOW_CALL_STACK
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 0a17bedf4f0d..1b21d8e53e6b 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -16,6 +16,7 @@ config MIPS
select ARCH_USE_CMPXCHG_LOCKREF if 64BIT
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_MEMTEST
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT if MMU
select ARCH_WANT_IPC_PARSE_VERSION
select BUILDTIME_TABLE_SORT
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..9935343a8750 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -151,6 +151,7 @@ config PPC
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
select ARCH_USE_QUEUED_SPINLOCKSif PPC_QUEUED_SPINLOCKS
+   select ARCH_USE_MEMTEST
select ARCH_WANT_IPC_PARSE_VERSION
select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
select ARCH_WANT_LD_ORPHAN_WARN
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 21f851179ff0..90545348db1b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -100,6 +100,7 @@ config X86
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_USE_SYM_ANNOTATIONS
+   select ARCH_USE_MEMTEST
select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
select ARCH_WANT_DEFAULT_BPF_JITif X86_64
select ARCH_WANTS_DYNAMIC_TASK_STRUCT
diff --git a/arch/xtensa/Kconfig b/arch/xtensa/Kconfig
index 37ce1489364e..8eb61fcdfc7f 100644
--- a/arch/xtensa/Kconfig
+++ b/arch/xtensa/Kconfig
@@ -9,6 +9,7 @@ config XTENSA
select ARCH_HAS_DMA_SET_UNCACHED if MMU
select ARCH_USE_QUEUED_RWLOCKS
select ARCH_USE_QUEUED_SPINLOCKS
+   select ARCH_USE_MEMTEST
select ARCH_WANT_FRAME_POINTERS
select ARCH_WANT_IPC_PARSE_VERSION
select BUILDTIME_TABLE_SORT
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 7937265ef879..6dd25b755a82 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -2469,11 +2469,18 @@ config TEST_FPU
 
 endif # RUNTIME_TESTING_MENU
 
+config ARCH_USE_MEMTEST
+   bool
+   help
+ An architecture should select this when it uses early_memtest()
+ during boot process.
+
 config MEMTEST
bool "Memtest"
+   depends on ARCH_USE_MEMTEST
help
  This option adds a kernel parameter 'memtest', which allows memtest
- to be set.
+ to be set and executed.
memtest=0, mean disabled; -- default
memtest=1, mean 

[PATCH 1/1] powerpc/kvm: Save Timebase Offset to fix sched_clock() while running guest code.

2021-02-04 Thread Leonardo Bras
Before guest entry, the TBU40 register is changed to reflect the guest timebase.
After exiting the guest, the register is reverted to its original value.

If one tries to get the timestamp from host between those changes, it
will present an incorrect value.

An example would be trying to add a tracepoint in
kvmppc_guest_entry_inject_int(), which depending on last tracepoint
acquired could actually cause the host to crash.

Save the Timebase Offset to PACA and use it on sched_clock() to always
get the correct timestamp.

Signed-off-by: Leonardo Bras 
---
 arch/powerpc/include/asm/kvm_book3s_asm.h | 1 +
 arch/powerpc/kernel/asm-offsets.c | 1 +
 arch/powerpc/kernel/time.c| 3 ++-
 arch/powerpc/kvm/book3s_hv.c  | 2 ++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   | 2 ++
 5 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h 
b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 078f4648ea27..e2c12a10eed2 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -131,6 +131,7 @@ struct kvmppc_host_state {
u64 cfar;
u64 ppr;
u64 host_fscr;
+   u64 tb_offset;  /* Timebase offset: keeps correct timebase while on guest */
 #endif
 };
 
diff --git a/arch/powerpc/kernel/asm-offsets.c 
b/arch/powerpc/kernel/asm-offsets.c
index b12d7c049bfe..0beb8fdc6352 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -706,6 +706,7 @@ int main(void)
HSTATE_FIELD(HSTATE_CFAR, cfar);
HSTATE_FIELD(HSTATE_PPR, ppr);
HSTATE_FIELD(HSTATE_HOST_FSCR, host_fscr);
+   HSTATE_FIELD(HSTATE_TB_OFFSET, tb_offset);
 #endif /* CONFIG_PPC_BOOK3S_64 */
 
 #else /* CONFIG_PPC_BOOK3S */
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index 67feb3524460..adf6648e3572 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -699,7 +699,8 @@ EXPORT_SYMBOL_GPL(tb_to_ns);
  */
 notrace unsigned long long sched_clock(void)
 {
-   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
-   return mulhdu(get_tb() - boot_tb, tb_to_ns_scale) << tb_to_ns_shift;
+   return mulhdu(get_tb() - boot_tb - local_paca->kvm_hstate.tb_offset, tb_to_ns_scale)
 }
 
 
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index b3731572295e..c08593c63353 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3491,6 +3491,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = vc->tb_offset;
+   local_paca->kvm_hstate.tb_offset = vc->tb_offset;
}
 
if (vc->pcr)
@@ -3594,6 +3595,7 @@ static int kvmhv_load_hv_regs_and_go(struct kvm_vcpu 
*vcpu, u64 time_limit,
if ((tb & 0xff) < (new_tb & 0xff))
mtspr(SPRN_TBU40, new_tb + 0x100);
vc->tb_offset_applied = 0;
+   local_paca->kvm_hstate.tb_offset = 0;
}
 
mtspr(SPRN_HDEC, 0x7fff);
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S 
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index b73140607875..8f7a9f7f4ee6 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -632,6 +632,7 @@ END_FTR_SECTION_IFCLR(CPU_FTR_ARCH_300)
cmpdi   r8,0
beq 37f
std r8, VCORE_TB_OFFSET_APPL(r5)
+   std r8, HSTATE_TB_OFFSET(r13)
	mftb	r6	/* current host timebase */
add r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
@@ -1907,6 +1908,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_ARCH_207S)
beq 17f
li  r0, 0
std r0, VCORE_TB_OFFSET_APPL(r5)
+   std r0, HSTATE_TB_OFFSET(r13)
	mftb	r6	/* current guest timebase */
	subf	r8,r8,r6
mtspr   SPRN_TBU40,r8   /* update upper 40 bits */
-- 
2.29.2



[PATCH] powerpc/kuap: Allow kernel thread to access userspace after kthread_use_mm

2021-02-04 Thread Aneesh Kumar K.V
This fixes the bad fault reported by KUAP when io_wqe_worker accesses userspace.

 Bug: Read fault blocked by KUAP!
 WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 
__do_page_fault+0x6b4/0xcd0
 NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
 LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
..
 Call Trace:
 [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0 (unreliable)
 [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
 [c00016367430] [c000c848] handle_page_fault+0x10/0x2c
 --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
..
 NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
 LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
 interrupt: 300
 [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
 [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
 [c00016367990] [c0710330] iomap_file_buffered_write+0xa0/0x120
 [c000163679e0] [c0080040791c] xfs_file_buffered_aio_write+0x314/0x5e0 
[xfs]
 [c00016367a90] [c06d74bc] io_write+0x10c/0x460
 [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
 [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250
 [c00016367cb0] [c06e2578] io_worker_handle_work+0x498/0x800
 [c00016367d40] [c06e2cdc] io_wqe_worker+0x3fc/0x4f0
 [c00016367da0] [c01cb0a4] kthread+0x1c4/0x1d0
 [c00016367e10] [c000dbf0] ret_from_kernel_thread+0x5c/0x6c

The kernel considers the thread AMR value for a kernel thread to be
AMR_KUAP_BLOCKED. Hence access to userspace is denied. This is
of course not correct, and we should allow userspace access after
kthread_use_mm(). To be precise, kthread_use_mm() should inherit the
AMR value of the operating address space. But the AMR value is
thread-specific, and we inherit the address space and not the thread
access restrictions. Because of this, ignore the AMR value when accessing
userspace via a kernel thread.
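
The triggering pattern is a kernel thread adopting a user mm before
touching user memory, roughly as in this sketch (not taken from io-wq
itself; kthread_use_mm()/kthread_unuse_mm() are the real interfaces from
include/linux/kthread.h):

	kthread_use_mm(mm);			/* adopt the user address space */
	if (copy_from_user(dst, user_ptr, len))	/* KUAP must permit this access */
		ret = -EFAULT;
	kthread_unuse_mm(mm);			/* drop it again */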

Cc: Zorro Lang 
Cc: Jens Axboe 
Cc: Christophe Leroy 
Cc: Nicholas Piggin 
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/kup.h | 23 ---
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index f50f72e535aa..2064621ae7b6 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -202,22 +202,16 @@ DECLARE_STATIC_KEY_FALSE(uaccess_flush_key);
 #include 
 #include 
 
-/*
- * For kernel thread that doesn't have thread.regs return
- * default AMR/IAMR values.
- */
 static inline u64 current_thread_amr(void)
 {
-   if (current->thread.regs)
-   return current->thread.regs->amr;
-   return AMR_KUAP_BLOCKED;
+   VM_BUG_ON(!current->thread.regs);
+   return current->thread.regs->amr;
 }
 
 static inline u64 current_thread_iamr(void)
 {
-   if (current->thread.regs)
-   return current->thread.regs->iamr;
-   return AMR_KUEP_BLOCKED;
+   VM_BUG_ON(!current->thread.regs);
+   return current->thread.regs->iamr;
 }
 #endif /* CONFIG_PPC_PKEY */
 
@@ -384,7 +378,14 @@ static __always_inline void allow_user_access(void __user 
*to, const void __user
// This is written so we can resolve to a single case at build time
BUILD_BUG_ON(!__builtin_constant_p(dir));
 
-   if (mmu_has_feature(MMU_FTR_PKEY))
+   /*
+* Kernel threads may access user mm with kthread_use_mm() but
+* can't use current_thread_amr because they have thread.regs==NULL,
+* but they have no pkeys.
+*/
+   if (current->flags & PF_KTHREAD)
+   thread_amr = 0;
+   else if (mmu_has_feature(MMU_FTR_PKEY))
thread_amr = current_thread_amr();
 
if (dir == KUAP_READ)
-- 
2.29.2



[PATCH] mm/pmem: Avoid inserting hugepage PTE entry with fsdax if hugepage support is disabled

2021-02-04 Thread Aneesh Kumar K.V
Differentiate between hardware not supporting hugepages and user disabling THP
via 'echo never > /sys/kernel/mm/transparent_hugepage/enabled'

For the devdax namespace, the kernel handles the above via the
supported_alignment attribute and failing to initialize the namespace
if the namespace align value is not supported on the platform.

For the fsdax namespace, the kernel will continue to initialize
the namespace. This can result in the kernel creating a huge pte
entry even though the hardware doesn't support it.

We do want hugepage support with pmem even if the end-user disabled THP
via sysfs file (/sys/kernel/mm/transparent_hugepage/enabled). Hence
differentiate between hardware/firmware lacking support vs user-controlled
disable of THP and prevent a huge fault if the hardware lacks hugepage
support.
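
Illustratively, the effect at fault time is that a DAX huge-fault path
consulting the helper modified below falls back to base pages when the
hardware lacks hugepage support (a sketch; the surrounding fault handler
is not shown):

	/* in a huge-fault path for a DAX VMA */
	if (!__transparent_hugepage_enabled(vma))
		return VM_FAULT_FALLBACK;	/* use PAGE_SIZE mappings instead */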

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/huge_mm.h | 15 +--
 mm/huge_memory.c|  6 +-
 2 files changed, 14 insertions(+), 7 deletions(-)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 6a19f35f836b..ba973efcd369 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -78,6 +78,7 @@ static inline vm_fault_t vmf_insert_pfn_pud(struct vm_fault 
*vmf, pfn_t pfn,
 }
 
 enum transparent_hugepage_flag {
+   TRANSPARENT_HUGEPAGE_NEVER_DAX,
TRANSPARENT_HUGEPAGE_FLAG,
TRANSPARENT_HUGEPAGE_REQ_MADV_FLAG,
TRANSPARENT_HUGEPAGE_DEFRAG_DIRECT_FLAG,
@@ -123,6 +124,13 @@ extern unsigned long transparent_hugepage_flags;
  */
 static inline bool __transparent_hugepage_enabled(struct vm_area_struct *vma)
 {
+
+   /*
+* If the hardware/firmware marked hugepage support disabled.
+*/
+   if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_NEVER_DAX))
+   return false;
+
if (vma->vm_flags & VM_NOHUGEPAGE)
return false;
 
@@ -134,12 +142,7 @@ static inline bool __transparent_hugepage_enabled(struct 
vm_area_struct *vma)
 
if (transparent_hugepage_flags & (1 << TRANSPARENT_HUGEPAGE_FLAG))
return true;
-   /*
-* For dax vmas, try to always use hugepage mappings. If the kernel does
-* not support hugepages, fsdax mappings will fallback to PAGE_SIZE
-* mappings, and device-dax namespaces, that try to guarantee a given
-* mapping size, will fail to enable
-*/
+
if (vma_is_dax(vma))
return true;
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 9237976abe72..d698b7e27447 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -386,7 +386,11 @@ static int __init hugepage_init(void)
struct kobject *hugepage_kobj;
 
if (!has_transparent_hugepage()) {
-   transparent_hugepage_flags = 0;
+   /*
+* Hardware doesn't support hugepages, hence disable
+* DAX PMD support.
+*/
+   transparent_hugepage_flags = 1 << TRANSPARENT_HUGEPAGE_NEVER_DAX;
return -EINVAL;
}
 
-- 
2.29.2



Re: [PATCH] powerpc/perf: Record counter overflow always if SAMPLE_IP is unset

2021-02-04 Thread Athira Rajeev



> On 04-Feb-2021, at 8:25 AM, Michael Ellerman  wrote:
> 
> Athira Rajeev  writes:
>> While sampling for marked events, currently we record the sample only
>> if the SIAR valid bit of Sampled Instruction Event Register (SIER) is
>> set. SIAR_VALID bit is used for fetching the instruction address from
>> Sampled Instruction Address Register(SIAR). But there are some usecases,
>> where the user is interested only in the PMU stats at each counter
>> overflow and the exact IP of the overflow event is not required.
>> Dropping SIAR invalid samples will fail to record some of the counter
>> overflows in such cases.
>> 
>> Example of such usecase is dumping the PMU stats (event counts)
>> after some regular amount of instructions/events from the userspace
>> (ex: via ptrace). Here counter overflow is indicated to userspace via
>> signal handler, and captured by monitoring and enabling I/O
>> signaling on the event file descriptor. In these cases, we expect to
>> get sample/overflow indication after each specified sample_period.
>> 
>> Perf event attribute will not have PERF_SAMPLE_IP set in the
>> sample_type if exact IP of the overflow event is not requested. So
>> while profiling if SAMPLE_IP is not set, just record the counter overflow
>> irrespective of SIAR_VALID check.
>> 
>> Signed-off-by: Athira Rajeev 
>> ---
>> arch/powerpc/perf/core-book3s.c | 10 --
>> 1 file changed, 8 insertions(+), 2 deletions(-)
>> 
>> diff --git a/arch/powerpc/perf/core-book3s.c 
>> b/arch/powerpc/perf/core-book3s.c
>> index 28206b1fe172..bb4828a05e4d 100644
>> --- a/arch/powerpc/perf/core-book3s.c
>> +++ b/arch/powerpc/perf/core-book3s.c
>> @@ -2166,10 +2166,16 @@ static void record_and_restart(struct perf_event 
>> *event, unsigned long val,
>>   * address even when freeze on supervisor state (kernel) is set in
>>   * MMCR2. Check attr.exclude_kernel and address to drop the sample in
>>   * these cases.
>> + *
>> + * If address is not requested in the sample
>> + * via PERF_SAMPLE_IP, just record that sample
>> + * irrespective of SIAR valid check.
>>   */
>> -if (event->attr.exclude_kernel && record)
>> -if (is_kernel_addr(mfspr(SPRN_SIAR)))
>> +if (event->attr.exclude_kernel && record) {
>> +if (is_kernel_addr(mfspr(SPRN_SIAR)) && 
>> (event->attr.sample_type & PERF_SAMPLE_IP))
>>  record = 0;
>> +} else if (!record && !(event->attr.sample_type & PERF_SAMPLE_IP))
>> +record = 1;
> 
> This seems wrong, you're assuming that record was previously set by
> record = siar_valid(), but it may be that record is still 0 from the
> initialisation and we weren't going to record.
> 
> Don't we need something more like this?

Hi Michael,

Thanks for checking the patch and sharing the suggestion.

Yes, the below change looks good and was tested with my scenario.
I will send a V2 with the new change.

Thanks
Athira
> 
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 9fd06010e8b6..e4e8a017d339 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -2136,7 +2136,12 @@ static void record_and_restart(struct perf_event 
> *event, unsigned long val,
>   left += period;
>   if (left <= 0)
>   left = period;
> - record = siar_valid(regs);
> +
> + if (event->attr.sample_type & PERF_SAMPLE_IP)
> + record = siar_valid(regs);
> + else
> + record = 1;
> +
>   event->hw.last_period = event->hw.sample_period;
>   }
>   if (left < 0x8000LL)
> @@ -2154,9 +2159,10 @@ static void record_and_restart(struct perf_event 
> *event, unsigned long val,
>* MMCR2. Check attr.exclude_kernel and address to drop the sample in
>* these cases.
>*/
> - if (event->attr.exclude_kernel && record)
> - if (is_kernel_addr(mfspr(SPRN_SIAR)))
> - record = 0;
> + if (event->attr.exclude_kernel &&
> + (event->attr.sample_type & PERF_SAMPLE_IP) &&
> + is_kernel_addr(mfspr(SPRN_SIAR)))
> + record = 0;
> 
>   /*
>* Finally record data if requested.
> 
> 
> 
> cheers
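
For completeness, the userspace side of the usecase from the commit
message can be sketched as below: counter overflows delivered via SIGIO
on the event fd, with PERF_SAMPLE_IP deliberately left out of sample_type
(the event choice and period are illustrative):

	struct perf_event_attr attr = { 0 };
	int fd;

	attr.size = sizeof(attr);
	attr.type = PERF_TYPE_HARDWARE;
	attr.config = PERF_COUNT_HW_INSTRUCTIONS;
	attr.sample_period = 100000;
	attr.sample_type = 0;			/* no PERF_SAMPLE_IP requested */

	fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);

	/* request a SIGIO on every counter overflow (F_SETSIG is Linux-specific) */
	fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_ASYNC);
	fcntl(fd, F_SETSIG, SIGIO);
	fcntl(fd, F_SETOWN, getpid());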



Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C

2021-02-04 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of February 5, 2021 10:22 am:
> Nicholas Piggin  writes:
>> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>>> On 04/02/2021 at 04:27, Nicholas Piggin wrote:
 Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
> On 25/02/2020 at 18:35, Nicholas Piggin wrote:
> ...
>> +
>> +/*
>> + * We don't need to restore AMR on the way back to userspace 
>> for KUAP.
>> + * The value of AMR only matters while we're in the kernel.
>> + */
>> +kuap_restore_amr(regs);
>
> Is it correct to restore the KUAP state here? Shouldn't we have it at 
> a lower level, in assembly?
>
> Isn't there a risk that someone manages to call 
> interrupt_exit_kernel_prepare(), or the end of it, in
> one way or another, and get the previous KUAP state restored that way?
 
 I'm not sure if there much more risk if it's here rather than the
 instruction being in another place in the code.
 
 There's a lot of user access around the kernel too if you want to find a
 gadget to unlock KUAP then I suppose there is a pretty large attack
 surface.
>>> 
>>> My understanding is that user access scope is strictly limited, for 
>>> instance we enforce the 
>>> begin/end of user access to be in the same function, and we refrain from 
>>> calling any other function 
>>> inside the user access window. x86 even have 'objtool' to enforce it at 
>>> build time. So in theory 
>>> there is no way to get out of the function while user access is open.
>>> 
>>> Here with the interrupt exit function it is free beer. You have a place 
>>> where you re-open user 
>>> access and return with a simple blr. So that's open bar. If someone manages 
>>> to just call the 
>>> interrupt exit function, then user access remains open
>>
>> Hmm okay maybe that's a good point.
> 
> I don't think it's a very attractive gadget, it's not just a plain blr,
> it does a full stack frame tear down before the return. And there's no
> LR reloads anywhere very close.
> 
> Obviously it depends on what the compiler decides to do, it's possible
> it could be a usable gadget. But there are other places that are more
> attractive I think, eg:
> 
> c061d768: a6 03 3d 7d mtspr   29,r9
> c061d76c: 2c 01 00 4c isync
> c061d770: 00 00 00 60 nop
> c061d774: 30 00 21 38 addir1,r1,48
> c061d778: 20 00 80 4e blr
> 
> 
> So I don't think we should redesign the code *purely* because we're
> worried about interrupt_exit_kernel_prepare() being a useful gadget. If
> we can come up with a way to restructure it that reads well and is
> maintainable, and also reduces the chance of it being a good gadget then
> sure.

Okay. That would be good if we can keep it in C, the pkeys + kuap combo
is fairly complicated and we might want to do something cleverer with it, 
so that would make it even more difficult in asm.

Thanks,
Nick


Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C

2021-02-04 Thread Michael Ellerman
Nicholas Piggin  writes:
> Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
>> On 04/02/2021 at 04:27, Nicholas Piggin wrote:
>>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
 On 25/02/2020 at 18:35, Nicholas Piggin wrote:
...
> +
> + /*
> +  * We don't need to restore AMR on the way back to userspace for KUAP.
> +  * The value of AMR only matters while we're in the kernel.
> +  */
> + kuap_restore_amr(regs);

 Is it correct to restore the KUAP state here? Shouldn't we have it at a lower 
 level, in assembly?

 Isn't there a risk that someone manages to call 
 interrupt_exit_kernel_prepare(), or the end of it, in
 one way or another, and get the previous KUAP state restored that way?
>>> 
>>> I'm not sure if there much more risk if it's here rather than the
>>> instruction being in another place in the code.
>>> 
>>> There's a lot of user access around the kernel too if you want to find a
>>> gadget to unlock KUAP then I suppose there is a pretty large attack
>>> surface.
>> 
>> My understanding is that user access scope is strictly limited, for instance 
>> we enforce the 
>> begin/end of user access to be in the same function, and we refrain from 
>> calling any other function 
>> inside the user access window. x86 even has 'objtool' to enforce it at 
>> build time. So in theory 
>> there is no way to get out of the function while user access is open.
>> 
>> Here with the interrupt exit function it is free beer. You have a place 
>> where you re-open user 
>> access and return with a simple blr. So that's open bar. If someone manages 
>> to just call the 
>> interrupt exit function, then user access remains open
>
> Hmm okay maybe that's a good point.

I don't think it's a very attractive gadget, it's not just a plain blr,
it does a full stack frame tear down before the return. And there's no
LR reloads anywhere very close.

Obviously it depends on what the compiler decides to do, it's possible
it could be a usable gadget. But there are other places that are more
attractive I think, eg:

c061d768:   a6 03 3d 7d mtspr   29,r9
c061d76c:   2c 01 00 4c isync
c061d770:   00 00 00 60 nop
c061d774:   30 00 21 38 addir1,r1,48
c061d778:   20 00 80 4e blr


So I don't think we should redesign the code *purely* because we're
worried about interrupt_exit_kernel_prepare() being a useful gadget. If
we can come up with a way to restructure it that reads well and is
maintainable, and also reduces the chance of it being a good gadget then
sure.

cheers


Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT

2021-02-04 Thread Lakshmi Ramasubramanian

On 2/4/21 3:36 PM, Rob Herring wrote:

On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian
 wrote:


On 2/4/21 11:26 AM, Rob Herring wrote:

On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
 wrote:


of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

powerpc sets the FDT address in image_loader_data field in
"struct kimage" and the memory is freed in
kimage_file_post_load_cleanup().  This cleanup function uses kfree()
to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
for allocation, the buffer needs to be freed using kvfree().


You could just change the kexec core to call kvfree() instead.





Define "fdt" field in "struct kimage_arch" for powerpc to store
the address of FDT, and free the memory in powerpc specific
arch_kimage_file_post_load_cleanup().


However, given all the other buffers have an explicit field in kimage
or kimage_arch, changing powerpc to match arm64 is better IMO.


Just to be clear:
I'll leave this as is - free FDT buffer in powerpc's
arch_kimage_file_post_load_cleanup() to match arm64 behavior.


Yes.


ok




Will not change "kexec core" to call kvfree() - doing that change would
require changing all architectures to use kvmalloc() for
image_loader_data allocation.


Actually, no. AIUI, kvfree() can be used whether you used kvmalloc,
vmalloc, or kmalloc for the alloc.

That is good information. Thanks.
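
A minimal sketch of that property (kvmalloc() may take either the kmalloc
or the vmalloc path internally; kvfree() releases the result either way):

	#include <linux/mm.h>	/* kvmalloc(), kvfree() */

	void *fdt = kvmalloc(fdt_size, GFP_KERNEL);
	if (!fdt)
		return -ENOMEM;
	/* ... build and hand off the FDT ... */
	kvfree(fdt);	/* correct regardless of which allocator backed it */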




Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Rob Herring 
Suggested-by: Thiago Jung Bauermann 
---
   arch/powerpc/include/asm/kexec.h  |  2 ++
   arch/powerpc/kexec/elf_64.c   | 26 --
   arch/powerpc/kexec/file_load_64.c |  3 +++
   3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 2c0be93d239a..d7d13cac4d31 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -111,6 +111,8 @@ struct kimage_arch {
  unsigned long elf_headers_mem;
  unsigned long elf_headers_sz;
  void *elf_headers;
+
+   void *fdt;
   };

   char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index d0e459bb2f05..51d2d8eb6c1b 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -19,6 +19,7 @@
   #include 
   #include 
   #include 
+#include 
   #include 
   #include 
   #include 
@@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
  unsigned int fdt_size;
  unsigned long kernel_load_addr;
  unsigned long initrd_load_addr = 0, fdt_load_addr;
-   void *fdt;
+   void *fdt = NULL;
  const void *slave_code;
  struct elfhdr ehdr;
  char *modified_cmdline = NULL;
@@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
  }

  fdt_size = fdt_totalsize(initial_boot_params) * 2;
-   fdt = kmalloc(fdt_size, GFP_KERNEL);
+   fdt = of_alloc_and_init_fdt(fdt_size);
  if (!fdt) {
  pr_err("Not enough memory for the device tree.\n");
  ret = -ENOMEM;
  goto out;
  }
-   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
-   if (ret < 0) {
-   pr_err("Error setting up the new device tree.\n");
-   ret = -EINVAL;
-   goto out;
-   }

  ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,


The first thing this function does is call setup_new_fdt() which first
calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
PPC code split. It looks like there's a 32-bit and 64-bit split, but
32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
setup_new_fdt_ppc64()). The arm64 version is calling
of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.

So we can just make of_alloc_and_init_fdt() also call
of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt() to
do the alloc and copy).

ok - will move fdt allocation into of_kexec_setup_new_fdt().

I don't think the architecture needs to pick the size either. It's
doubtful that either one is that sensitive to the amount of extra space.

I am not clear about the above comment -
are you saying the architectures don't need to pass FDT size to the
alloc function?

arm64 is adding command line string length and some extra space to the
size computed from initial_boot_params for FDT Size:

 buf_size = fdt_totalsize(initial_boot_params)
 + cmdline_len + DTB_EXTRA_SPACE;

powerpc is just using twice the size computed from initial_boot_params

 fdt_size = fdt_totalsize(initial_boot_params) * 2;

I think it would be safe to let arm64 and 

Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT

2021-02-04 Thread Rob Herring
On Thu, Feb 4, 2021 at 5:23 PM Lakshmi Ramasubramanian
 wrote:
>
> On 2/4/21 11:26 AM, Rob Herring wrote:
> > On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
> >  wrote:
> >>
> >> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> >> drivers/of/kexec.c to allocate and free memory for FDT.
> >>
> >> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> >> initialize the FDT, and to free the FDT respectively.
> >>
> >> powerpc sets the FDT address in image_loader_data field in
> >> "struct kimage" and the memory is freed in
> >> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
> >> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
> >> for allocation, the buffer needs to be freed using kvfree().
> >
> > You could just change the kexec core to call kvfree() instead.
>
> >
> >> Define "fdt" field in "struct kimage_arch" for powerpc to store
> >> the address of FDT, and free the memory in powerpc specific
> >> arch_kimage_file_post_load_cleanup().
> >
> > However, given all the other buffers have an explicit field in kimage
> > or kimage_arch, changing powerpc to match arm64 is better IMO.
>
> Just to be clear:
> I'll leave this as is - free FDT buffer in powerpc's
> arch_kimage_file_post_load_cleanup() to match arm64 behavior.

Yes.

> Will not change "kexec core" to call kvfree() - doing that change would
> require changing all architectures to use kvmalloc() for
> image_loader_data allocation.

Actually, no. AIUI, kvfree() can be used whether you used kvmalloc,
vmalloc, or kmalloc for the alloc.
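
A minimal sketch of why that is safe: kvfree() dispatches on the address
itself, so one free path covers both allocators (simplified from mm/util.c):

	void kvfree(const void *addr)
	{
		/*
		 * vmalloc addresses live in a dedicated range, so the
		 * right allocator can be recovered from the pointer alone.
		 */
		if (is_vmalloc_addr(addr))
			vfree(addr);
		else
			kfree(addr);
	}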

> >> Signed-off-by: Lakshmi Ramasubramanian 
> >> Suggested-by: Rob Herring 
> >> Suggested-by: Thiago Jung Bauermann 
> >> ---
> >>   arch/powerpc/include/asm/kexec.h  |  2 ++
> >>   arch/powerpc/kexec/elf_64.c   | 26 --
> >>   arch/powerpc/kexec/file_load_64.c |  3 +++
> >>   3 files changed, 21 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/arch/powerpc/include/asm/kexec.h 
> >> b/arch/powerpc/include/asm/kexec.h
> >> index 2c0be93d239a..d7d13cac4d31 100644
> >> --- a/arch/powerpc/include/asm/kexec.h
> >> +++ b/arch/powerpc/include/asm/kexec.h
> >> @@ -111,6 +111,8 @@ struct kimage_arch {
> >>  unsigned long elf_headers_mem;
> >>  unsigned long elf_headers_sz;
> >>  void *elf_headers;
> >> +
> >> +   void *fdt;
> >>   };
> >>
> >>   char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
> >> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> >> index d0e459bb2f05..51d2d8eb6c1b 100644
> >> --- a/arch/powerpc/kexec/elf_64.c
> >> +++ b/arch/powerpc/kexec/elf_64.c
> >> @@ -19,6 +19,7 @@
> >>   #include 
> >>   #include 
> >>   #include 
> >> +#include 
> >>   #include 
> >>   #include 
> >>   #include 
> >> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char 
> >> *kernel_buf,
> >>  unsigned int fdt_size;
> >>  unsigned long kernel_load_addr;
> >>  unsigned long initrd_load_addr = 0, fdt_load_addr;
> >> -   void *fdt;
> >> +   void *fdt = NULL;
> >>  const void *slave_code;
> >>  struct elfhdr ehdr;
> >>  char *modified_cmdline = NULL;
> >> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char 
> >> *kernel_buf,
> >>  }
> >>
> >>  fdt_size = fdt_totalsize(initial_boot_params) * 2;
> >> -   fdt = kmalloc(fdt_size, GFP_KERNEL);
> >> +   fdt = of_alloc_and_init_fdt(fdt_size);
> >>  if (!fdt) {
> >>  pr_err("Not enough memory for the device tree.\n");
> >>  ret = -ENOMEM;
> >>  goto out;
> >>  }
> >> -   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
> >> -   if (ret < 0) {
> >> -   pr_err("Error setting up the new device tree.\n");
> >> -   ret = -EINVAL;
> >> -   goto out;
> >> -   }
> >>
> >>  ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
> >
> > The first thing this function does is call setup_new_fdt() which first
> > calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
> > PPC code split. It looks like there's a 32-bit and 64-bit split, but
> > 32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
> > setup_new_fdt_ppc64()). The arm64 version is calling
> > of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.
> >
> > So we can just make of_alloc_and_init_fdt() also call
> > of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt to do
> > the alloc and copy).
> ok - will move fdt allocation into of_kexec_setup_new_fdt().
>
> I don't think the architecture needs to pick the
> > size either. It's doubtful that either one is that sensitive to the
> > amount of extra space.
> I am not clear about the above comment -
> are you saying the architectures don't need to pass FDT size to the
> alloc function?
>
> arm64 is adding command line string length 

Re: [PATCH 1/2] PCI/AER: Disable AER interrupt during suspend

2021-02-04 Thread Bjorn Helgaas
[+cc Alex]

On Thu, Jan 28, 2021 at 12:09:37PM +0800, Kai-Heng Feng wrote:
> On Thu, Jan 28, 2021 at 4:51 AM Bjorn Helgaas  wrote:
> > On Thu, Jan 28, 2021 at 01:31:00AM +0800, Kai-Heng Feng wrote:
> > > Commit 50310600ebda ("iommu/vt-d: Enable PCI ACS for platform opt in
> > > hint") enables ACS, and some platforms lose its NVMe after resume from
> > > firmware:
> > > [   50.947816] pcieport :00:1b.0: DPC: containment event, 
> > > status:0x1f01 source:0x
> > > [   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error 
> > > detected
> > > [   50.947829] pcieport :00:1b.0: PCIe Bus Error: 
> > > severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
> > > [   50.947830] pcieport :00:1b.0:   device [8086:06ac] error 
> > > status/mask=0020/0001
> > > [   50.947831] pcieport :00:1b.0:[21] ACSViol
> > > (First)
> > > [   50.947841] pcieport :00:1b.0: AER: broadcast error_detected 
> > > message
> > > [   50.947843] nvme nvme0: frozen state error detected, reset controller
> > >
> > > It happens right after ACS gets enabled during resume.
> > >
> > > To prevent that from happening, disable AER interrupt and enable it on
> > > system suspend and resume, respectively.
> >
> > Lots of questions here.  Maybe this is what we'll end up doing, but I
> > am curious about why the error is reported in the first place.
> >
> > Is this a consequence of the link going down and back up?
> 
> Could be. From the observations, it only happens when firmware suspend
> (S3) is used.
> Maybe it happens when it gets powered up, but I don't have equipment
> to debug at hardware level.
> 
> If we use non-firmware suspend method, enabling ACS after resume won't
> trip AER and DPC.
> 
> > Is it consequence of the device doing a DMA when it shouldn't?
> 
> If it's doing DMA while suspending, the same error should also happen
> after NVMe is suspended and before PCIe port suspending.
> Furthermore, if the non-firmware suspend method is used, there's no such
> issue, so a DMA operation is less likely.
> 
> > Are we doing something in the wrong order during suspend?  Or maybe
> > resume, since I assume the error is reported during resume?
> 
> Yes the error is reported during resume. The suspend/resume order
> seems fine as non-firmware suspend doesn't have this issue.

I really feel like we need a better understanding of what's going on
here.  Disabling the AER interrupt is like closing our eyes and
pretending that because we don't see it, it didn't happen.

An ACS error is triggered by a DMA, right?  I'm assuming an MMIO
access from the CPU wouldn't trigger this error.  And it sounds like
the error is triggered before we even start running the driver after
resume.

If we're powering up an NVMe device from D3cold and it DMAs before the
driver touches it, something would be seriously broken.  I doubt
that's what's happening.  Maybe a device could resume some previously
programmed DMA after powering up from D3hot.

Or maybe the error occurred on suspend, like if the device wasn't
quiesced or something, but we didn't notice it until resume?  The 
AER error status bits are RW1CS, which means they can be preserved
across hot/warm/cold resets.

Can you instrument the code to see whether the AER error status bit is
set before enabling ACS?  I'm not sure that merely enabling ACS (I
assume you mean pci_std_enable_acs(), where we write PCI_ACS_CTRL)
should cause an interrupt for a previously-logged error.  I suspect
that could happen when enabling *AER*, but I wouldn't think it would
happen when enabling *ACS*.
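
For instance, a debug-only sketch of that instrumentation (the helper name
is made up here, it is not part of any posted series) could read the status
register before ACS is written:

	static void dump_aer_uncor_status(struct pci_dev *pdev)
	{
		int aer = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
		u32 status;

		if (!aer)
			return;
		/* RW1CS bits survive resets, so a stale ACSViol would show here */
		pci_read_config_dword(pdev, aer + PCI_ERR_UNCOR_STATUS, &status);
		pci_info(pdev, "UNCOR_STATUS before ACS enable: %#010x\n", status);
	}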

Does this error happen on multiple machines from different vendors?
Wondering if it could be a BIOS issue, e.g., BIOS not cleaning up
after it did something to cause an error.

> > If we *do* take the error, why doesn't DPC recovery work?
> 
> It works for the root port, but not for the NVMe drive:
> [   50.947816] pcieport :00:1b.0: DPC: containment event,
> status:0x1f01 source:0x
> [   50.947817] pcieport :00:1b.0: DPC: unmasked uncorrectable error 
> detected
> [   50.947829] pcieport :00:1b.0: PCIe Bus Error:
> severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver
> ID)
> [   50.947830] pcieport :00:1b.0:   device [8086:06ac] error
> status/mask=0020/0001
> [   50.947831] pcieport :00:1b.0:[21] ACSViol(First)
> [   50.947841] pcieport :00:1b.0: AER: broadcast error_detected message
> [   50.947843] nvme nvme0: frozen state error detected, reset controller
> [   50.948400] ACPI: EC: event unblocked
> [   50.948432] xhci_hcd :00:14.0: PME# disabled
> [   50.948444] xhci_hcd :00:14.0: enabling bus mastering
> [   50.949056] pcieport :00:1b.0: PME# disabled
> [   50.949068] pcieport :00:1c.0: PME# disabled
> [   50.949416] e1000e :00:1f.6: PME# disabled
> [   50.949463] e1000e :00:1f.6: enabling bus mastering
> [   50.951606] sd 0:0:0:0: [sda] Starting disk
> [ 

Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT

2021-02-04 Thread Lakshmi Ramasubramanian

On 2/4/21 11:26 AM, Rob Herring wrote:

On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
 wrote:


of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

powerpc sets the FDT address in image_loader_data field in
"struct kimage" and the memory is freed in
kimage_file_post_load_cleanup().  This cleanup function uses kfree()
to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
for allocation, the buffer needs to be freed using kvfree().


You could just change the kexec core to call kvfree() instead.





Define "fdt" field in "struct kimage_arch" for powerpc to store
the address of FDT, and free the memory in powerpc specific
arch_kimage_file_post_load_cleanup().


However, given all the other buffers have an explicit field in kimage
or kimage_arch, changing powerpc to match arm64 is better IMO.


Just to be clear:
I'll leave this as is - free FDT buffer in powerpc's 
arch_kimage_file_post_load_cleanup() to match arm64 behavior.


Will not change "kexec core" to call kvfree() - doing that change would 
require changing all architectures to use kvmalloc() for 
image_loader_data allocation.





Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Rob Herring 
Suggested-by: Thiago Jung Bauermann 
---
  arch/powerpc/include/asm/kexec.h  |  2 ++
  arch/powerpc/kexec/elf_64.c   | 26 --
  arch/powerpc/kexec/file_load_64.c |  3 +++
  3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 2c0be93d239a..d7d13cac4d31 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -111,6 +111,8 @@ struct kimage_arch {
 unsigned long elf_headers_mem;
 unsigned long elf_headers_sz;
 void *elf_headers;
+
+   void *fdt;
  };

  char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index d0e459bb2f05..51d2d8eb6c1b 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -19,6 +19,7 @@
  #include 
  #include 
  #include 
+#include 
  #include 
  #include 
  #include 
@@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 unsigned int fdt_size;
 unsigned long kernel_load_addr;
 unsigned long initrd_load_addr = 0, fdt_load_addr;
-   void *fdt;
+   void *fdt = NULL;
 const void *slave_code;
 struct elfhdr ehdr;
 char *modified_cmdline = NULL;
@@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
 }

 fdt_size = fdt_totalsize(initial_boot_params) * 2;
-   fdt = kmalloc(fdt_size, GFP_KERNEL);
+   fdt = of_alloc_and_init_fdt(fdt_size);
 if (!fdt) {
 pr_err("Not enough memory for the device tree.\n");
 ret = -ENOMEM;
 goto out;
 }
-   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
-   if (ret < 0) {
-   pr_err("Error setting up the new device tree.\n");
-   ret = -EINVAL;
-   goto out;
-   }

 ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,


The first thing this function does is call setup_new_fdt() which first
calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
PPC code split. It looks like there's a 32-bit and 64-bit split, but
32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
setup_new_fdt_ppc64()). The arm64 version is calling
of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.

So we can just make of_alloc_and_init_fdt() also call
of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt to do
the alloc and copy). 

ok - will move fdt allocation into of_kexec_setup_new_fdt().

I don't think the architecture needs to pick the

size either. It's doubtful that either one is that sensitive to the
amount of extra space.

I am not clear about the above comment -
are you saying the architectures don't need to pass FDT size to the 
alloc function?


arm64 is adding command line string length and some extra space to the 
size computed from initial_boot_params for FDT Size:


buf_size = fdt_totalsize(initial_boot_params)
+ cmdline_len + DTB_EXTRA_SPACE;

powerpc is just using twice the size computed from initial_boot_params

fdt_size = fdt_totalsize(initial_boot_params) * 2;

I think it would be safe to let arm64 and powerpc pass the required FDT 
size, along with the other params to of_kexec_setup_new_fdt() - and in 
this function we allocate FDT and set it up.
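
As a sketch only (the helper name and the headroom value are assumptions,
not the posted code), a common sizing function shared by both architectures
could look like:

	static size_t kexec_fdt_size(const char *cmdline)
	{
		/* SZ_4K headroom is an assumed default, not a kernel constant */
		return fdt_totalsize(initial_boot_params) +
		       (cmdline ? strlen(cmdline) + 1 : 0) + SZ_4K;
	}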


And, for powerpc leave the remaining code in setup_new_fdt_ppc64().

Would that be ok?




   

Re: [PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Oliver O'Halloran
On Fri, Feb 5, 2021 at 5:17 AM Mayank Suman  wrote:
>
> Signed-off-by: Mayank Suman 

commit messages aren't optional

> ---
>  arch/powerpc/kernel/eeh.c| 8 
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
>  2 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 813713c9120c..2dbe1558a71f 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file 
> *filp,
> char buf[20];
> int ret;
>
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);

We should probably be zeroing the buffer. Reading to sizeof(buf) - 1
is done in a few places to guarantee that the string is nul
terminated, but without the preceeding memset() that isn't actually
guaranteed.

> +   if (ret <= 0)
> return -EFAULT;

EFAULT is supposed to be returned when the user supplies a buffer to
write(2) which is outside their address space. I figured letting the
sscanf() in the next step fail if the user passes writes a zero-length
buffer and returning EINVAL made more sense. That said, the exact
semantics around zero length writes are pretty handwavy so I guess
this isn't wrong, but I don't think it's better either.

> /*
> @@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> @@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
>
> memset(buf, 0, sizeof(buf));
> ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> -   if (!ret)
> +   if (ret <= 0)
> return -EFAULT;
>
> ret = sscanf(buf, "%x:%x:%x.%x", , , , );
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
> b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 89e22c460ebf..36ed2b8f7375 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
> return -ENXIO;
>
> /* Copy over argument buffer */
> -   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
> -   if (!ret)
> +   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, 
> count);
> +   if (ret <= 0)
> return -EFAULT;
>
> /* Retrieve parameters */
> --
> 2.30.0
>


[PATCH] KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path

2021-02-04 Thread Fabiano Rosas
The Facility Status and Control Register is a privileged SPR that
defines the availability of some features in problem state. Since it
can be written by the guest, we must restore it to the previous host
value after guest exit.

This restoration is currently done by taking the value from
current->thread.fscr, which in the P9 path is not enough anymore
because the guest could context switch the QEMU thread, causing the
guest-current value to be saved into the thread struct.

The above situation manifested when running a QEMU linked against a
libc with System Call Vectored support, which causes scv
instructions to be run by QEMU early during the guest boot (during
SLOF), at which point the FSCR is 0 due to guest entry. After a few
scv calls (1 to a couple hundred), the context switching happens and
the QEMU thread runs with the guest value, resulting in a Facility
Unavailable interrupt.

This patch saves and restores the host value of FSCR in the inner
guest entry loop in a way independent of current->thread.fscr. The old
way of doing it is still kept in place because it works for the old
entry path.

Signed-off-by: Fabiano Rosas 
---
 arch/powerpc/kvm/book3s_hv.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 6f612d240392..f2ddf7139a2a 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -3595,6 +3595,7 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
unsigned long host_tidr = mfspr(SPRN_TIDR);
unsigned long host_iamr = mfspr(SPRN_IAMR);
unsigned long host_amr = mfspr(SPRN_AMR);
+   unsigned long host_fscr = mfspr(SPRN_FSCR);
s64 dec;
u64 tb;
int trap, save_pmu;
@@ -3735,6 +3736,9 @@ static int kvmhv_p9_guest_entry(struct kvm_vcpu *vcpu, 
u64 time_limit,
if (host_amr != vcpu->arch.amr)
mtspr(SPRN_AMR, host_amr);
 
+   if (host_fscr != vcpu->arch.fscr)
+   mtspr(SPRN_FSCR, host_fscr);
+
msr_check_and_set(MSR_FP | MSR_VEC | MSR_VSX);
store_fp_state(>arch.fp);
 #ifdef CONFIG_ALTIVEC
-- 
2.29.2



[PATCH] arch:powerpc simple_write_to_buffer return check

2021-02-04 Thread Mayank Suman
Signed-off-by: Mayank Suman 
---
 arch/powerpc/kernel/eeh.c| 8 
 arch/powerpc/platforms/powernv/eeh-powernv.c | 4 ++--
 2 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
index 813713c9120c..2dbe1558a71f 100644
--- a/arch/powerpc/kernel/eeh.c
+++ b/arch/powerpc/kernel/eeh.c
@@ -1628,8 +1628,8 @@ static ssize_t eeh_force_recover_write(struct file *filp,
char buf[20];
int ret;
 
-   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-   if (!ret)
+   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+   if (ret <= 0)
return -EFAULT;
 
/*
@@ -1696,7 +1696,7 @@ static ssize_t eeh_dev_check_write(struct file *filp,
 
memset(buf, 0, sizeof(buf));
ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
+   if (ret <= 0)
return -EFAULT;
 
ret = sscanf(buf, "%x:%x:%x.%x", , , , );
@@ -1836,7 +1836,7 @@ static ssize_t eeh_dev_break_write(struct file *filp,
 
memset(buf, 0, sizeof(buf));
ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
-   if (!ret)
+   if (ret <= 0)
return -EFAULT;
 
ret = sscanf(buf, "%x:%x:%x.%x", , , , );
diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c 
b/arch/powerpc/platforms/powernv/eeh-powernv.c
index 89e22c460ebf..36ed2b8f7375 100644
--- a/arch/powerpc/platforms/powernv/eeh-powernv.c
+++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
@@ -76,8 +76,8 @@ static ssize_t pnv_eeh_ei_write(struct file *filp,
return -ENXIO;
 
/* Copy over argument buffer */
-   ret = simple_write_to_buffer(buf, sizeof(buf), ppos, user_buf, count);
-   if (!ret)
+   ret = simple_write_to_buffer(buf, sizeof(buf)-1, ppos, user_buf, count);
+   if (ret <= 0)
return -EFAULT;
 
/* Retrieve parameters */
-- 
2.30.0



Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Konrad Rzeszutek Wilk
On Thu, Feb 04, 2021 at 11:49:23AM +, Robin Murphy wrote:
> On 2021-02-04 07:29, Christoph Hellwig wrote:
> > On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:
> > > This patch converts several swiotlb related variables to arrays, in
> > > order to maintain stat/status for different swiotlb buffers. Here are
> > > variables involved:
> > > 
> > > - io_tlb_start and io_tlb_end
> > > - io_tlb_nslabs and io_tlb_used
> > > - io_tlb_list
> > > - io_tlb_index
> > > - max_segment
> > > - io_tlb_orig_addr
> > > - no_iotlb_memory
> > > 
> > > There is no functional change and this is to prepare to enable 64-bit
> > > swiotlb.
> > 
> > Claire Chang (on Cc) already posted a patch like this a month ago,
> > which looks much better because it actually uses a struct instead
> > of all the random variables.
> 
> Indeed, I skimmed the cover letter and immediately thought that this whole
> thing is just the restricted DMA pool concept[1] again, only from a slightly
> different angle.


Kind of. Let me lay out how some of these pieces are right now:

+---+  +--+
|   |  |  |
|   |  |  |
|   a)Xen-SWIOTLB   |  | b)SWIOTLB (for !Xen) |
|   |  |  |
+---XX--+  +---X--+
   X
  XX XXX
X   XX

 +--XX---+
 |   |
 |   |
 |   c) SWIOTLB generic  |
 |   |
 +---+

Dongli's patches modify the SWIOTLB generic c), and Xen-SWIOTLB a)
parts.

Also see the IOMMU_INIT logic, which lays this out in a bit more depth
(for example how to enable SWIOTLB on AMD boxes, or IBM with Calgary
IOMMU, etc - see iommu_table.h).

Furthermore, it lays the groundwork to allocate AMD SEV SWIOTLB buffers
later, after boot (so that you can stitch different pools together).
All the bits are kind of inside of the SWIOTLB code. And also it changes
the Xen-SWIOTLB to do something similar.

The mempool approach did it similarly, taking the internal parts (aka the
various io_tlb_* variables) of SWIOTLB and exposing them to
other code:

+---+  +--+
|   |  |  |
|   |  |  |
| a)Xen-SWIOTLB |  | b)SWIOTLB (for !Xen) |
|   |  |  |
+---XX--+  +---X--+
   X
  XX XXX
X   XX

 +--XX---+ +--+
 |   | | Device tree  |
 |   +<+ enabling SWIOTLB |
 |c) SWIOTLB generic | |  |
 |   | | mempool  |
 +---+ +--+

What I was suggesting to Claire was to follow the Xen model, that is,
do something like this:

+---+  +--+   ++
|   |  |  |   ||
|   |  |  |   ||
| a)Xen-SWIOTLB |  | b)SWIOTLB (for !Xen) |   | e) DT-SWIOTLB  |
|   |  |  |   ||
+---XX--+  +---X--+   +XX-X+
   XXXX X X XX X XX
  XX XXX
X   XX X

 +--XXX--+
 |   |
 |   |
 |c) SWIOTLB generic |
 |   |
 +---+


so it uses the SWIOTLB generic parts, and then bolts the device-tree
logic on top, along with the mempool logic.



But Christoph has an interesting suggestion, which is
to squash all the existing code (a, b, c) together
and pepper it with various jump tables.


So:


-+
| SWIOTLB:   |
||
|  a) SWIOTLB (for non-Xen)  |
|  b) Xen-SWIOTLB|
|  c) DT-SWIOTLB |
||
||
-+


with all the various bits (M2P/P2M for Xen, mempool for ARM,
and normal allocation for BM) in one big file.
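
For reference, the struct-based consolidation being referred to would group
today's globals roughly like this (a sketch, not the posted patch):

	struct io_tlb_mem {
		phys_addr_t start;	/* was io_tlb_start */
		phys_addr_t end;	/* was io_tlb_end */
		unsigned long nslabs;	/* was io_tlb_nslabs */
		unsigned long used;	/* was io_tlb_used */
		unsigned int index;	/* was io_tlb_index */
		unsigned int *list;	/* was io_tlb_list */
		phys_addr_t *orig_addr;	/* was io_tlb_orig_addr */
		bool no_iotlb_memory;
	};

	static struct io_tlb_mem io_tlb_default_mem;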



Re: [PATCH v16 11/12] powerpc: Use OF alloc and free for FDT

2021-02-04 Thread Rob Herring
On Thu, Feb 4, 2021 at 10:42 AM Lakshmi Ramasubramanian
 wrote:
>
> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> drivers/of/kexec.c to allocate and free memory for FDT.
>
> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> initialize the FDT, and to free the FDT respectively.
>
> powerpc sets the FDT address in image_loader_data field in
> "struct kimage" and the memory is freed in
> kimage_file_post_load_cleanup().  This cleanup function uses kfree()
> to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
> for allocation, the buffer needs to be freed using kvfree().

You could just change the kexec core to call kvfree() instead.

> Define "fdt" field in "struct kimage_arch" for powerpc to store
> the address of FDT, and free the memory in powerpc specific
> arch_kimage_file_post_load_cleanup().

However, given all the other buffers have an explicit field in kimage
or kimage_arch, changing powerpc to match arm64 is better IMO.
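
A sketch of the matching powerpc cleanup hook (simplified; the real hook
also frees the other per-image buffers listed above):

	int arch_kimage_file_post_load_cleanup(struct kimage *image)
	{
		kvfree(image->arch.fdt);
		image->arch.fdt = NULL;

		return kexec_image_post_load_cleanup_default(image);
	}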

> Signed-off-by: Lakshmi Ramasubramanian 
> Suggested-by: Rob Herring 
> Suggested-by: Thiago Jung Bauermann 
> ---
>  arch/powerpc/include/asm/kexec.h  |  2 ++
>  arch/powerpc/kexec/elf_64.c   | 26 --
>  arch/powerpc/kexec/file_load_64.c |  3 +++
>  3 files changed, 21 insertions(+), 10 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kexec.h 
> b/arch/powerpc/include/asm/kexec.h
> index 2c0be93d239a..d7d13cac4d31 100644
> --- a/arch/powerpc/include/asm/kexec.h
> +++ b/arch/powerpc/include/asm/kexec.h
> @@ -111,6 +111,8 @@ struct kimage_arch {
> unsigned long elf_headers_mem;
> unsigned long elf_headers_sz;
> void *elf_headers;
> +
> +   void *fdt;
>  };
>
>  char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
> diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
> index d0e459bb2f05..51d2d8eb6c1b 100644
> --- a/arch/powerpc/kexec/elf_64.c
> +++ b/arch/powerpc/kexec/elf_64.c
> @@ -19,6 +19,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
> unsigned int fdt_size;
> unsigned long kernel_load_addr;
> unsigned long initrd_load_addr = 0, fdt_load_addr;
> -   void *fdt;
> +   void *fdt = NULL;
> const void *slave_code;
> struct elfhdr ehdr;
> char *modified_cmdline = NULL;
> @@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
> }
>
> fdt_size = fdt_totalsize(initial_boot_params) * 2;
> -   fdt = kmalloc(fdt_size, GFP_KERNEL);
> +   fdt = of_alloc_and_init_fdt(fdt_size);
> if (!fdt) {
> pr_err("Not enough memory for the device tree.\n");
> ret = -ENOMEM;
> goto out;
> }
> -   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
> -   if (ret < 0) {
> -   pr_err("Error setting up the new device tree.\n");
> -   ret = -EINVAL;
> -   goto out;
> -   }
>
> ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,

The first thing this function does is call setup_new_fdt() which first
calls of_kexec_setup_new_fdt(). (Note, I really don't understand the
PPC code split. It looks like there's a 32-bit and 64-bit split, but
32-bit looks broken to me. Nothing ever calls setup_new_fdt() except
setup_new_fdt_ppc64()). The arm64 version is calling
of_alloc_and_init_fdt() and then of_kexec_setup_new_fdt() directly.

So we can just make of_alloc_and_init_fdt() also call
> of_kexec_setup_new_fdt() (really, just tweak of_kexec_setup_new_fdt to do
the alloc and copy). I don't think the architecture needs to pick the
size either. It's doubtful that either one is that sensitive to the
amount of extra space.
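
Roughly, the consolidated helper would then take the shape below (the name,
signature and headroom are assumptions for illustration only):

	void *of_kexec_alloc_and_setup_fdt(const char *cmdline)
	{
		size_t size = fdt_totalsize(initial_boot_params) +
			      (cmdline ? strlen(cmdline) + 1 : 0) +
			      SZ_4K;	/* assumed fixed headroom */
		void *fdt = of_alloc_and_init_fdt(size);

		if (!fdt)
			return NULL;
		/* ... then the usual of_kexec_setup_new_fdt() fixups ... */
		return fdt;
	}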

>   initrd_len, cmdline);
> @@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
> ret = kexec_add_buffer();
> if (ret)
> goto out;
> +
> +   /* FDT will be freed in arch_kimage_file_post_load_cleanup */
> +   image->arch.fdt = fdt;
> +
> fdt_load_addr = kbuf.mem;
>
> pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
> @@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char 
> *kernel_buf,
> kfree(modified_cmdline);
> kexec_free_elf_info(_info);
>
> -   /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
> -   return ret ? ERR_PTR(ret) : fdt;
> +   /*
> +* Once FDT buffer has been successfully passed to kexec_add_buffer(),
> +* the FDT buffer address is saved in image->arch.fdt. In that case,
> +* the memory cannot be freed here in case of any other error.
> +*/
> +   if (ret && !image->arch.fdt)
> +   of_free_fdt(fdt);

Just call kvfree() directly.

> +
> +   

Re: [PATCH] cpufreq: Remove unused flag CPUFREQ_PM_NO_WARN

2021-02-04 Thread Rafael J. Wysocki
On Tue, Feb 2, 2021 at 6:42 AM Viresh Kumar  wrote:
>
> This flag is set by one of the drivers but it isn't used in the code
> otherwise. Remove the unused flag and update the driver.
>
> Signed-off-by: Viresh Kumar 

Applied as 5.12 material, thanks!

> ---
> Rebased over:
>
> https://lore.kernel.org/lkml/a59bb322b22c247d570b70a8e94067804287623b.1612241683.git.viresh.ku...@linaro.org/
>
>  drivers/cpufreq/pmac32-cpufreq.c |  3 +--
>  include/linux/cpufreq.h  | 13 +
>  2 files changed, 6 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/cpufreq/pmac32-cpufreq.c 
> b/drivers/cpufreq/pmac32-cpufreq.c
> index 73621bc11976..4f20c6a9108d 100644
> --- a/drivers/cpufreq/pmac32-cpufreq.c
> +++ b/drivers/cpufreq/pmac32-cpufreq.c
> @@ -439,8 +439,7 @@ static struct cpufreq_driver pmac_cpufreq_driver = {
> .init   = pmac_cpufreq_cpu_init,
> .suspend= pmac_cpufreq_suspend,
> .resume = pmac_cpufreq_resume,
> -   .flags  = CPUFREQ_PM_NO_WARN |
> - CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
> +   .flags  = CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING,
> .attr   = cpufreq_generic_attr,
> .name   = "powermac",
>  };
> diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
> index c8e40e91fe9b..353969c7acd3 100644
> --- a/include/linux/cpufreq.h
> +++ b/include/linux/cpufreq.h
> @@ -398,8 +398,11 @@ struct cpufreq_driver {
>  /* loops_per_jiffy or other kernel "constants" aren't affected by frequency 
> transitions */
>  #define CPUFREQ_CONST_LOOPSBIT(1)
>
> -/* don't warn on suspend/resume speed mismatches */
> -#define CPUFREQ_PM_NO_WARN BIT(2)
> +/*
> + * Set by drivers that want the core to automatically register the cpufreq
> + * driver as a thermal cooling device.
> + */
> +#define CPUFREQ_IS_COOLING_DEV BIT(2)
>
>  /*
>   * This should be set by platforms having multiple clock-domains, i.e.
> @@ -431,12 +434,6 @@ struct cpufreq_driver {
>   */
>  #define CPUFREQ_NO_AUTO_DYNAMIC_SWITCHING  BIT(6)
>
> -/*
> - * Set by drivers that want the core to automatically register the cpufreq
> - * driver as a thermal cooling device.
> - */
> -#define CPUFREQ_IS_COOLING_DEV BIT(7)
> -
>  int cpufreq_register_driver(struct cpufreq_driver *driver_data);
>  int cpufreq_unregister_driver(struct cpufreq_driver *driver_data);
>
> --
> 2.25.0.rc1.19.g042ed3e048af
>


Re: [PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT

2021-02-04 Thread Will Deacon
On Thu, Feb 04, 2021 at 08:41:33AM -0800, Lakshmi Ramasubramanian wrote:
> of_alloc_and_init_fdt() and of_free_fdt() have been defined in
> drivers/of/kexec.c to allocate and free memory for FDT.
> 
> Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
> initialize the FDT, and to free the FDT respectively.
> 
> Signed-off-by: Lakshmi Ramasubramanian 
> Suggested-by: Rob Herring 
> ---
>  arch/arm64/kernel/machine_kexec_file.c | 37 +++---
>  1 file changed, 10 insertions(+), 27 deletions(-)

Acked-by: Will Deacon 

Will


[PATCH v2 1/2] ima: Free IMA measurement buffer on error

2021-02-04 Thread Lakshmi Ramasubramanian
IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in the ima_add_kexec_buffer() function.  In the error code paths this memory
is not freed, resulting in a memory leak.

Free the memory allocated for the IMA measurement list in
the error code paths of the ima_add_kexec_buffer() function.

Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Tyler Hicks 
Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list")
---
 security/integrity/ima/ima_kexec.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c
index 121de3e04af2..206ddcaa5c67 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -119,6 +119,7 @@ void ima_add_kexec_buffer(struct kimage *image)
ret = kexec_add_buffer();
if (ret) {
pr_err("Error passing over kexec measurement buffer.\n");
+   vfree(kexec_buffer);
return;
}
 
-- 
2.30.0



[PATCH v2 2/2] ima: Free IMA measurement buffer after kexec syscall

2021-02-04 Thread Lakshmi Ramasubramanian
IMA allocates kernel virtual memory to carry forward the measurement
list, from the current kernel to the next kernel on kexec system call,
in the ima_add_kexec_buffer() function.  This buffer is not freed before
the kexec system call completes, resulting in a memory leak.

Add ima_buffer field in "struct kimage" to store the virtual address
of the buffer allocated for the IMA measurement list.
Free the memory allocated for the IMA measurement list in
kimage_file_post_load_cleanup() function.

Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Tyler Hicks 
Reviewed-by: Thiago Jung Bauermann 
Reviewed-by: Tyler Hicks 
Fixes: 7b8589cc29e7 ("ima: on soft reboot, save the measurement list")
---
 include/linux/kexec.h  | 5 +
 kernel/kexec_file.c| 5 +
 security/integrity/ima/ima_kexec.c | 2 ++
 3 files changed, 12 insertions(+)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 9e93bef52968..5f61389f5f36 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -300,6 +300,11 @@ struct kimage {
/* Information for loading purgatory */
struct purgatory_info purgatory_info;
 #endif
+
+#ifdef CONFIG_IMA_KEXEC
+   /* Virtual address of IMA measurement buffer for kexec syscall */
+   void *ima_buffer;
+#endif
 };
 
 /* kexec interface functions */
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index b02086d70492..5c3447cf7ad5 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -166,6 +166,11 @@ void kimage_file_post_load_cleanup(struct kimage *image)
vfree(pi->sechdrs);
pi->sechdrs = NULL;
 
+#ifdef CONFIG_IMA_KEXEC
+   vfree(image->ima_buffer);
+   image->ima_buffer = NULL;
+#endif /* CONFIG_IMA_KEXEC */
+
/* See if architecture has anything to cleanup post load */
arch_kimage_file_post_load_cleanup(image);
 
diff --git a/security/integrity/ima/ima_kexec.c 
b/security/integrity/ima/ima_kexec.c
index 206ddcaa5c67..e29bea3dd4cc 100644
--- a/security/integrity/ima/ima_kexec.c
+++ b/security/integrity/ima/ima_kexec.c
@@ -129,6 +129,8 @@ void ima_add_kexec_buffer(struct kimage *image)
return;
}
 
+   image->ima_buffer = kexec_buffer;
+
pr_debug("kexec measurement buffer for the loaded kernel at 0x%lx.\n",
 kbuf.mem);
 }
-- 
2.30.0



Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Aneesh Kumar K.V

On 2/4/21 11:27 PM, Zorro Lang wrote:

On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote:

On 2/4/21 10:23 PM, Jens Axboe wrote:

On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:

On 2/2/21 11:50 AM, Christophe Leroy wrote:



Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :

On 2/2/21 11:32 AM, Christophe Leroy wrote:



Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :

Aneesh Kumar K.V  writes:


Nicholas Piggin  writes:


Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:

Christophe Leroy  writes:

+Aneesh

Le 29/01/2021 à 07:52, Zorro Lang a écrit :

..

[   96.200296] [ cut here ]
[   96.200304] Bug: Read fault blocked by KUAP!
[   96.200309] WARNING: CPU: 3 PID: 1876 at
arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310



[   96.200734] NIP [c0849424]
fault_in_pages_readable+0x104/0x350
[   96.200741] LR [c084952c]
fault_in_pages_readable+0x20c/0x350
[   96.200747] --- interrupt: 300



Problem happens in a section where userspace access is supposed
to be granted, so the patch you
proposed is definitely not the right fix.

c0849408:2c 01 00 4c isync
c084940c:a6 03 3d 7d mtspr   29,r9  <== granting
userspace access permission
c0849410:2c 01 00 4c isync
c0849414:00 00 36 e9 ld  r9,0(r22)
c0849418:20 00 29 81 lwz r9,32(r9)
c084941c:00 02 29 71 andi.   r9,r9,512
c0849420:78 d3 5e 7f mr  r30,r26
==> c0849424:00 00 bf 8b lbz r29,0(r31)  <==
accessing userspace
c0849428:10 00 82 41 beq c0849438

c084942c:2c 01 00 4c isync
c0849430:a6 03 bd 7e mtspr   29,r21  <==
clearing userspace access permission
c0849434:2c 01 00 4c isync
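
In C terms, the window shown above corresponds to this pattern (an
illustrative sketch; the real helpers live in
arch/powerpc/include/asm/book3s/64/kup.h):

	static char probe_user_byte(const char __user *ptr)
	{
		char c;

		allow_read_from_user(ptr, 1);	/* isync; mtspr AMR; isync */
		c = *(const char *)ptr;		/* the lbz at c0849424 */
		prevent_read_from_user(ptr, 1);	/* isync; mtspr AMR; isync */
		return c;
	}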

My first guess is that the problem is linked to the following
function, see the comment

/*
 * For kernel thread that doesn't have thread.regs return
 * default AMR/IAMR values.
 */
static inline u64 current_thread_amr(void)
{
  if (current->thread.regs)
  return current->thread.regs->amr;
  return AMR_KUAP_BLOCKED;
}

Above function was introduced by commit 48a8ab4eeb82
("powerpc/book3s64/pkeys: Don't update SPRN_AMR
when in kernel mode")


Yeah that's a bit of a curly one.

At some point io_uring did kthread_use_mm(), which is supposed to
mean
the kthread can operate on behalf of the original process that
submitted
the IO.

But because KUAP is implemented using memory protection keys, it
depends
on the value of the AMR register, which is not part of the mm,
it's in
thread.regs->amr.

And what's worse by the time we're in kthread_use_mm() we no
longer have
access to the thread.regs->amr of the original process that
submitted
the IO.

We also can't simply move the AMR into the mm, precisely because
it's
per thread, not per mm.

So TBH I don't know how we're going to fix this.

I guess we could return AMR=unblocked for kernel threads, but that's
arguably a bug because it allows a process to circumvent memory
keys by
asking the kernel to do the access.


We shouldn't need to inherit AMR should we? We only need it to be
locked
for kernel threads until it's explicitly unlocked -- nothing mm
specific
there. I think current_thread_amr could return 0 for kernel
threads? Or
I would even avoid using that function for allow_user_access and open
code the kthread case and remove it from current_thread_amr().

Thanks,
Nick
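
One reading of that suggestion, as a sketch (not the posted v2 below):

	static inline u64 current_thread_amr(void)
	{
		if (current->thread.regs)
			return current->thread.regs->amr;
		/*
		 * Kernel threads have no saved AMR; default to open here
		 * and rely on the explicit lock/unlock pairs for KUAP.
		 */
		return 0;
	}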




updated one

   From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" 
Date: Tue, 2 Feb 2021 09:23:38 +0530
Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
userspace
after kthread_use_mm

This fixes the bad fault reported by KUAP when io_wqe_worker accesses
userspace.

Bug: Read fault blocked by KUAP!
WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
__do_page_fault+0x6b4/0xcd0
NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
..
Call Trace:
[c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0
(unreliable)
[c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
[c00016367430] [c000c848] handle_page_fault+0x10/0x2c
--- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
..
NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
interrupt: 300
[c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
[c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
[c00016367990] [c0710330]
iomap_file_buffered_write+0xa0/0x120
[c000163679e0] [c0080040791c]
xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
[c00016367a90] [c06d74bc] io_write+0x10c/0x460
[c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Zorro Lang
On Thu, Feb 04, 2021 at 10:31:59PM +0530, Aneesh Kumar K.V wrote:
> On 2/4/21 10:23 PM, Jens Axboe wrote:
> > On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
> > > On 2/2/21 11:50 AM, Christophe Leroy wrote:
> > > > 
> > > > 
> > > > Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
> > > > > On 2/2/21 11:32 AM, Christophe Leroy wrote:
> > > > > > 
> > > > > > 
> > > > > > Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
> > > > > > > Aneesh Kumar K.V  writes:
> > > > > > > 
> > > > > > > > Nicholas Piggin  writes:
> > > > > > > > 
> > > > > > > > > Excerpts from Michael Ellerman's message of January 30, 2021 
> > > > > > > > > 9:22 pm:
> > > > > > > > > > Christophe Leroy  writes:
> > > > > > > > > > > +Aneesh
> > > > > > > > > > > 
> > > > > > > > > > > Le 29/01/2021 à 07:52, Zorro Lang a écrit :
> > > > > > > > > > ..
> > > > > > > > > > > > [   96.200296] [ cut here ]
> > > > > > > > > > > > [   96.200304] Bug: Read fault blocked by KUAP!
> > > > > > > > > > > > [   96.200309] WARNING: CPU: 3 PID: 1876 at
> > > > > > > > > > > > arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
> > > > > > > > > > > 
> > > > > > > > > > > > [   96.200734] NIP [c0849424]
> > > > > > > > > > > > fault_in_pages_readable+0x104/0x350
> > > > > > > > > > > > [   96.200741] LR [c084952c]
> > > > > > > > > > > > fault_in_pages_readable+0x20c/0x350
> > > > > > > > > > > > [   96.200747] --- interrupt: 300
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > Problem happens in a section where userspace access is 
> > > > > > > > > > > supposed
> > > > > > > > > > > to be granted, so the patch you
> > > > > > > > > > > proposed is definitely not the right fix.
> > > > > > > > > > > 
> > > > > > > > > > > c0849408:2c 01 00 4c isync
> > > > > > > > > > > c084940c:a6 03 3d 7d mtspr   29,r9  <== 
> > > > > > > > > > > granting
> > > > > > > > > > > userspace access permission
> > > > > > > > > > > c0849410:2c 01 00 4c isync
> > > > > > > > > > > c0849414:00 00 36 e9 ld  r9,0(r22)
> > > > > > > > > > > c0849418:20 00 29 81 lwz r9,32(r9)
> > > > > > > > > > > c084941c:00 02 29 71 andi.   r9,r9,512
> > > > > > > > > > > c0849420:78 d3 5e 7f mr  r30,r26
> > > > > > > > > > > ==> c0849424:00 00 bf 8b lbz 
> > > > > > > > > > > r29,0(r31)  <==
> > > > > > > > > > > accessing userspace
> > > > > > > > > > > c0849428:10 00 82 41 beq 
> > > > > > > > > > > c0849438
> > > > > > > > > > > 
> > > > > > > > > > > c084942c:2c 01 00 4c isync
> > > > > > > > > > > c0849430:a6 03 bd 7e mtspr   29,r21  <==
> > > > > > > > > > > clearing userspace access permission
> > > > > > > > > > > c0849434:2c 01 00 4c isync
> > > > > > > > > > > 
> > > > > > > > > > > My first guess is that the problem is linked to the 
> > > > > > > > > > > following
> > > > > > > > > > > function, see the comment
> > > > > > > > > > > 
> > > > > > > > > > > /*
> > > > > > > > > > > * For kernel thread that doesn't have thread.regs 
> > > > > > > > > > > return
> > > > > > > > > > > * default AMR/IAMR values.
> > > > > > > > > > > */
> > > > > > > > > > > static inline u64 current_thread_amr(void)
> > > > > > > > > > > {
> > > > > > > > > > >  if (current->thread.regs)
> > > > > > > > > > >  return current->thread.regs->amr;
> > > > > > > > > > >  return AMR_KUAP_BLOCKED;
> > > > > > > > > > > }
> > > > > > > > > > > 
> > > > > > > > > > > Above function was introduced by commit 48a8ab4eeb82
> > > > > > > > > > > ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
> > > > > > > > > > > when in kernel mode")
> > > > > > > > > > 
> > > > > > > > > > Yeah that's a bit of a curly one.
> > > > > > > > > > 
> > > > > > > > > > At some point io_uring did kthread_use_mm(), which is 
> > > > > > > > > > supposed to
> > > > > > > > > > mean
> > > > > > > > > > the kthread can operate on behalf of the original process 
> > > > > > > > > > that
> > > > > > > > > > submitted
> > > > > > > > > > the IO.
> > > > > > > > > > 
> > > > > > > > > > But because KUAP is implemented using memory protection 
> > > > > > > > > > keys, it
> > > > > > > > > > depends
> > > > > > > > > > on the value of the AMR register, which is not part of the 
> > > > > > > > > > mm,
> > > > > > > > > > it's in
> > > > > > > > > > thread.regs->amr.
> > > > > > > > > > 
> > > > > > > > > > And what's worse by the time we're in kthread_use_mm() we no
> > > > > > > > > > longer have
> > > > > > > > > > access to the thread.regs->amr of the original process that
> > > > > > > > > > submitted
> > > > > > > > > > the IO.
> > > > > > > > > > 
> > > > > > > > > > We also can't simply move the AMR into the mm, precisely 
> > > > > > > > > > because
> > > > > > > > > > it's
> > > > > > > > > > per thread, not per mm.
> > > > > > > 

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Jens Axboe
On 2/4/21 10:01 AM, Aneesh Kumar K.V wrote:
> On 2/4/21 10:23 PM, Jens Axboe wrote:
>> On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
>>> On 2/2/21 11:50 AM, Christophe Leroy wrote:


 Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
> On 2/2/21 11:32 AM, Christophe Leroy wrote:
>>
>>
>> Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
>>> Aneesh Kumar K.V  writes:
>>>
 Nicholas Piggin  writes:

> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
>> Christophe Leroy  writes:
>>> +Aneesh
>>>
>>> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
>> ..
 [   96.200296] [ cut here ]
 [   96.200304] Bug: Read fault blocked by KUAP!
 [   96.200309] WARNING: CPU: 3 PID: 1876 at
 arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>>>
 [   96.200734] NIP [c0849424]
 fault_in_pages_readable+0x104/0x350
 [   96.200741] LR [c084952c]
 fault_in_pages_readable+0x20c/0x350
 [   96.200747] --- interrupt: 300
>>>
>>>
>>> Problem happens in a section where userspace access is supposed
>>> to be granted, so the patch you
>>> proposed is definitely not the right fix.
>>>
>>> c0849408:2c 01 00 4c isync
>>> c084940c:a6 03 3d 7d mtspr   29,r9  <== granting
>>> userspace access permission
>>> c0849410:2c 01 00 4c isync
>>> c0849414:00 00 36 e9 ld  r9,0(r22)
>>> c0849418:20 00 29 81 lwz r9,32(r9)
>>> c084941c:00 02 29 71 andi.   r9,r9,512
>>> c0849420:78 d3 5e 7f mr  r30,r26
>>> ==> c0849424:00 00 bf 8b lbz r29,0(r31)  <==
>>> accessing userspace
>>> c0849428:10 00 82 41 beq c0849438
>>> 
>>> c084942c:2c 01 00 4c isync
>>> c0849430:a6 03 bd 7e mtspr   29,r21  <==
>>> clearing userspace access permission
>>> c0849434:2c 01 00 4c isync
>>>
>>> My first guess is that the problem is linked to the following
>>> function, see the comment
>>>
>>> /*
>>> * For kernel thread that doesn't have thread.regs return
>>> * default AMR/IAMR values.
>>> */
>>> static inline u64 current_thread_amr(void)
>>> {
>>>  if (current->thread.regs)
>>>  return current->thread.regs->amr;
>>>  return AMR_KUAP_BLOCKED;
>>> }
>>>
>>> Above function was introduced by commit 48a8ab4eeb82
>>> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
>>> when in kernel mode")
>>
>> Yeah that's a bit of a curly one.
>>
>> At some point io_uring did kthread_use_mm(), which is supposed to
>> mean
>> the kthread can operate on behalf of the original process that
>> submitted
>> the IO.
>>
>> But because KUAP is implemented using memory protection keys, it
>> depends
>> on the value of the AMR register, which is not part of the mm,
>> it's in
>> thread.regs->amr.
>>
>> And what's worse by the time we're in kthread_use_mm() we no
>> longer have
>> access to the thread.regs->amr of the original process that
>> submitted
>> the IO.
>>
>> We also can't simply move the AMR into the mm, precisely because
>> it's
>> per thread, not per mm.
>>
>> So TBH I don't know how we're going to fix this.
>>
>> I guess we could return AMR=unblocked for kernel threads, but that's
>> arguably a bug because it allows a process to circumvent memory
>> keys by
>> asking the kernel to do the access.
>
> We shouldn't need to inherit AMR should we? We only need it to be
> locked
> for kernel threads until it's explicitly unlocked -- nothing mm
> specific
> there. I think current_thread_amr could return 0 for kernel
> threads? Or
> I would even avoid using that function for allow_user_access and open
> code the kthread case and remove it from current_thread_amr().
>
> Thanks,
> Nick

>>>
>>> updated one
>>>
>>>   From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
>>> From: "Aneesh Kumar K.V" 
>>> Date: Tue, 2 Feb 2021 09:23:38 +0530
>>> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
>>> userspace
>>>after kthread_use_mm
>>>
>>> This fixes the bad fault

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Aneesh Kumar K.V

On 2/4/21 10:23 PM, Jens Axboe wrote:

On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:

On 2/2/21 11:50 AM, Christophe Leroy wrote:



Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :

On 2/2/21 11:32 AM, Christophe Leroy wrote:



Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :

Aneesh Kumar K.V  writes:


Nicholas Piggin  writes:


Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:

Christophe Leroy  writes:

+Aneesh

Le 29/01/2021 à 07:52, Zorro Lang a écrit :

..

[   96.200296] [ cut here ]
[   96.200304] Bug: Read fault blocked by KUAP!
[   96.200309] WARNING: CPU: 3 PID: 1876 at
arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310



[   96.200734] NIP [c0849424]
fault_in_pages_readable+0x104/0x350
[   96.200741] LR [c084952c]
fault_in_pages_readable+0x20c/0x350
[   96.200747] --- interrupt: 300



Problem happens in a section where userspace access is supposed
to be granted, so the patch you
proposed is definitely not the right fix.

c0849408:2c 01 00 4c isync
c084940c:a6 03 3d 7d mtspr   29,r9  <== granting
userspace access permission
c0849410:2c 01 00 4c isync
c0849414:00 00 36 e9 ld  r9,0(r22)
c0849418:20 00 29 81 lwz r9,32(r9)
c084941c:00 02 29 71 andi.   r9,r9,512
c0849420:78 d3 5e 7f mr  r30,r26
==> c0849424:00 00 bf 8b lbz r29,0(r31)  <==
accessing userspace
c0849428:10 00 82 41 beq c0849438

c084942c:2c 01 00 4c isync
c0849430:a6 03 bd 7e mtspr   29,r21  <==
clearing userspace access permission
c0849434:2c 01 00 4c isync

My first guess is that the problem is linked to the following
function, see the comment

/*
* For kernel thread that doesn't have thread.regs return
* default AMR/IAMR values.
*/
static inline u64 current_thread_amr(void)
{
 if (current->thread.regs)
 return current->thread.regs->amr;
 return AMR_KUAP_BLOCKED;
}

Above function was introduced by commit 48a8ab4eeb82
("powerpc/book3s64/pkeys: Don't update SPRN_AMR
when in kernel mode")


Yeah that's a bit of a curly one.

At some point io_uring did kthread_use_mm(), which is supposed to
mean
the kthread can operate on behalf of the original process that
submitted
the IO.

But because KUAP is implemented using memory protection keys, it
depends
on the value of the AMR register, which is not part of the mm,
it's in
thread.regs->amr.

And what's worse by the time we're in kthread_use_mm() we no
longer have
access to the thread.regs->amr of the original process that
submitted
the IO.

We also can't simply move the AMR into the mm, precisely because
it's
per thread, not per mm.

So TBH I don't know how we're going to fix this.

I guess we could return AMR=unblocked for kernel threads, but that's
arguably a bug because it allows a process to circumvent memory
keys by
asking the kernel to do the access.


We shouldn't need to inherit AMR should we? We only need it to be
locked
for kernel threads until it's explicitly unlocked -- nothing mm
specific
there. I think current_thread_amr could return 0 for kernel
threads? Or
I would even avoid using that function for allow_user_access and open
code the kthread case and remove it from current_thread_amr().

Thanks,
Nick




updated one

  From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" 
Date: Tue, 2 Feb 2021 09:23:38 +0530
Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access
userspace
   after kthread_use_mm

This fixes the bad fault reported by KUAP when io_wqe_worker accesses
userspace.

   Bug: Read fault blocked by KUAP!
   WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229
__do_page_fault+0x6b4/0xcd0
   NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
   LR [c009e7e0] __do_page_fault+0x6b0/0xcd0
..
   Call Trace:
   [c00016367330] [c009e7e0] __do_page_fault+0x6b0/0xcd0
(unreliable)
   [c000163673e0] [c009ee3c] do_page_fault+0x3c/0x120
   [c00016367430] [c000c848] handle_page_fault+0x10/0x2c
   --- interrupt: 300 at iov_iter_fault_in_readable+0x148/0x6f0
..
   NIP [c08e8228] iov_iter_fault_in_readable+0x148/0x6f0
   LR [c08e834c] iov_iter_fault_in_readable+0x26c/0x6f0
   interrupt: 300
   [c000163677e0] [c07154a0] iomap_write_actor+0xc0/0x280
   [c00016367880] [c070fc94] iomap_apply+0x1c4/0x780
   [c00016367990] [c0710330]
iomap_file_buffered_write+0xa0/0x120
   [c000163679e0] [c0080040791c]
xfs_file_buffered_aio_write+0x314/0x5e0 [xfs]
   [c00016367a90] [c06d74bc] io_write+0x10c/0x460
   [c00016367bb0] [c06d80e4] io_issue_sqe+0x8d4/0x1200
   [c00016367c70] [c06d8ad0] io_wq_submit_work+0xc0/0x250
   [c00016367cb0] [c06e2578]

Re: [PATCH] powerpc/fault: fix wrong KUAP fault for IO_URING

2021-02-04 Thread Jens Axboe
On 2/1/21 11:30 PM, Aneesh Kumar K.V wrote:
> On 2/2/21 11:50 AM, Christophe Leroy wrote:
>>
>>
>> Le 02/02/2021 à 07:16, Aneesh Kumar K.V a écrit :
>>> On 2/2/21 11:32 AM, Christophe Leroy wrote:


 Le 02/02/2021 à 06:55, Aneesh Kumar K.V a écrit :
> Aneesh Kumar K.V  writes:
>
>> Nicholas Piggin  writes:
>>
>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:22 pm:
 Christophe Leroy  writes:
> +Aneesh
>
> Le 29/01/2021 à 07:52, Zorro Lang a écrit :
 ..
>> [   96.200296] [ cut here ]
>> [   96.200304] Bug: Read fault blocked by KUAP!
>> [   96.200309] WARNING: CPU: 3 PID: 1876 at 
>> arch/powerpc/mm/fault.c:229 bad_kernel_fault+0x180/0x310
>
>> [   96.200734] NIP [c0849424] 
>> fault_in_pages_readable+0x104/0x350
>> [   96.200741] LR [c084952c] 
>> fault_in_pages_readable+0x20c/0x350
>> [   96.200747] --- interrupt: 300
>
>
> Problem happens in a section where userspace access is supposed 
> to be granted, so the patch you
> proposed is definitely not the right fix.
>
> c0849408:2c 01 00 4c isync
> c084940c:a6 03 3d 7d mtspr   29,r9  <== granting 
> userspace access permission
> c0849410:2c 01 00 4c isync
> c0849414:00 00 36 e9 ld  r9,0(r22)
> c0849418:20 00 29 81 lwz r9,32(r9)
> c084941c:00 02 29 71 andi.   r9,r9,512
> c0849420:78 d3 5e 7f mr  r30,r26
> ==> c0849424:00 00 bf 8b lbz r29,0(r31)  <== 
> accessing userspace
> c0849428:10 00 82 41 beq c0849438 
> 
> c084942c:2c 01 00 4c isync
> c0849430:a6 03 bd 7e mtspr   29,r21  <== 
> clearing userspace access permission
> c0849434:2c 01 00 4c isync
>
> My first guess is that the problem is linked to the following 
> function, see the comment
>
> /*
>* For kernel thread that doesn't have thread.regs return
>* default AMR/IAMR values.
>*/
> static inline u64 current_thread_amr(void)
> {
> if (current->thread.regs)
> return current->thread.regs->amr;
> return AMR_KUAP_BLOCKED;
> }
>
> Above function was introduced by commit 48a8ab4eeb82 
> ("powerpc/book3s64/pkeys: Don't update SPRN_AMR
> when in kernel mode")

 Yeah that's a bit of a curly one.

 At some point io_uring did kthread_use_mm(), which is supposed to mean
 the kthread can operate on behalf of the original process that submitted
 the IO.

 But because KUAP is implemented using memory protection keys, it depends
 on the value of the AMR register, which is not part of the mm, it's in
 thread.regs->amr.

 And what's worse, by the time we're in kthread_use_mm() we no longer have
 access to the thread.regs->amr of the original process that submitted
 the IO.

 We also can't simply move the AMR into the mm, precisely because it's
 per thread, not per mm.

 So TBH I don't know how we're going to fix this.

 I guess we could return AMR=unblocked for kernel threads, but that's
 arguably a bug because it allows a process to circumvent memory keys by
 asking the kernel to do the access.
>>> We shouldn't need to inherit AMR should we? We only need it to be locked
>>> for kernel threads until it's explicitly unlocked -- nothing mm specific
>>> there. I think current_thread_amr could return 0 for kernel threads? Or
>>> I would even avoid using that function for allow_user_access and open
>>> code the kthread case and remove it from current_thread_amr().
>>> code the kthread case and remove it from current_thread_amr().
>>>
>>> Thanks,
>>> Nick
>>
>
> updated one
>
>  From 8fdb0680f983940d61f91da8252b13c8d3e8ebee Mon Sep 17 00:00:00 2001
> From: "Aneesh Kumar K.V" 
> Date: Tue, 2 Feb 2021 09:23:38 +0530
> Subject: [PATCH v2] powerpc/kuap: Allow kernel thread to access userspace
>  after kthread_use_mm
>
> This fixes the bad fault reported by KUAP when io_wqe_worker accesses
> userspace.
>
>   Bug: Read fault blocked by KUAP!
>   WARNING: CPU: 1 PID: 101841 at arch/powerpc/mm/fault.c:229 __do_page_fault+0x6b4/0xcd0
>   NIP [c009e7e4] __do_page_fault+0x6b4/0xcd0
>   LR [c009e7e0] 
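
For illustration only, a minimal sketch of the direction Nick suggests
above -- an unblocked AMR for kernel threads instead of AMR_KUAP_BLOCKED --
could look like this (not the final fix):

static inline u64 current_thread_amr(void)
{
	if (current->thread.regs)
		return current->thread.regs->amr;

	/*
	 * Kernel thread: nothing mm-specific to inherit. Returning 0
	 * (fully unblocked) rather than AMR_KUAP_BLOCKED lets a kthread
	 * that called kthread_use_mm() access user memory once KUAP is
	 * explicitly unlocked, without replaying the submitter's AMR.
	 */
	return 0;
}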

[PATCH v16 11/12] powerpc: Use OF alloc and free for FDT

2021-02-04 Thread Lakshmi Ramasubramanian
of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

powerpc sets the FDT address in the image_loader_data field in
"struct kimage", and the memory is freed in
kimage_file_post_load_cleanup().  This cleanup function uses kfree()
to free the memory. But since of_alloc_and_init_fdt() uses kvmalloc()
for allocation, the buffer needs to be freed using kvfree().
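
A condensed sketch of the mismatch being avoided, assuming only the
standard kvmalloc()/kvfree() semantics:

	void *fdt = kvmalloc(fdt_size, GFP_KERNEL);
	/*
	 * kvmalloc() may satisfy the request from vmalloc() space, in
	 * which case kfree(fdt) would be wrong; the matching release is:
	 */
	kvfree(fdt);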

Define "fdt" field in "struct kimage_arch" for powerpc to store
the address of FDT, and free the memory in powerpc specific
arch_kimage_file_post_load_cleanup().

Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Rob Herring 
Suggested-by: Thiago Jung Bauermann 
---
 arch/powerpc/include/asm/kexec.h  |  2 ++
 arch/powerpc/kexec/elf_64.c   | 26 --
 arch/powerpc/kexec/file_load_64.c |  3 +++
 3 files changed, 21 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 2c0be93d239a..d7d13cac4d31 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -111,6 +111,8 @@ struct kimage_arch {
unsigned long elf_headers_mem;
unsigned long elf_headers_sz;
void *elf_headers;
+
+   void *fdt;
 };
 
 char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index d0e459bb2f05..51d2d8eb6c1b 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -19,6 +19,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -32,7 +33,7 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
unsigned int fdt_size;
unsigned long kernel_load_addr;
unsigned long initrd_load_addr = 0, fdt_load_addr;
-   void *fdt;
+   void *fdt = NULL;
const void *slave_code;
struct elfhdr ehdr;
char *modified_cmdline = NULL;
@@ -103,18 +104,12 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
}
 
fdt_size = fdt_totalsize(initial_boot_params) * 2;
-   fdt = kmalloc(fdt_size, GFP_KERNEL);
+   fdt = of_alloc_and_init_fdt(fdt_size);
if (!fdt) {
pr_err("Not enough memory for the device tree.\n");
ret = -ENOMEM;
goto out;
}
-   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
-   if (ret < 0) {
-   pr_err("Error setting up the new device tree.\n");
-   ret = -EINVAL;
-   goto out;
-   }
 
ret = setup_new_fdt_ppc64(image, fdt, initrd_load_addr,
  initrd_len, cmdline);
@@ -131,6 +126,10 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
	ret = kexec_add_buffer(&kbuf);
if (ret)
goto out;
+
+   /* FDT will be freed in arch_kimage_file_post_load_cleanup */
+   image->arch.fdt = fdt;
+
fdt_load_addr = kbuf.mem;
 
pr_debug("Loaded device tree at 0x%lx\n", fdt_load_addr);
@@ -145,8 +144,15 @@ static void *elf64_load(struct kimage *image, char *kernel_buf,
kfree(modified_cmdline);
kexec_free_elf_info(&elf_info);
 
-   /* Make kimage_file_post_load_cleanup free the fdt buffer for us. */
-   return ret ? ERR_PTR(ret) : fdt;
+   /*
+* Once FDT buffer has been successfully passed to kexec_add_buffer(),
+* the FDT buffer address is saved in image->arch.fdt. In that case,
+* the memory cannot be freed here in case of any other error.
+*/
+   if (ret && !image->arch.fdt)
+   of_free_fdt(fdt);
+
+   return ret ? ERR_PTR(ret) : NULL;
 }
 
 const struct kexec_file_ops kexec_elf64_ops = {
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index 3cab318aa3b9..d9d5b5569a6d 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -1113,5 +1113,8 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
image->arch.elf_headers = NULL;
image->arch.elf_headers_sz = 0;
 
+   of_free_fdt(image->arch.fdt);
+   image->arch.fdt = NULL;
+
return kexec_image_post_load_cleanup_default(image);
 }
-- 
2.30.0



[PATCH v16 12/12] arm64: Enable passing IMA log to next kernel on kexec

2021-02-04 Thread Lakshmi Ramasubramanian
Update CONFIG_KEXEC_FILE to select CONFIG_HAVE_IMA_KEXEC, if CONFIG_IMA
is enabled, to indicate that the IMA measurement log information is
present in the device tree for ARM64.

Co-developed-by: Prakhar Srivastava 
Signed-off-by: Prakhar Srivastava 
Signed-off-by: Lakshmi Ramasubramanian 
---
 arch/arm64/Kconfig | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 05e17351e4f3..8a93573cebb6 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1093,6 +1093,7 @@ config KEXEC
 config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
+   select HAVE_IMA_KEXEC if IMA
help
  This is new version of kexec system call. This system call is
  file based and takes file descriptors as system call argument
-- 
2.30.0



[PATCH v16 10/12] arm64: Use OF alloc and free functions for FDT

2021-02-04 Thread Lakshmi Ramasubramanian
of_alloc_and_init_fdt() and of_free_fdt() have been defined in
drivers/of/kexec.c to allocate and free memory for FDT.

Use of_alloc_and_init_fdt() and of_free_fdt() to allocate and
initialize the FDT, and to free the FDT respectively.

Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Rob Herring 
---
 arch/arm64/kernel/machine_kexec_file.c | 37 +++---
 1 file changed, 10 insertions(+), 27 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 7da22bb7b9d5..7d6cc478f73c 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -29,7 +29,7 @@ const struct kexec_file_ops * const kexec_file_loaders[] = {
 
 int arch_kimage_file_post_load_cleanup(struct kimage *image)
 {
-   vfree(image->arch.dtb);
+   of_free_fdt(image->arch.dtb);
image->arch.dtb = NULL;
 
vfree(image->arch.elf_headers);
@@ -57,36 +57,19 @@ static int create_dtb(struct kimage *image,
cmdline_len = cmdline ? strlen(cmdline) : 0;
buf_size = fdt_totalsize(initial_boot_params)
+ cmdline_len + DTB_EXTRA_SPACE;
-
-   for (;;) {
-   buf = vmalloc(buf_size);
-   if (!buf)
-   return -ENOMEM;
-
-   /* duplicate a device tree blob */
-   ret = fdt_open_into(initial_boot_params, buf, buf_size);
-   if (ret)
-   return -EINVAL;
-
-   ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr,
+   buf = of_alloc_and_init_fdt(buf_size);
+   if (!buf)
+   return -ENOMEM;
+   ret = of_kexec_setup_new_fdt(image, buf, initrd_load_addr,
 initrd_len, cmdline);
-   if (ret) {
-   vfree(buf);
-   if (ret == -ENOMEM) {
-   /* unlikely, but just in case */
-   buf_size += DTB_EXTRA_SPACE;
-   continue;
-   } else {
-   return ret;
-   }
-   }
-
+   if (!ret) {
/* trim it */
fdt_pack(buf);
*dtb = buf;
+   } else
+   of_free_fdt(buf);
 
-   return 0;
-   }
+   return ret;
 }
 
 static int prepare_elf_headers(void **addr, unsigned long *sz)
@@ -224,6 +207,6 @@ int load_other_segments(struct kimage *image,
 
 out_err:
image->nr_segments = orig_segments;
-   vfree(dtb);
+   of_free_fdt(dtb);
return ret;
 }
-- 
2.30.0



[PATCH v16 08/12] powerpc: Delete unused function delete_fdt_mem_rsv()

2021-02-04 Thread Lakshmi Ramasubramanian
delete_fdt_mem_rsv() defined in "arch/powerpc/kexec/file_load.c"
has been renamed to fdt_find_and_del_mem_rsv(), and moved to
"drivers/of/kexec.c".

Remove delete_fdt_mem_rsv() in "arch/powerpc/kexec/file_load.c".

Co-developed-by: Prakhar Srivastava 
Signed-off-by: Prakhar Srivastava 
Signed-off-by: Lakshmi Ramasubramanian 
---
 arch/powerpc/include/asm/kexec.h |  1 -
 arch/powerpc/kexec/file_load.c   | 32 
 2 files changed, 33 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 939bc40dfa62..2c0be93d239a 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -118,7 +118,6 @@ char *setup_kdump_cmdline(struct kimage *image, char 
*cmdline,
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr);
-int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
 struct kexec_buf;
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 5dd3a9c45a2d..036c8fb48fc3 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -108,35 +108,3 @@ int setup_purgatory(struct kimage *image, const void *slave_code,
 
return 0;
 }
-
-/**
- * delete_fdt_mem_rsv - delete memory reservation with given address and size
- *
- * Return: 0 on success, or negative errno on error.
- */
-int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size)
-{
-   int i, ret, num_rsvs = fdt_num_mem_rsv(fdt);
-
-   for (i = 0; i < num_rsvs; i++) {
-   uint64_t rsv_start, rsv_size;
-
-   ret = fdt_get_mem_rsv(fdt, i, &rsv_start, &rsv_size);
-   if (ret) {
-   pr_err("Malformed device tree.\n");
-   return -EINVAL;
-   }
-
-   if (rsv_start == start && rsv_size == size) {
-   ret = fdt_del_mem_rsv(fdt, i);
-   if (ret) {
-   pr_err("Error deleting device tree reservation.\n");
-   return -EINVAL;
-   }
-
-   return 0;
-   }
-   }
-
-   return -ENOENT;
-}
-- 
2.30.0



[PATCH v16 09/12] of: Define functions to allocate and free FDT

2021-02-04 Thread Lakshmi Ramasubramanian
Kernel components that use Flattened Device Tree (FDT) allocate kernel
memory and call fdt_open_into() to reorganize the tree into a form
suitable for the read-write operations.  These operations can be
combined into a single function to allocate and initialize the FDT
so the different architectures do not have to duplicate the code.

Define of_alloc_and_init_fdt() and of_free_fdt() in drivers/of/kexec.c
to allocate and initialize FDT, and to free the FDT buffer respectively.

Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Rob Herring 
Suggested-by: Joe Perches 
---
 drivers/of/kexec.c | 37 +
 include/linux/of.h |  2 ++
 2 files changed, 39 insertions(+)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 5ae0e5d90f55..197e71104f47 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -11,6 +11,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -28,6 +29,42 @@
 #define FDT_PROP_RNG_SEED  "rng-seed"
 #define RNG_SEED_SIZE  128
 
+/**
+ * of_alloc_and_init_fdt - Allocate and initialize a Flattened device tree
+ *
+ * @fdt_size:  Flattened device tree size
+ *
+ * Return the allocated FDT buffer on success, or NULL on error.
+ */
+void *of_alloc_and_init_fdt(unsigned int fdt_size)
+{
+   void *fdt;
+   int ret;
+
+   fdt = kvmalloc(fdt_size, GFP_KERNEL);
+   if (!fdt)
+   return NULL;
+
+   ret = fdt_open_into(initial_boot_params, fdt, fdt_size);
+   if (ret < 0) {
+   pr_err("Error setting up the new device tree.\n");
+   kvfree(fdt);
+   fdt = NULL;
+   }
+
+   return fdt;
+}
+
+/**
+ * of_free_fdt - Free the buffer for Flattened device tree
+ *
+ * @fdt:   Flattened device tree buffer to free
+ */
+void of_free_fdt(void *fdt)
+{
+   kvfree(fdt);
+}
+
 /**
 * fdt_find_and_del_mem_rsv - delete memory reservation with given address and size
  *
diff --git a/include/linux/of.h b/include/linux/of.h
index 19f77dd12507..9f0261761e28 100644
--- a/include/linux/of.h
+++ b/include/linux/of.h
@@ -563,6 +563,8 @@ struct kimage;
 int of_kexec_setup_new_fdt(const struct kimage *image, void *fdt,
   unsigned long initrd_load_addr, unsigned long initrd_len,
   const char *cmdline);
+void *of_alloc_and_init_fdt(unsigned int fdt_size);
+void of_free_fdt(void *fdt);
 
 #ifdef CONFIG_IMA_KEXEC
 int of_ima_add_kexec_buffer(struct kimage *image,
-- 
2.30.0
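
For reference, a hypothetical caller of the two helpers above might look
like this (error handling condensed; the doubled size mirrors the powerpc
caller earlier in the series):

	void *fdt;

	/* Writable copy of the live device tree, with room to grow. */
	fdt = of_alloc_and_init_fdt(fdt_totalsize(initial_boot_params) * 2);
	if (!fdt)
		return -ENOMEM;

	/* ... edit the copy with libfdt calls ... */

	of_free_fdt(fdt);
	return 0;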



[PATCH v16 07/12] kexec: Use fdt_appendprop_addrrange() to add ima buffer to FDT

2021-02-04 Thread Lakshmi Ramasubramanian
The fdt_appendprop_addrrange() function adds a property, with the given
name, to the device tree at the given node offset, and also sets the
address and size of the property.  This function should be used to add
the "linux,ima-kexec-buffer" property to the device tree and set the
address and size of the IMA measurement buffer, instead of using a
custom function.

Use fdt_appendprop_addrrange() to add the "linux,ima-kexec-buffer"
property to the device tree.  This property holds the address and size
of the IMA measurement buffer that needs to be passed from the current
kernel to the next kernel across the kexec system call.

Remove custom code that is used in setup_ima_buffer() to add
"linux,ima-kexec-buffer" property to the device tree.

Co-developed-by: Prakhar Srivastava 
Signed-off-by: Prakhar Srivastava 
Signed-off-by: Lakshmi Ramasubramanian 
Reviewed-by: Thiago Jung Bauermann 
---
 drivers/of/kexec.c | 57 --
 1 file changed, 5 insertions(+), 52 deletions(-)

diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 7ee4f498ca19..5ae0e5d90f55 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -232,36 +232,6 @@ int of_ima_add_kexec_buffer(struct kimage *image,
return 0;
 }
 
-/**
- * write_number - Convert number to big-endian format
- *
- * @p: Buffer to write the number to
- * @value: Number to convert
- * @cells: Number of cells
- *
- * Return: 0 on success, or negative errno on error.
- */
-static int write_number(void *p, u64 value, int cells)
-{
-   if (cells == 1) {
-   u32 tmp;
-
-   if (value > U32_MAX)
-   return -EINVAL;
-
-   tmp = cpu_to_be32(value);
-   memcpy(p, &tmp, sizeof(tmp));
-   } else if (cells == 2) {
-   u64 tmp;
-
-   tmp = cpu_to_be64(value);
-   memcpy(p, &tmp, sizeof(tmp));
-   } else
-   return -EINVAL;
-
-   return 0;
-}
-
 /**
  * setup_ima_buffer - add IMA buffer information to the fdt
  * @image: kexec image being loaded.
@@ -273,32 +243,15 @@ static int write_number(void *p, u64 value, int cells)
 static int setup_ima_buffer(const struct kimage *image, void *fdt,
int chosen_node)
 {
-   int ret, addr_cells, size_cells, entry_size;
-   u8 value[16];
+   int ret;
 
if (!image->ima_buffer_size)
return 0;
 
-   ret = get_addr_size_cells(&addr_cells, &size_cells);
-   if (ret)
-   return ret;
-
-   entry_size = 4 * (addr_cells + size_cells);
-
-   if (entry_size > sizeof(value))
-   return -EINVAL;
-
-   ret = write_number(value, image->ima_buffer_addr, addr_cells);
-   if (ret)
-   return ret;
-
-   ret = write_number(value + 4 * addr_cells, image->ima_buffer_size,
-  size_cells);
-   if (ret)
-   return ret;
-
-   ret = fdt_setprop(fdt, chosen_node, "linux,ima-kexec-buffer", value,
- entry_size);
+   ret = fdt_appendprop_addrrange(fdt, 0, chosen_node,
+  "linux,ima-kexec-buffer",
+  image->ima_buffer_addr,
+  image->ima_buffer_size);
if (ret < 0)
return -EINVAL;
 
-- 
2.30.0



[PATCH v16 06/12] powerpc: Move arch independent ima kexec functions to drivers/of/kexec.c

2021-02-04 Thread Lakshmi Ramasubramanian
The functions defined in "arch/powerpc/kexec/ima.c" handle setting up
and freeing the resources required to carry over the IMA measurement
list from the current kernel to the next kernel across the kexec system call.
These functions do not have architecture specific code, but are
currently limited to powerpc.

Move setup_ima_buffer() call into of_kexec_setup_new_fdt() defined in
"drivers/of/kexec.c".  Call of_kexec_setup_new_fdt() from
setup_new_fdt_ppc64() and remove setup_new_fdt() in
"arch/powerpc/kexec/file_load.c".

Move the remaining architecture independent functions from
"arch/powerpc/kexec/ima.c" to "drivers/of/kexec.c".
Delete "arch/powerpc/kexec/ima.c" and "arch/powerpc/include/asm/ima.h".
Remove references to the deleted files in powerpc and in ima.

Co-developed-by: Prakhar Srivastava 
Signed-off-by: Prakhar Srivastava 
Signed-off-by: Lakshmi Ramasubramanian 
---
 arch/powerpc/Kconfig  |   2 +-
 arch/powerpc/include/asm/ima.h|  27 
 arch/powerpc/include/asm/kexec.h  |   3 -
 arch/powerpc/kexec/Makefile   |   7 -
 arch/powerpc/kexec/file_load.c|  35 -
 arch/powerpc/kexec/file_load_64.c |   4 +-
 arch/powerpc/kexec/ima.c  | 202 -
 drivers/of/kexec.c| 239 ++
 include/linux/of.h|   2 +
 security/integrity/ima/ima.h  |   4 -
 10 files changed, 245 insertions(+), 280 deletions(-)
 delete mode 100644 arch/powerpc/include/asm/ima.h
 delete mode 100644 arch/powerpc/kexec/ima.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..d6e593ad270e 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -554,7 +554,7 @@ config KEXEC
 config KEXEC_FILE
bool "kexec file based system call"
select KEXEC_CORE
-   select HAVE_IMA_KEXEC
+   select HAVE_IMA_KEXEC if IMA
select BUILD_BIN2C
select KEXEC_ELF
depends on PPC64
diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
deleted file mode 100644
index 51f64fd06c19..
--- a/arch/powerpc/include/asm/ima.h
+++ /dev/null
@@ -1,27 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef _ASM_POWERPC_IMA_H
-#define _ASM_POWERPC_IMA_H
-
-struct kimage;
-
-int ima_get_kexec_buffer(void **addr, size_t *size);
-int ima_free_kexec_buffer(void);
-
-#ifdef CONFIG_IMA
-void remove_ima_buffer(void *fdt, int chosen_node);
-#else
-static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
-#endif
-
-#ifdef CONFIG_IMA_KEXEC
-int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
-#else
-static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
-  int chosen_node)
-{
-   remove_ima_buffer(fdt, chosen_node);
-   return 0;
-}
-#endif /* CONFIG_IMA_KEXEC */
-
-#endif /* _ASM_POWERPC_IMA_H */
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 2248dc27080b..939bc40dfa62 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -118,9 +118,6 @@ char *setup_kdump_cmdline(struct kimage *image, char 
*cmdline,
 int setup_purgatory(struct kimage *image, const void *slave_code,
const void *fdt, unsigned long kernel_load_addr,
unsigned long fdt_load_addr);
-int setup_new_fdt(const struct kimage *image, void *fdt,
- unsigned long initrd_load_addr, unsigned long initrd_len,
- const char *cmdline);
 int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size);
 
 #ifdef CONFIG_PPC64
diff --git a/arch/powerpc/kexec/Makefile b/arch/powerpc/kexec/Makefile
index 4aff6846c772..b6c52608cb49 100644
--- a/arch/powerpc/kexec/Makefile
+++ b/arch/powerpc/kexec/Makefile
@@ -9,13 +9,6 @@ obj-$(CONFIG_PPC32)+= relocate_32.o
 
obj-$(CONFIG_KEXEC_FILE)   += file_load.o ranges.o file_load_$(BITS).o elf_$(BITS).o
 
-ifdef CONFIG_HAVE_IMA_KEXEC
-ifdef CONFIG_IMA
-obj-y  += ima.o
-endif
-endif
-
-
 # Disable GCOV, KCOV & sanitizers in odd or sensitive code
 GCOV_PROFILE_core_$(BITS).o := n
 KCOV_INSTRUMENT_core_$(BITS).o := n
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 956bcb2d1ec2..5dd3a9c45a2d 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -20,7 +20,6 @@
 #include 
 #include 
 #include 
-#include 
 
#define SLAVE_CODE_SIZE	256	/* First 0x100 bytes */
 
@@ -141,37 +140,3 @@ int delete_fdt_mem_rsv(void *fdt, unsigned long start, unsigned long size)
 
return -ENOENT;
 }
-
-/*
- * setup_new_fdt - modify /chosen and memory reservation for the next kernel
- * @image: kexec image being loaded.
- * @fdt:   Flattened device tree for the next kernel.
- * @initrd_load_addr:  Address where the next initrd will be loaded.
- * @initrd_len:Size of the next initrd, or 0 if 

[PATCH v16 05/12] powerpc: Move ima buffer fields to struct kimage

2021-02-04 Thread Lakshmi Ramasubramanian
The fields ima_buffer_addr and ima_buffer_size in "struct kimage_arch"
for powerpc are used to carry forward the IMA measurement list across
the kexec system call.  These fields are not architecture specific, but are
currently limited to powerpc.

arch_ima_add_kexec_buffer() defined in "arch/powerpc/kexec/ima.c"
sets ima_buffer_addr and ima_buffer_size for the kexec system call.
This function does not have architecture specific code, but is
currently limited to powerpc.

Move ima_buffer_addr and ima_buffer_size to "struct kimage".
Rename arch_ima_add_kexec_buffer() to of_ima_add_kexec_buffer()
and move it in drivers/of/kexec.c.

Co-developed-by: Prakhar Srivastava 
Signed-off-by: Prakhar Srivastava 
Signed-off-by: Lakshmi Ramasubramanian 
Suggested-by: Will Deacon 
---
 arch/powerpc/include/asm/ima.h |  3 ---
 arch/powerpc/include/asm/kexec.h   |  5 -
 arch/powerpc/kexec/ima.c   | 29 ++---
 drivers/of/kexec.c | 23 +++
 include/linux/kexec.h  |  5 +
 include/linux/of.h |  5 +
 security/integrity/ima/ima_kexec.c |  3 ++-
 7 files changed, 41 insertions(+), 32 deletions(-)

diff --git a/arch/powerpc/include/asm/ima.h b/arch/powerpc/include/asm/ima.h
index ead488cf3981..51f64fd06c19 100644
--- a/arch/powerpc/include/asm/ima.h
+++ b/arch/powerpc/include/asm/ima.h
@@ -14,9 +14,6 @@ static inline void remove_ima_buffer(void *fdt, int chosen_node) {}
 #endif
 
 #ifdef CONFIG_IMA_KEXEC
-int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
- size_t size);
-
 int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node);
 #else
 static inline int setup_ima_buffer(const struct kimage *image, void *fdt,
diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index dbf09d2f36d0..2248dc27080b 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -111,11 +111,6 @@ struct kimage_arch {
unsigned long elf_headers_mem;
unsigned long elf_headers_sz;
void *elf_headers;
-
-#ifdef CONFIG_IMA_KEXEC
-   phys_addr_t ima_buffer_addr;
-   size_t ima_buffer_size;
-#endif
 };
 
 char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
diff --git a/arch/powerpc/kexec/ima.c b/arch/powerpc/kexec/ima.c
index 720e50e490b6..ed38125e2f87 100644
--- a/arch/powerpc/kexec/ima.c
+++ b/arch/powerpc/kexec/ima.c
@@ -128,23 +128,6 @@ void remove_ima_buffer(void *fdt, int chosen_node)
 }
 
 #ifdef CONFIG_IMA_KEXEC
-/**
- * arch_ima_add_kexec_buffer - do arch-specific steps to add the IMA buffer
- *
- * Architectures should use this function to pass on the IMA buffer
- * information to the next kernel.
- *
- * Return: 0 on success, negative errno on error.
- */
-int arch_ima_add_kexec_buffer(struct kimage *image, unsigned long load_addr,
- size_t size)
-{
-   image->arch.ima_buffer_addr = load_addr;
-   image->arch.ima_buffer_size = size;
-
-   return 0;
-}
-
 static int write_number(void *p, u64 value, int cells)
 {
if (cells == 1) {
@@ -180,7 +163,7 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node)
u8 value[16];
 
remove_ima_buffer(fdt, chosen_node);
-   if (!image->arch.ima_buffer_size)
+   if (!image->ima_buffer_size)
return 0;
 
ret = get_addr_size_cells(&addr_cells, &size_cells);
@@ -192,11 +175,11 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node)
if (entry_size > sizeof(value))
return -EINVAL;
 
-   ret = write_number(value, image->arch.ima_buffer_addr, addr_cells);
+   ret = write_number(value, image->ima_buffer_addr, addr_cells);
if (ret)
return ret;
 
-   ret = write_number(value + 4 * addr_cells, image->arch.ima_buffer_size,
+   ret = write_number(value + 4 * addr_cells, image->ima_buffer_size,
   size_cells);
if (ret)
return ret;
@@ -206,13 +189,13 @@ int setup_ima_buffer(const struct kimage *image, void *fdt, int chosen_node)
if (ret < 0)
return -EINVAL;
 
-   ret = fdt_add_mem_rsv(fdt, image->arch.ima_buffer_addr,
- image->arch.ima_buffer_size);
+   ret = fdt_add_mem_rsv(fdt, image->ima_buffer_addr,
+ image->ima_buffer_size);
if (ret)
return -EINVAL;
 
pr_debug("IMA buffer at 0x%llx, size = 0x%zx\n",
-image->arch.ima_buffer_addr, image->arch.ima_buffer_size);
+image->ima_buffer_addr, image->ima_buffer_size);
 
return 0;
 }
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
index 4afd3cc1c04a..efbf54f164fd 100644
--- a/drivers/of/kexec.c
+++ b/drivers/of/kexec.c
@@ -63,6 +63,29 @@ static int fdt_find_and_del_mem_rsv(void *fdt, unsigned long 
start, unsigned 

[PATCH v16 04/12] powerpc: Use common of_kexec_setup_new_fdt()

2021-02-04 Thread Lakshmi Ramasubramanian
From: Rob Herring 

The code for setting up the /chosen node in the device tree
and updating the memory reservation for the next kernel has been
moved to of_kexec_setup_new_fdt() defined in "drivers/of/kexec.c".

Use the common of_kexec_setup_new_fdt() to setup the device tree
and update the memory reservation for kexec for powerpc.

Signed-off-by: Rob Herring 
Reviewed-by: Thiago Jung Bauermann 
Reviewed-by: Lakshmi Ramasubramanian 
---
 arch/powerpc/kexec/file_load.c | 125 ++---
 1 file changed, 6 insertions(+), 119 deletions(-)

diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index e452b11df631..956bcb2d1ec2 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -16,6 +16,7 @@
 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -156,132 +157,18 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
  unsigned long initrd_load_addr, unsigned long initrd_len,
  const char *cmdline)
 {
-   int ret, chosen_node;
-   const void *prop;
-
-   /* Remove memory reservation for the current device tree. */
-   ret = delete_fdt_mem_rsv(fdt, __pa(initial_boot_params),
-fdt_totalsize(initial_boot_params));
-   if (ret == 0)
-   pr_debug("Removed old device tree reservation.\n");
-   else if (ret != -ENOENT)
-   return ret;
-
-   chosen_node = fdt_path_offset(fdt, "/chosen");
-   if (chosen_node == -FDT_ERR_NOTFOUND) {
-   chosen_node = fdt_add_subnode(fdt, fdt_path_offset(fdt, "/"),
- "chosen");
-   if (chosen_node < 0) {
-   pr_err("Error creating /chosen.\n");
-   return -EINVAL;
-   }
-   } else if (chosen_node < 0) {
-   pr_err("Malformed device tree: error reading /chosen.\n");
-   return -EINVAL;
-   }
-
-   /* Did we boot using an initrd? */
-   prop = fdt_getprop(fdt, chosen_node, "linux,initrd-start", NULL);
-   if (prop) {
-   uint64_t tmp_start, tmp_end, tmp_size;
-
-   tmp_start = fdt64_to_cpu(*((const fdt64_t *) prop));
-
-   prop = fdt_getprop(fdt, chosen_node, "linux,initrd-end", NULL);
-   if (!prop) {
-   pr_err("Malformed device tree.\n");
-   return -EINVAL;
-   }
-   tmp_end = fdt64_to_cpu(*((const fdt64_t *) prop));
-
-   /*
-* kexec reserves exact initrd size, while firmware may
-* reserve a multiple of PAGE_SIZE, so check for both.
-*/
-   tmp_size = tmp_end - tmp_start;
-   ret = delete_fdt_mem_rsv(fdt, tmp_start, tmp_size);
-   if (ret == -ENOENT)
-   ret = delete_fdt_mem_rsv(fdt, tmp_start,
-round_up(tmp_size, PAGE_SIZE));
-   if (ret == 0)
-   pr_debug("Removed old initrd reservation.\n");
-   else if (ret != -ENOENT)
-   return ret;
-
-   /* If there's no new initrd, delete the old initrd's info. */
-   if (initrd_len == 0) {
-   ret = fdt_delprop(fdt, chosen_node,
- "linux,initrd-start");
-   if (ret) {
-   pr_err("Error deleting linux,initrd-start.\n");
-   return -EINVAL;
-   }
-
-   ret = fdt_delprop(fdt, chosen_node, "linux,initrd-end");
-   if (ret) {
-   pr_err("Error deleting linux,initrd-end.\n");
-   return -EINVAL;
-   }
-   }
-   }
-
-   if (initrd_len) {
-   ret = fdt_setprop_u64(fdt, chosen_node,
- "linux,initrd-start",
- initrd_load_addr);
-   if (ret < 0)
-   goto err;
-
-   /* initrd-end is the first address after the initrd image. */
-   ret = fdt_setprop_u64(fdt, chosen_node, "linux,initrd-end",
- initrd_load_addr + initrd_len);
-   if (ret < 0)
-   goto err;
-
-   ret = fdt_add_mem_rsv(fdt, initrd_load_addr, initrd_len);
-   if (ret) {
-   pr_err("Error reserving initrd memory: %s\n",
-  fdt_strerror(ret));
-   return -EINVAL;
-   }
-   }
-
-   if (cmdline != NULL) {
-   ret = fdt_setprop_string(fdt, chosen_node, "bootargs", cmdline);
-   if (ret < 0)
-   goto err;
-   } else {

[PATCH v16 03/12] arm64: Use common of_kexec_setup_new_fdt()

2021-02-04 Thread Lakshmi Ramasubramanian
From: Rob Herring 

The code for setting up the /chosen node in the device tree
and updating the memory reservation for the next kernel has been
moved to of_kexec_setup_new_fdt() defined in "drivers/of/kexec.c".

Use the common of_kexec_setup_new_fdt() to setup the device tree
and update the memory reservation for kexec for arm64.

Signed-off-by: Rob Herring 
Reviewed-by: Thiago Jung Bauermann 
Reviewed-by: Lakshmi Ramasubramanian 
Acked-by: Will Deacon 
---
 arch/arm64/kernel/machine_kexec_file.c | 123 +
 1 file changed, 3 insertions(+), 120 deletions(-)

diff --git a/arch/arm64/kernel/machine_kexec_file.c b/arch/arm64/kernel/machine_kexec_file.c
index 03210f644790..7da22bb7b9d5 100644
--- a/arch/arm64/kernel/machine_kexec_file.c
+++ b/arch/arm64/kernel/machine_kexec_file.c
@@ -15,23 +15,12 @@
 #include 
 #include 
 #include 
+#include 
 #include 
-#include 
 #include 
 #include 
 #include 
 #include 
-#include 
-
-/* relevant device tree properties */
-#define FDT_PROP_KEXEC_ELFHDR  "linux,elfcorehdr"
-#define FDT_PROP_MEM_RANGE "linux,usable-memory-range"
-#define FDT_PROP_INITRD_START  "linux,initrd-start"
-#define FDT_PROP_INITRD_END"linux,initrd-end"
-#define FDT_PROP_BOOTARGS  "bootargs"
-#define FDT_PROP_KASLR_SEED"kaslr-seed"
-#define FDT_PROP_RNG_SEED  "rng-seed"
-#define RNG_SEED_SIZE  128
 
 const struct kexec_file_ops * const kexec_file_loaders[] = {
_image_ops,
@@ -50,112 +39,6 @@ int arch_kimage_file_post_load_cleanup(struct kimage *image)
return kexec_image_post_load_cleanup_default(image);
 }
 
-static int setup_dtb(struct kimage *image,
-unsigned long initrd_load_addr, unsigned long initrd_len,
-char *cmdline, void *dtb)
-{
-   int off, ret;
-
-   ret = fdt_path_offset(dtb, "/chosen");
-   if (ret < 0)
-   goto out;
-
-   off = ret;
-
-   ret = fdt_delprop(dtb, off, FDT_PROP_KEXEC_ELFHDR);
-   if (ret && ret != -FDT_ERR_NOTFOUND)
-   goto out;
-   ret = fdt_delprop(dtb, off, FDT_PROP_MEM_RANGE);
-   if (ret && ret != -FDT_ERR_NOTFOUND)
-   goto out;
-
-   if (image->type == KEXEC_TYPE_CRASH) {
-   /* add linux,elfcorehdr */
-   ret = fdt_appendprop_addrrange(dtb, 0, off,
-   FDT_PROP_KEXEC_ELFHDR,
-   image->arch.elf_headers_mem,
-   image->arch.elf_headers_sz);
-   if (ret)
-   return (ret == -FDT_ERR_NOSPACE ? -ENOMEM : -EINVAL);
-
-   /* add linux,usable-memory-range */
-   ret = fdt_appendprop_addrrange(dtb, 0, off,
-   FDT_PROP_MEM_RANGE,
-   crashk_res.start,
-   crashk_res.end - crashk_res.start + 1);
-   if (ret)
-   return (ret == -FDT_ERR_NOSPACE ? -ENOMEM : -EINVAL);
-   }
-
-   /* add bootargs */
-   if (cmdline) {
-   ret = fdt_setprop_string(dtb, off, FDT_PROP_BOOTARGS, cmdline);
-   if (ret)
-   goto out;
-   } else {
-   ret = fdt_delprop(dtb, off, FDT_PROP_BOOTARGS);
-   if (ret && (ret != -FDT_ERR_NOTFOUND))
-   goto out;
-   }
-
-   /* add initrd-* */
-   if (initrd_load_addr) {
-   ret = fdt_setprop_u64(dtb, off, FDT_PROP_INITRD_START,
- initrd_load_addr);
-   if (ret)
-   goto out;
-
-   ret = fdt_setprop_u64(dtb, off, FDT_PROP_INITRD_END,
- initrd_load_addr + initrd_len);
-   if (ret)
-   goto out;
-   } else {
-   ret = fdt_delprop(dtb, off, FDT_PROP_INITRD_START);
-   if (ret && (ret != -FDT_ERR_NOTFOUND))
-   goto out;
-
-   ret = fdt_delprop(dtb, off, FDT_PROP_INITRD_END);
-   if (ret && (ret != -FDT_ERR_NOTFOUND))
-   goto out;
-   }
-
-   /* add kaslr-seed */
-   ret = fdt_delprop(dtb, off, FDT_PROP_KASLR_SEED);
-   if (ret == -FDT_ERR_NOTFOUND)
-   ret = 0;
-   else if (ret)
-   goto out;
-
-   if (rng_is_initialized()) {
-   u64 seed = get_random_u64();
-   ret = fdt_setprop_u64(dtb, off, FDT_PROP_KASLR_SEED, seed);
-   if (ret)
-   goto out;
-   } else {
-   pr_notice("RNG is not initialised: omitting \"%s\" property\n",
-   FDT_PROP_KASLR_SEED);
-   }
-
-   /* add rng-seed */
-   if (rng_is_initialized()) {
-   void *rng_seed;
-   ret = fdt_setprop_placeholder(dtb, off, FDT_PROP_RNG_SEED,
-   RNG_SEED_SIZE, &rng_seed);
-   if 

[PATCH v16 00/12] Carry forward IMA measurement log on kexec on ARM64

2021-02-04 Thread Lakshmi Ramasubramanian
On kexec file load, the Integrity Measurement Architecture (IMA) subsystem
may verify the IMA signature of the kernel and initramfs, and measure
it.  The command line parameters passed to the kernel in the kexec call
may also be measured by IMA.  A remote attestation service can verify
a TPM quote based on the TPM event log, the IMA measurement list, and
the TPM PCR data.  This can be achieved only if the IMA measurement log
is carried over from the current kernel to the next kernel across
the kexec call.

powerpc already supports carrying forward the IMA measurement log on
kexec.  This patch set adds support for carrying forward the IMA
measurement log on kexec on ARM64.

This patch set moves the platform independent code defined for powerpc
such that it can be reused for other platforms as well.  A chosen node
"linux,ima-kexec-buffer" is added to the DTB for ARM64 to hold
the address and the size of the memory reserved to carry
the IMA measurement log.

This patch set has been tested for ARM64 platform using QEMU.
I would like help from the community for testing this change on powerpc.
Thanks.

This patch set is based on
commit b3f82afc1041 ("IMA: Measure kernel version in early boot")
in https://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git
"next-integrity" branch.

Changelog:

v16
  - Defined functions to allocate and free buffer for FDT for powerpc
and arm64.
  - Moved ima_buffer_addr and ima_buffer_size fields from
"struct kimage_arch" in powerpc to "struct kimage"
v15
  - Included Rob's patches in the patch set, and rebased
the changes to "next-integrity" branch.
  - Allocate memory for DTB, on arm64, using kmalloc() instead of
vmalloc() to keep it consistent with powerpc implementation.
  - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and
remove setup_new_fdt() in the same patch to keep it bisect safe.

v14
  - Select CONFIG_HAVE_IMA_KEXEC for CONFIG_KEXEC_FILE, for powerpc
and arm64, if CONFIG_IMA is enabled.
  - Use IS_ENABLED() macro instead of "#ifdef" in remove_ima_buffer(),
ima_get_kexec_buffer(), and ima_free_kexec_buffer().
  - Call of_kexec_setup_new_fdt() from setup_new_fdt_ppc64() and
remove setup_new_fdt() in "arch/powerpc/kexec/file_load.c".

v13
  - Moved the arch independent functions to drivers/of/kexec.c
and then refactored the code.
  - Moved arch_ima_add_kexec_buffer() to
security/integrity/ima/ima_kexec.c

v12
  - Use fdt_appendprop_addrrange() in setup_ima_buffer()
to setup the IMA measurement list property in
the device tree.
  - Moved architecture independent functions from
"arch/powerpc/kexec/ima.c" to "drivers/of/kexec."
  - Deleted "arch/powerpc/kexec/ima.c" and
"arch/powerpc/include/asm/ima.h".

v11
  - Rebased the changes on the kexec code refactoring done by
Rob Herring in his "dt/kexec" branch
  - Removed "extern" keyword in function declarations
  - Removed unnecessary header files included in C files
  - Updated patch descriptions per Thiago's comments

v10
  - Moved delete_fdt_mem_rsv(), remove_ima_buffer(),
get_ima_kexec_buffer, and get_root_addr_size_cells()
to drivers/of/kexec.c
  - Moved arch_ima_add_kexec_buffer() to
security/integrity/ima/ima_kexec.c
  - Conditionally define IMA buffer fields in struct kimage_arch

v9
  - Moved delete_fdt_mem_rsv() to drivers/of/kexec_fdt.c
  - Defined a new function get_ima_kexec_buffer() in
drivers/of/ima_kexec.c to replace do_get_kexec_buffer()
  - Changed remove_ima_kexec_buffer() to the original function name
remove_ima_buffer()
  - Moved remove_ima_buffer() to drivers/of/ima_kexec.c
  - Moved ima_get_kexec_buffer() and ima_free_kexec_buffer()
to security/integrity/ima/ima_kexec.c

v8:
  - Moved remove_ima_kexec_buffer(), do_get_kexec_buffer(), and
delete_fdt_mem_rsv() to drivers/of/fdt.c
  - Moved ima_dump_measurement_list() and ima_add_kexec_buffer()
back to security/integrity/ima/ima_kexec.c

v7:
  - Renamed remove_ima_buffer() to remove_ima_kexec_buffer() and moved
this function definition to kernel.
  - Moved delete_fdt_mem_rsv() definition to kernel
  - Moved ima_dump_measurement_list() and ima_add_kexec_buffer() to
a new file namely ima_kexec_fdt.c in IMA

v6:
  - Remove any existing FDT_PROP_IMA_KEXEC_BUFFER property in the device
tree and also its corresponding memory reservation in the currently
running kernel.
  - Moved the function remove_ima_buffer() defined for powerpc to IMA
and renamed the function to ima_remove_kexec_buffer(). Also, moved
delete_fdt_mem_rsv() from powerpc to IMA.

v5:
  - Merged get_addr_size_cells() and do_get_kexec_buffer() into a single
function when moving the arch independent code from powerpc to IMA
  - Reverted the change to use FDT functions in powerpc code and added
back the original code in get_addr_size_cells() and
do_get_kexec_buffer() for powerpc.
  - Added fdt_add_mem_rsv() for ARM64 to reserve the memory for
the IMA 

[PATCH v16 01/12] powerpc: Rename kexec elfcorehdr_addr to elf_headers_mem

2021-02-04 Thread Lakshmi Ramasubramanian
From: Rob Herring 

The architecture specific field, elfcorehdr_addr in struct kimage_arch,
that holds the address of the buffer in memory for ELF core header for
powerpc has a different name than the one used for arm64.  This makes
it hard to have a common code for setting up the device tree for
kexec system call.

Rename elfcorehdr_addr to elf_headers_mem to align with arm64 name so
common code can use it.

Signed-off-by: Rob Herring 
Reviewed-by: Thiago Jung Bauermann 
Reviewed-by: Lakshmi Ramasubramanian 
---
 arch/powerpc/include/asm/kexec.h  | 2 +-
 arch/powerpc/kexec/file_load.c| 4 ++--
 arch/powerpc/kexec/file_load_64.c | 4 ++--
 3 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 55d6ede30c19..dbf09d2f36d0 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -108,7 +108,7 @@ struct kimage_arch {
unsigned long backup_start;
void *backup_buf;
 
-   unsigned long elfcorehdr_addr;
+   unsigned long elf_headers_mem;
unsigned long elf_headers_sz;
void *elf_headers;
 
diff --git a/arch/powerpc/kexec/file_load.c b/arch/powerpc/kexec/file_load.c
index 9a232bc36c8f..e452b11df631 100644
--- a/arch/powerpc/kexec/file_load.c
+++ b/arch/powerpc/kexec/file_load.c
@@ -45,7 +45,7 @@ char *setup_kdump_cmdline(struct kimage *image, char *cmdline,
return NULL;
 
elfcorehdr_strlen = sprintf(cmdline_ptr, "elfcorehdr=0x%lx ",
-   image->arch.elfcorehdr_addr);
+   image->arch.elf_headers_mem);
 
if (elfcorehdr_strlen + cmdline_len > COMMAND_LINE_SIZE) {
pr_err("Appending elfcorehdr= exceeds cmdline size\n");
@@ -263,7 +263,7 @@ int setup_new_fdt(const struct kimage *image, void *fdt,
 * Avoid elfcorehdr from being stomped on in kdump kernel by
 * setting up memory reserve map.
 */
-   ret = fdt_add_mem_rsv(fdt, image->arch.elfcorehdr_addr,
+   ret = fdt_add_mem_rsv(fdt, image->arch.elf_headers_mem,
  image->arch.elf_headers_sz);
if (ret) {
pr_err("Error reserving elfcorehdr memory: %s\n",
diff --git a/arch/powerpc/kexec/file_load_64.c b/arch/powerpc/kexec/file_load_64.c
index c69bcf9b547a..a05c19b3cc60 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -815,7 +815,7 @@ static int load_elfcorehdr_segment(struct kimage *image, struct kexec_buf *kbuf)
goto out;
}
 
-   image->arch.elfcorehdr_addr = kbuf->mem;
+   image->arch.elf_headers_mem = kbuf->mem;
image->arch.elf_headers_sz = headers_sz;
image->arch.elf_headers = headers;
 out:
@@ -851,7 +851,7 @@ int load_crashdump_segments_ppc64(struct kimage *image,
return ret;
}
pr_debug("Loaded elf core header at 0x%lx, bufsz=0x%lx memsz=0x%lx\n",
-image->arch.elfcorehdr_addr, kbuf->bufsz, kbuf->memsz);
+image->arch.elf_headers_mem, kbuf->bufsz, kbuf->memsz);
 
return 0;
 }
-- 
2.30.0



[PATCH v16 02/12] of: Add a common kexec FDT setup function

2021-02-04 Thread Lakshmi Ramasubramanian
From: Rob Herring 

Both arm64 and powerpc do essentially the same FDT /chosen setup for
kexec.  The differences are either omissions that arm64 should have
or additional properties that will be ignored.  The setup code can be
combined and shared by both powerpc and arm64.

The differences relative to the arm64 version:
 - If /chosen doesn't exist, it will be created (should never happen).
 - Any old dtb and initrd reserved memory will be released.
 - The new initrd and elfcorehdr are marked reserved.
 - "linux,booted-from-kexec" is set.

The differences relative to the powerpc version:
 - "kaslr-seed" and "rng-seed" may be set.
 - "linux,elfcorehdr" is set.
 - Any existing "linux,usable-memory-range" is removed.

Combine the code for setting up the /chosen node in the FDT and updating
the memory reservation for kexec, for powerpc and arm64, in
of_kexec_setup_new_fdt() and move it to "drivers/of/kexec.c".

Signed-off-by: Rob Herring 
Reviewed-by: Thiago Jung Bauermann 
Reviewed-by: Lakshmi Ramasubramanian 
---
 drivers/of/Makefile |   1 +
 drivers/of/kexec.c  | 236 
 include/linux/of.h  |   4 +
 3 files changed, 241 insertions(+)
 create mode 100644 drivers/of/kexec.c

diff --git a/drivers/of/Makefile b/drivers/of/Makefile
index 6e1e5212f058..8ce11955afde 100644
--- a/drivers/of/Makefile
+++ b/drivers/of/Makefile
@@ -13,5 +13,6 @@ obj-$(CONFIG_OF_RESERVED_MEM) += of_reserved_mem.o
 obj-$(CONFIG_OF_RESOLVE)  += resolver.o
 obj-$(CONFIG_OF_OVERLAY) += overlay.o
 obj-$(CONFIG_OF_NUMA) += of_numa.o
+obj-$(CONFIG_KEXEC_FILE) += kexec.o
 
 obj-$(CONFIG_OF_UNITTEST) += unittest-data/
diff --git a/drivers/of/kexec.c b/drivers/of/kexec.c
new file mode 100644
index ..4afd3cc1c04a
--- /dev/null
+++ b/drivers/of/kexec.c
@@ -0,0 +1,236 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright (C) 2020 Arm Limited
+ *
+ * Based on arch/arm64/kernel/machine_kexec_file.c:
+ *  Copyright (C) 2018 Linaro Limited
+ *
+ * And arch/powerpc/kexec/file_load.c:
+ *  Copyright (C) 2016  IBM Corporation
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+/* relevant device tree properties */
+#define FDT_PROP_KEXEC_ELFHDR  "linux,elfcorehdr"
+#define FDT_PROP_MEM_RANGE "linux,usable-memory-range"
+#define FDT_PROP_INITRD_START  "linux,initrd-start"
+#define FDT_PROP_INITRD_END"linux,initrd-end"
+#define FDT_PROP_BOOTARGS  "bootargs"
+#define FDT_PROP_KASLR_SEED"kaslr-seed"
+#define FDT_PROP_RNG_SEED  "rng-seed"
+#define RNG_SEED_SIZE  128
+
+/**
+ * fdt_find_and_del_mem_rsv - delete memory reservation with given address and size
+ *
+ * @fdt:   Flattened device tree for the current kernel.
+ * @start: Starting address of the reserved memory.
+ * @size:  Size of the reserved memory.
+ *
+ * Return: 0 on success, or negative errno on error.
+ */
+static int fdt_find_and_del_mem_rsv(void *fdt, unsigned long start, unsigned long size)
+{
+   int i, ret, num_rsvs = fdt_num_mem_rsv(fdt);
+
+   for (i = 0; i < num_rsvs; i++) {
+   u64 rsv_start, rsv_size;
+
+   ret = fdt_get_mem_rsv(fdt, i, &rsv_start, &rsv_size);
+   if (ret) {
+   pr_err("Malformed device tree.\n");
+   return -EINVAL;
+   }
+
+   if (rsv_start == start && rsv_size == size) {
+   ret = fdt_del_mem_rsv(fdt, i);
+   if (ret) {
+   pr_err("Error deleting device tree reservation.\n");
+   return -EINVAL;
+   }
+
+   return 0;
+   }
+   }
+
+   return -ENOENT;
+}
+
+/*
+ * of_kexec_setup_new_fdt - modify /chosen and memory reservation for the next kernel
+ *
+ * @image: kexec image being loaded.
+ * @fdt:   Flattened device tree for the next kernel.
+ * @initrd_load_addr:  Address where the next initrd will be loaded.
+ * @initrd_len:Size of the next initrd, or 0 if there will be none.
+ * @cmdline:   Command line for the next kernel, or NULL if there will
+ * be none.
+ *
+ * Return: 0 on success, or negative errno on error.
+ */
+int of_kexec_setup_new_fdt(const struct kimage *image, void *fdt,
+  unsigned long initrd_load_addr, unsigned long initrd_len,
+  const char *cmdline)
+{
+   int ret, chosen_node;
+   const void *prop;
+
+   /* Remove memory reservation for the current device tree. */
+   ret = fdt_find_and_del_mem_rsv(fdt, __pa(initial_boot_params),
+  fdt_totalsize(initial_boot_params));
+   if (ret == -EINVAL)
+   return ret;
+
+   chosen_node = fdt_path_offset(fdt, "/chosen");
+   if (chosen_node == -FDT_ERR_NOTFOUND)
+   chosen_node = fdt_add_subnode(fdt, 

Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Naveen N. Rao
On 2021/02/04 06:38PM, Naveen N. Rao wrote:
> On 2021/02/04 04:17PM, Ravi Bangoria wrote:
> > Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per
> > ISA 3.1, a prefixed instruction should not cross a 64-byte boundary.
> > So don't allow a Uprobe on such a prefixed instruction either.
> > 
> > There are two ways a probed instruction is changed in mapped pages.
> > First, when a Uprobe is activated, it searches for all the relevant
> > pages and replaces the instruction in them. In this case, if we notice
> > that the probe is on the 2nd word of a prefixed instruction, error out
> > directly. Second, when a Uprobe is already active and a user maps a
> > relevant page via mmap(), the instruction is replaced via the mmap()
> > code path. But because the Uprobe is invalid, the entire mmap()
> > operation cannot be stopped. In this case, just print an error and
> > continue.
> > 
> > Signed-off-by: Ravi Bangoria 
> > ---
> > v1: 
> > http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com
> > v1->v2:
> >   - Instead of introducing new arch hook from verify_opcode(), use
> > existing hook arch_uprobe_analyze_insn().
> >   - Add explicit check for prefixed instruction crossing 64-byte
> > boundary. If probe is on such instruction, throw an error.
> > 
> >  arch/powerpc/kernel/uprobes.c | 66 ++-
> >  1 file changed, 65 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
> > index e8a63713e655..485d19a2a31f 100644
> > --- a/arch/powerpc/kernel/uprobes.c
> > +++ b/arch/powerpc/kernel/uprobes.c
> > @@ -7,6 +7,7 @@
> >   * Adapted from the x86 port by Ananth N Mavinakayanahalli 
> > 
> >   */
> >  #include 
> > +#include 
> >  #include 
> >  #include 
> >  #include 
> > @@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn)
> > return (is_trap(*insn));
> >  }
> >  
> > +#ifdef CONFIG_PPC64
> > +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
> > +{
> > +   struct page *page;
> > +   struct vm_area_struct *vma;
> > +   void *kaddr;
> > +   unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
> > +
> > +   if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
> > +   return -EINVAL;
> > +
> > +   kaddr = kmap_atomic(page);
> > +   *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK)));
> > +   kunmap_atomic(kaddr);
> > +   put_page(page);
> > +   return 0;
> > +}
> > +
> > +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr)
> > +{
> > +   struct ppc_inst inst;
> > +   u32 prefix, suffix;
> > +
> > +   /*
> > +* No need to check if addr is pointing to beginning of the
> > +* page. Even if probe is on a suffix of page-unaligned
> > +* prefixed instruction, hw will raise exception and kernel
> > +* will send SIGBUS.
> > +*/
> > +   if (!(addr & ~PAGE_MASK))
> > +   return 0;
> > +
> > +   if (get_instr(mm, addr, &prefix) < 0)
> > +   return -EINVAL;
> > +   if (get_instr(mm, addr + 4, &suffix) < 0)
> > +   return -EINVAL;
> > +
> > +   inst = ppc_inst_prefix(prefix, suffix);
> > +   if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) {
> > +   printk_ratelimited("Cannot register a uprobe on 64 byte "
>   ^^ pr_info_ratelimited()
> 
> It should be sufficient to check the primary opcode to determine if it 
> is a prefixed instruction. You don't have to read the suffix. I see that 
> we don't have a helper to do this currently, so you could do:
> 
>   if (ppc_inst_primary_opcode(ppc_inst(prefix)) == 1)

Seeing the kprobes code, I realized that we have to check for another 
scenario (Thanks, Jordan!). If this is the suffix of a prefix 
instruction for which a uprobe has already been installed, then the 
previous word will be a 'trap' instruction. You need to check if there 
is a uprobe at the previous word, and if the original instruction there 
was a prefix instruction.

- Naveen
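
A rough sketch of that additional check, reusing get_instr() from the
patch; uprobe_original_was_prefix() is a hypothetical helper standing in
for the uprobes-internal lookup the real fix would need:

	u32 prev;

	/*
	 * If the previous word already carries a uprobe, reading it
	 * returns the trap, not the original prefix, so the saved
	 * original instruction has to be consulted instead.
	 */
	if (get_instr(mm, addr - 4, &prev) == 0 && is_trap(prev) &&
	    uprobe_original_was_prefix(mm, addr - 4))	/* hypothetical */
		return -EINVAL;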



Re: [PATCH] vio: make remove callback return void

2021-02-04 Thread Greg Kroah-Hartman
On Wed, Jan 27, 2021 at 10:50:10PM +0100, Uwe Kleine-König wrote:
> The driver core ignores the return value of struct bus_type::remove()
> because there is little that can be done about it. To simplify the quest
> to make this function return void, let struct vio_driver::remove() return
> void, too. All users already unconditionally return 0; this commit makes
> it obvious, including to future driver authors, that returning an error
> code is a bad idea and isn't intended.
> 
> Note there are two nominally different implementations for a vio bus:
> one in arch/sparc/kernel/vio.c and the other in
> arch/powerpc/platforms/pseries/vio.c. I didn't check which driver is
> using which of these busses (or whether some of them can be used with
> both) and simply adapted all drivers and the two bus implementations in
> one go.
> 
> Note that for the powerpc implementation there is a semantic change:
> before this patch, vio_cmo_bus_remove(viodev) wasn't called for a device
> that was bound to a driver without a remove callback. As the device
> core still considers the device unbound after vio_bus_remove() returns,
> calling it unconditionally is the consistent behaviour, which is
> implemented here.
> 
> Signed-off-by: Uwe Kleine-König 
> ---
> Hello,
> 
> note that this change depends on
> https://lore.kernel.org/r/20210121062005.53271-1-...@linux.ibm.com which 
> removes
> an (ignored) return -EBUSY in drivers/net/ethernet/ibm/ibmvnic.c.
> I don't know when/if this latter patch will be applied, so it might take
> some time until my patch can go in.

Acked-by: Greg Kroah-Hartman 
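
For an individual driver the shape of the conversion is small; a
hypothetical vio driver would now look roughly like this (names are
illustrative, the types come from the powerpc vio bus):

static void example_remove(struct vio_dev *vdev)
{
	/*
	 * Tear down the device. There is no return value any more
	 * because the driver core ignored it anyway.
	 */
}

static struct vio_driver example_driver = {
	.name	= "example",
	.remove	= example_remove,
};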


Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Ananth N Mavinakayanahalli

On 2/4/21 4:19 PM, Ravi Bangoria wrote:



On 2/4/21 4:17 PM, Ravi Bangoria wrote:

Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per
ISA 3.1, a prefixed instruction should not cross a 64-byte boundary.
So don't allow a Uprobe on such a prefixed instruction either.

There are two ways a probed instruction is changed in mapped pages.
First, when a Uprobe is activated, it searches for all the relevant
pages and replaces the instruction in them. In this case, if we notice
that the probe is on the 2nd word of a prefixed instruction, error out
directly. Second, when a Uprobe is already active and a user maps a
relevant page via mmap(), the instruction is replaced via the mmap()
code path. But because the Uprobe is invalid, the entire mmap()
operation cannot be stopped. In this case, just print an error and
continue.


@mpe,

arch_uprobe_analyze_insn() can return early if
cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that will
miss out a rare scenario of user running binary with prefixed
instruction on p10 predecessors. Please let me know if I
should add cpu_has_feature(CPU_FTR_ARCH_31) or not.


Wouldn't that binary get a SIGILL in any case? I concur with Naveen...
it makes sense to add the check.


--
Ananth


Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Naveen N. Rao
On 2021/02/04 04:19PM, Ravi Bangoria wrote:
> 
> 
> On 2/4/21 4:17 PM, Ravi Bangoria wrote:
> > Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per
> > ISA 3.1, a prefixed instruction should not cross a 64-byte boundary.
> > So don't allow a Uprobe on such a prefixed instruction either.
> > 
> > There are two ways a probed instruction is changed in mapped pages.
> > First, when a Uprobe is activated, it searches for all the relevant
> > pages and replaces the instruction in them. In this case, if we notice
> > that the probe is on the 2nd word of a prefixed instruction, error out
> > directly. Second, when a Uprobe is already active and a user maps a
> > relevant page via mmap(), the instruction is replaced via the mmap()
> > code path. But because the Uprobe is invalid, the entire mmap()
> > operation cannot be stopped. In this case, just print an error and
> > continue.
> 
> @mpe,
> 
> arch_uprobe_analyze_insn() can return early if
> cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that will
> miss out a rare scenario of user running binary with prefixed
> instruction on p10 predecessors. Please let me know if I
> should add cpu_has_feature(CPU_FTR_ARCH_31) or not.

The check you are adding is very specific to prefixed instructions, so 
it makes sense to add a cpu feature check for v3.1.

On older processors, those are invalid instructions like any other. The 
instruction emulation infrastructure will refuse to emulate it and the 
instruction will be single stepped.

- Naveen


Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Naveen N. Rao
On 2021/02/04 04:17PM, Ravi Bangoria wrote:
> Don't allow a Uprobe on the 2nd word of a prefixed instruction. As per
> ISA 3.1, a prefixed instruction should not cross a 64-byte boundary.
> So don't allow a Uprobe on such a prefixed instruction either.
> 
> There are two ways a probed instruction is changed in mapped pages.
> First, when a Uprobe is activated, it searches for all the relevant
> pages and replaces the instruction in them. In this case, if we notice
> that the probe is on the 2nd word of a prefixed instruction, error out
> directly. Second, when a Uprobe is already active and a user maps a
> relevant page via mmap(), the instruction is replaced via the mmap()
> code path. But because the Uprobe is invalid, the entire mmap()
> operation cannot be stopped. In this case, just print an error and
> continue.
> 
> Signed-off-by: Ravi Bangoria 
> ---
> v1: 
> http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com
> v1->v2:
>   - Instead of introducing new arch hook from verify_opcode(), use
> existing hook arch_uprobe_analyze_insn().
>   - Add explicit check for prefixed instruction crossing 64-byte
> boundary. If probe is on such instruction, throw an error.
> 
>  arch/powerpc/kernel/uprobes.c | 66 ++-
>  1 file changed, 65 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
> index e8a63713e655..485d19a2a31f 100644
> --- a/arch/powerpc/kernel/uprobes.c
> +++ b/arch/powerpc/kernel/uprobes.c
> @@ -7,6 +7,7 @@
>   * Adapted from the x86 port by Ananth N Mavinakayanahalli 
> 
>   */
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn)
>   return (is_trap(*insn));
>  }
>  
> +#ifdef CONFIG_PPC64
> +static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
> +{
> + struct page *page;
> + struct vm_area_struct *vma;
> + void *kaddr;
> + unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
> +
> + if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
> + return -EINVAL;
> +
> + kaddr = kmap_atomic(page);
> + *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK)));
> + kunmap_atomic(kaddr);
> + put_page(page);
> + return 0;
> +}
> +
> +static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr)
> +{
> + struct ppc_inst inst;
> + u32 prefix, suffix;
> +
> + /*
> +  * No need to check if addr is pointing to beginning of the
> +  * page. Even if probe is on a suffix of page-unaligned
> +  * prefixed instruction, hw will raise exception and kernel
> +  * will send SIGBUS.
> +  */
> + if (!(addr & ~PAGE_MASK))
> + return 0;
> +
> + if (get_instr(mm, addr, &prefix) < 0)
> + return -EINVAL;
> + if (get_instr(mm, addr + 4, &suffix) < 0)
> + return -EINVAL;
> +
> + inst = ppc_inst_prefix(prefix, suffix);
> + if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) {
> + printk_ratelimited("Cannot register a uprobe on 64 byte "
^^ pr_info_ratelimited()

It should be sufficient to check the primary opcode to determine if it 
is a prefixed instruction. You don't have to read the suffix. I see that 
we don't have a helper to do this currently, so you could do:

if (ppc_inst_primary_opcode(ppc_inst(prefix)) == 1)

In the future, it might be worthwhile to add IS_PREFIX() as a macro 
similar to IS_MTMSRD() if there are more such uses.
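
Such a macro could be as small as (hypothetical, merely naming the
check above; primary opcode 1 is what marks a prefix word in ISA 3.1):

	#define IS_PREFIX(instr)	(ppc_inst_primary_opcode(instr) == 1)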

Along with this, if you also add the below to the start of this 
function, you can get rid of the #ifdef:

if (!IS_ENABLED(CONFIG_PPC64))
return 0;


- Naveen



[PATCH kernel v3] powerpc/uaccess: Skip might_fault() when user access is enabled

2021-02-04 Thread Alexey Kardashevskiy
The amount of code executed with user space access enabled (unlocked KUAP)
should be minimal. However, with CONFIG_PROVE_LOCKING or
CONFIG_DEBUG_ATOMIC_SLEEP enabled, might_fault() may end up replaying
interrupts which in turn may access user space and forget to restore
the KUAP state.

The problem places are:
1. strncpy_from_user (and similar) which unlock KUAP and call
unsafe_get_user -> __get_user_allowed -> __get_user_nocheck()
with do_allow=false to skip KUAP as the caller took care of it.
2. __put_user_nocheck_goto() which is called with unlocked KUAP.

This changes __get_user_nocheck() to look at @do_allow to decide whether
to skip might_fault(). Since strncpy_from_user/etc call might_fault()
anyway before unlocking KUAP, there should be no visible change.

This drops might_fault() in __put_user_nocheck_goto() as it is only
called from unsafe_xxx helpers which manage KUAP themselves.

Since keeping might_fault() is still desirable, this adds those calls
to user_access_begin/read/write, which is the last point where
we can safely do so.

Fixes: 334710b1496a ("powerpc/uaccess: Implement unsafe_put_user() using 'asm 
goto'")
Signed-off-by: Alexey Kardashevskiy 
---
Changes:
v3:
* removed might_fault() from __put_user_nocheck_goto
* added might_fault() to user(_|_read_|_write_)access_begin

v2:
* s/!do_allow/do_allow/

---

Here is more detail about the issue:
https://lore.kernel.org/linuxppc-dev/20210203084503.gx6...@kitsune.suse.cz/T/

Another example of the problem:

Kernel attempted to write user page (22c3) - exploit attempt? (uid: 0)
[ cut here ]
Bug: Write fault blocked by KUAP!
WARNING: CPU: 1 PID: 16712 at 
/home/aik/p/kernel-syzkaller/arch/powerpc/mm/fault.c:229 
__do_page_fault+0xca4/0xf10

NIP [c06ff804] filldir64+0x484/0x820
LR [c06ff7fc] filldir64+0x47c/0x820
--- interrupt: 300
[c000589f3b40] [c08131b0] proc_fill_cache+0xf0/0x2b0
[c000589f3c60] [c0814658] proc_pident_readdir+0x1f8/0x390
[c000589f3cc0] [c06fd8e8] iterate_dir+0x108/0x370
[c000589f3d20] [c06fe3d8] sys_getdents64+0xa8/0x410
[c000589f3db0] [c004b708] system_call_exception+0x178/0x2b0
[c000589f3e10] [c000e060] system_call_common+0xf0/0x27c
---
 arch/powerpc/include/asm/uaccess.h | 10 +++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/uaccess.h 
b/arch/powerpc/include/asm/uaccess.h
index 501c9a79038c..a789601998d3 100644
--- a/arch/powerpc/include/asm/uaccess.h
+++ b/arch/powerpc/include/asm/uaccess.h
@@ -216,8 +216,6 @@ do {								\
 #define __put_user_nocheck_goto(x, ptr, size, label)   \
 do {   \
__typeof__(*(ptr)) __user *__pu_addr = (ptr);   \
-   if (!is_kernel_addr((unsigned long)__pu_addr))  \
-   might_fault();  \
__chk_user_ptr(ptr);\
__put_user_size_goto((x), __pu_addr, (size), label);\
 } while (0)
@@ -313,7 +311,7 @@ do {								\
	__typeof__(size) __gu_size = (size);			\
								\
	__chk_user_ptr(__gu_addr);				\
-	if (!is_kernel_addr((unsigned long)__gu_addr))		\
+	if (do_allow && !is_kernel_addr((unsigned long)__gu_addr)) \
		might_fault();					\
	barrier_nospec();					\
	if (do_allow)						\
@@ -508,6 +506,8 @@ static __must_check inline bool user_access_begin(const void __user *ptr, size_t
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_read_write_user((void __user *)ptr, ptr, len);
return true;
 }
@@ -521,6 +521,8 @@ user_read_access_begin(const void __user *ptr, size_t len)
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_read_from_user(ptr, len);
return true;
 }
@@ -532,6 +534,8 @@ user_write_access_begin(const void __user *ptr, size_t len)
 {
if (unlikely(!access_ok(ptr, len)))
return false;
+   if (!is_kernel_addr((unsigned long)ptr))
+   might_fault();
allow_write_to_user((void __user *)ptr, len);
return true;
 }
-- 
2.17.1
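
For reference, the calling pattern being hardened looks roughly like
this (a sketch of a hypothetical caller; strncpy_from_user() and
friends follow the same shape):

	static int read_user_flag(int __user *uaddr, int *val)
	{
		if (!user_read_access_begin(uaddr, sizeof(*uaddr)))
			return -EFAULT;
		/* KUAP is open from here: no might_fault(), no replay */
		unsafe_get_user(*val, uaddr, Efault);
		user_read_access_end();
		return 0;
	Efault:
		user_read_access_end();
		return -EFAULT;
	}

With the patch, the might_fault() debug check runs once inside
user_read_access_begin(), and nothing within the open window can
trigger it.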



Re: [PATCH 3/3] tools/perf: Add perf tools support to expose Performance Monitor Counter SPRs as part of extended regs

2021-02-04 Thread Athira Rajeev



> On 03-Feb-2021, at 9:55 PM, Arnaldo Carvalho de Melo  wrote:
> 
> Em Wed, Feb 03, 2021 at 01:55:37AM -0500, Athira Rajeev escreveu:
>> To enable presenting of Performance Monitor Counter Registers
>> (PMC1 to PMC6) as part of extended registers, this patch adds them
>> to sample_reg_mask on the tool side (to use with the -I? option).
>> 
>> Simplified the PERF_REG_PMU_MASK_300/31 definition. Excluded the
>> unsupported SPRs (MMCR3, SIER2, SIER3) from extended mask value for
>> CPU_FTR_ARCH_300.
> 
> Applied just 3/3, the tooling part, to my local branch, please holler if
> I should wait a bit more.
> 
> - Arnaldo
> 

Thanks Arnaldo for taking the tool side changes.

Athira.

>> Signed-off-by: Athira Rajeev 
>> ---
>> tools/arch/powerpc/include/uapi/asm/perf_regs.h | 28 
>> +++--
>> tools/perf/arch/powerpc/include/perf_regs.h |  6 ++
>> tools/perf/arch/powerpc/util/perf_regs.c|  6 ++
>> 3 files changed, 34 insertions(+), 6 deletions(-)
>> 
>> diff --git a/tools/arch/powerpc/include/uapi/asm/perf_regs.h 
>> b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> index bdf5f10f8b9f..578b3ee86105 100644
>> --- a/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> +++ b/tools/arch/powerpc/include/uapi/asm/perf_regs.h
>> @@ -55,17 +55,33 @@ enum perf_event_powerpc_regs {
>>  PERF_REG_POWERPC_MMCR3,
>>  PERF_REG_POWERPC_SIER2,
>>  PERF_REG_POWERPC_SIER3,
>> +PERF_REG_POWERPC_PMC1,
>> +PERF_REG_POWERPC_PMC2,
>> +PERF_REG_POWERPC_PMC3,
>> +PERF_REG_POWERPC_PMC4,
>> +PERF_REG_POWERPC_PMC5,
>> +PERF_REG_POWERPC_PMC6,
>>  /* Max regs without the extended regs */
>>  PERF_REG_POWERPC_MAX = PERF_REG_POWERPC_MMCRA + 1,
>> };
>> 
>> #define PERF_REG_PMU_MASK	((1ULL << PERF_REG_POWERPC_MAX) - 1)
>> 
>> -/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300 */
>> -#define PERF_REG_PMU_MASK_300   (((1ULL << (PERF_REG_POWERPC_MMCR2 + 1)) - 
>> 1) - PERF_REG_PMU_MASK)
>> -/* PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31 */
>> -#define PERF_REG_PMU_MASK_31   (((1ULL << (PERF_REG_POWERPC_SIER3 + 1)) - 
>> 1) - PERF_REG_PMU_MASK)
>> +/* Exclude MMCR3, SIER2, SIER3 for CPU_FTR_ARCH_300 */
>> +#define PERF_EXCLUDE_REG_EXT_300(7ULL << PERF_REG_POWERPC_MMCR3)
>> 
>> -#define PERF_REG_MAX_ISA_300   (PERF_REG_POWERPC_MMCR2 + 1)
>> -#define PERF_REG_MAX_ISA_31(PERF_REG_POWERPC_SIER3 + 1)
>> +/*
>> + * PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_300
>> + * includes 9 SPRS from MMCR0 to PMC6 excluding the
>> + * unsupported SPRS in PERF_EXCLUDE_REG_EXT_300.
>> + */
>> +#define PERF_REG_PMU_MASK_300   ((0xfffULL << PERF_REG_POWERPC_MMCR0) - 
>> PERF_EXCLUDE_REG_EXT_300)
>> +
>> +/*
>> + * PERF_REG_EXTENDED_MASK value for CPU_FTR_ARCH_31
>> + * includes 12 SPRs from MMCR0 to PMC6.
>> + */
>> +#define PERF_REG_PMU_MASK_31   (0xfffULL << PERF_REG_POWERPC_MMCR0)
>> +
>> +#define PERF_REG_EXTENDED_MAX  (PERF_REG_POWERPC_PMC6 + 1)
>> #endif /* _UAPI_ASM_POWERPC_PERF_REGS_H */
>> diff --git a/tools/perf/arch/powerpc/include/perf_regs.h 
>> b/tools/perf/arch/powerpc/include/perf_regs.h
>> index 63f3ac91049f..98b6f9eabfc3 100644
>> --- a/tools/perf/arch/powerpc/include/perf_regs.h
>> +++ b/tools/perf/arch/powerpc/include/perf_regs.h
>> @@ -71,6 +71,12 @@
>>  [PERF_REG_POWERPC_MMCR3] = "mmcr3",
>>  [PERF_REG_POWERPC_SIER2] = "sier2",
>>  [PERF_REG_POWERPC_SIER3] = "sier3",
>> +[PERF_REG_POWERPC_PMC1] = "pmc1",
>> +[PERF_REG_POWERPC_PMC2] = "pmc2",
>> +[PERF_REG_POWERPC_PMC3] = "pmc3",
>> +[PERF_REG_POWERPC_PMC4] = "pmc4",
>> +[PERF_REG_POWERPC_PMC5] = "pmc5",
>> +[PERF_REG_POWERPC_PMC6] = "pmc6",
>> };
>> 
>> static inline const char *perf_reg_name(int id)
>> diff --git a/tools/perf/arch/powerpc/util/perf_regs.c 
>> b/tools/perf/arch/powerpc/util/perf_regs.c
>> index 2b6d4704e3aa..8116a253f91f 100644
>> --- a/tools/perf/arch/powerpc/util/perf_regs.c
>> +++ b/tools/perf/arch/powerpc/util/perf_regs.c
>> @@ -68,6 +68,12 @@
>>  SMPL_REG(mmcr3, PERF_REG_POWERPC_MMCR3),
>>  SMPL_REG(sier2, PERF_REG_POWERPC_SIER2),
>>  SMPL_REG(sier3, PERF_REG_POWERPC_SIER3),
>> +SMPL_REG(pmc1, PERF_REG_POWERPC_PMC1),
>> +SMPL_REG(pmc2, PERF_REG_POWERPC_PMC2),
>> +SMPL_REG(pmc3, PERF_REG_POWERPC_PMC3),
>> +SMPL_REG(pmc4, PERF_REG_POWERPC_PMC4),
>> +SMPL_REG(pmc5, PERF_REG_POWERPC_PMC5),
>> +SMPL_REG(pmc6, PERF_REG_POWERPC_PMC6),
>>  SMPL_REG_END
>> };
>> 
>> -- 
>> 1.8.3.1
>> 
> 
> -- 
> 
> - Arnaldo
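
For reference, the mask arithmetic in the first hunk: 0xfffULL <<
PERF_REG_POWERPC_MMCR0 sets twelve contiguous bits, one per SPR from
MMCR0 through PMC6, and PERF_EXCLUDE_REG_EXT_300 (7ULL <<
PERF_REG_POWERPC_MMCR3) covers the three bits for MMCR3, SIER2 and
SIER3, so the subtraction leaves the 12 - 3 = 9 SPRs supported on ISA
v3.0, matching the comments in the diff.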



Re: [PATCH] tools/perf: Fix powerpc gap between kernel end and module start

2021-02-04 Thread Athira Rajeev



> On 03-Feb-2021, at 9:01 PM, Arnaldo Carvalho de Melo  wrote:
> 
> Thanks, collected the Tested-by from Kajol and the Acked-by from Jiri
> and applied to my local tree for testing, then up to my perf/core
> branch.
> 
> - Arnaldo

Thanks Arnaldo for taking this fix.





Re: [PATCH RFC v1 2/6] swiotlb: convert variables to arrays

2021-02-04 Thread Robin Murphy

On 2021-02-04 07:29, Christoph Hellwig wrote:

On Wed, Feb 03, 2021 at 03:37:05PM -0800, Dongli Zhang wrote:

This patch converts several swiotlb related variables to arrays, in
order to maintain stat/status for different swiotlb buffers. Here are
variables involved:

- io_tlb_start and io_tlb_end
- io_tlb_nslabs and io_tlb_used
- io_tlb_list
- io_tlb_index
- max_segment
- io_tlb_orig_addr
- no_iotlb_memory

There is no functional change and this is to prepare to enable 64-bit
swiotlb.


Claire Chang (on Cc) already posted a patch like this a month ago,
which looks much better because it actually uses a struct instead
of all the random variables.


Indeed, I skimmed the cover letter and immediately thought that this 
whole thing is just the restricted DMA pool concept[1] again, only from 
a slightly different angle.


Robin.

[1] 
https://lore.kernel.org/linux-iommu/20210106034124.30560-1-tien...@chromium.org/
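
For anyone who has not seen that series: the struct-based shape being
preferred there groups the globals along these lines (a sketch of the
concept, field names illustrative rather than quoted from the patch):

	struct io_tlb_mem {
		phys_addr_t start;	/* bounce buffer bounds */
		phys_addr_t end;
		unsigned long nslabs;
		unsigned long used;
		unsigned int index;
		unsigned int *list;	/* free-slot bookkeeping */
		phys_addr_t *orig_addr;	/* bounce origin, per slot */
	};

Each pool (32-bit, 64-bit, restricted) can then be an instance of the
struct instead of one more column in a set of parallel arrays.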


Re: [PATCH v7 39/42] powerpc: move NMI entry/exit code into wrapper

2021-02-04 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of February 4, 2021 8:15 pm:
> Nicholas Piggin  writes:
>> This moves the common NMI entry and exit code into the interrupt handler
>> wrappers.
>>
>> This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and
>> also MCE interrupts on 64e, by adding missing parts of the NMI entry to
>> them.
>>
>> Signed-off-by: Nicholas Piggin 
>> ---
>>  arch/powerpc/include/asm/interrupt.h | 28 ++
>>  arch/powerpc/kernel/mce.c| 11 -
>>  arch/powerpc/kernel/traps.c  | 35 +---
>>  arch/powerpc/kernel/watchdog.c   | 10 
>>  4 files changed, 38 insertions(+), 46 deletions(-)
> 
> This is unhappy when injecting SLB multi-hits:
> 
>   root@p86-2:~# echo PPC_SLB_MULTIHIT > /sys/kernel/debug/provoke-crash/DIRECT
>   [  312.496026][ T1344] kernel BUG at 
> arch/powerpc/include/asm/interrupt.h:152!
>   [  312.496037][ T1344] Oops: Exception in kernel mode, sig: 5 [#1]
>   [  312.496045][ T1344] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA 
> pSeries

pseries hash. Blast!

> 147 static inline void interrupt_nmi_exit_prepare(struct pt_regs *regs, 
> struct interrupt_nmi_state *state)
> 148 {
> 149   if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64) ||
> 150   !firmware_has_feature(FW_FEATURE_LPAR) ||
> 151   radix_enabled() || (mfmsr() & MSR_DR))
> 152   nmi_exit();
> 
> 
> So presumably it's:
> 
> #define __nmi_exit()  \
>   do {\
>   BUG_ON(!in_nmi());  \

Yes that would be it, pseries machine check enables MMU half way through 
so only one side of this triggers.

The MSR_DR check is supposed to catch the other NMIs that run with MMU 
on (perf, watchdog, etc). Suppose it could test TRAP(regs) explicitly
although I wonder if we should also do this to keep things balanced

diff --git a/arch/powerpc/platforms/pseries/ras.c 
b/arch/powerpc/platforms/pseries/ras.c
index 149cec2212e6..f57ca0c570be 100644
--- a/arch/powerpc/platforms/pseries/ras.c
+++ b/arch/powerpc/platforms/pseries/ras.c
@@ -719,6 +719,7 @@ static int mce_handle_err_virtmode(struct pt_regs *regs,
 
 static int mce_handle_error(struct pt_regs *regs, struct rtas_error_log *errp)
 {
+   unsigned long msr;
struct pseries_errorlog *pseries_log;
struct pseries_mc_errorlog *mce_log = NULL;
int disposition = rtas_error_disposition(errp);
@@ -747,9 +748,12 @@ static int mce_handle_error(struct pt_regs *regs, struct 
rtas_error_log *errp)
 *   SLB multihit is done by now.
 */
 out:
-   mtmsr(mfmsr() | MSR_IR | MSR_DR);
+   msr = mfmsr();
+   mtmsr(msr | MSR_IR | MSR_DR);
disposition = mce_handle_err_virtmode(regs, errp, mce_log,
  disposition);
+   mtmsr(msr);
+
return disposition;
 }
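
(Restoring the MSR that way means the MSR_DR test in
interrupt_nmi_exit_prepare() sees the same translation state the entry
side saw, so nmi_enter()/nmi_exit() stay paired even though
mce_handle_err_virtmode() ran with the MMU on.)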
 



[PATCH] powerpc/kexec_file: fix FDT size estimation for kdump kernel

2021-02-04 Thread Hari Bathini
On systems with large amount of memory, loading kdump kernel through
kexec_file_load syscall may fail with the below error:

"Failed to update fdt with linux,drconf-usable-memory property"

This happens because the size estimation for kdump kernel's FDT does
not account for the additional space needed to setup usable memory
properties. Fix it by accounting for the space needed to include
linux,usable-memory & linux,drconf-usable-memory properties while
estimating kdump kernel's FDT size.
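
As a rough feel for the numbers (sizes picked for illustration, not
taken from a report): with 1 TiB of memory, a 256 MB LMB and a 4 GB
crashkernel region,

    usm_entries = (1 TiB / 256 MB) + 2 * (4 GB / 256 MB)
                = 4096 + 32 = 4128

i.e. about 4128 * 8 = ~33 KB of additional FDT space, which can easily
exceed the old "2 * fdt_totalsize(initial_boot_params)" estimate on
such systems.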

Fixes: 6ecd0163d360 ("powerpc/kexec_file: Add appropriate regions for memory 
reserve map")
Cc: sta...@vger.kernel.org
Signed-off-by: Hari Bathini 
---
 arch/powerpc/include/asm/kexec.h  |1 +
 arch/powerpc/kexec/elf_64.c   |2 +-
 arch/powerpc/kexec/file_load_64.c |   34 ++
 3 files changed, 36 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kexec.h b/arch/powerpc/include/asm/kexec.h
index 55d6ede30c19..9ab344d29a54 100644
--- a/arch/powerpc/include/asm/kexec.h
+++ b/arch/powerpc/include/asm/kexec.h
@@ -136,6 +136,7 @@ int load_crashdump_segments_ppc64(struct kimage *image,
 int setup_purgatory_ppc64(struct kimage *image, const void *slave_code,
  const void *fdt, unsigned long kernel_load_addr,
  unsigned long fdt_load_addr);
+unsigned int kexec_fdt_totalsize_ppc64(struct kimage *image);
 int setup_new_fdt_ppc64(const struct kimage *image, void *fdt,
unsigned long initrd_load_addr,
unsigned long initrd_len, const char *cmdline);
diff --git a/arch/powerpc/kexec/elf_64.c b/arch/powerpc/kexec/elf_64.c
index d0e459bb2f05..9842e33533df 100644
--- a/arch/powerpc/kexec/elf_64.c
+++ b/arch/powerpc/kexec/elf_64.c
@@ -102,7 +102,7 @@ static void *elf64_load(struct kimage *image, char 
*kernel_buf,
pr_debug("Loaded initrd at 0x%lx\n", initrd_load_addr);
}
 
-   fdt_size = fdt_totalsize(initial_boot_params) * 2;
+   fdt_size = kexec_fdt_totalsize_ppc64(image);
fdt = kmalloc(fdt_size, GFP_KERNEL);
if (!fdt) {
pr_err("Not enough memory for the device tree.\n");
diff --git a/arch/powerpc/kexec/file_load_64.c 
b/arch/powerpc/kexec/file_load_64.c
index c69bcf9b547a..67fa7bfcfa30 100644
--- a/arch/powerpc/kexec/file_load_64.c
+++ b/arch/powerpc/kexec/file_load_64.c
@@ -21,6 +21,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -925,6 +926,39 @@ int setup_purgatory_ppc64(struct kimage *image, const void 
*slave_code,
return ret;
 }
 
+/**
+ * kexec_fdt_totalsize_ppc64 - Return the estimated size needed to setup FDT
+ * for kexec/kdump kernel.
+ * @image: kexec image being loaded.
+ *
+ * Returns the estimated size needed for kexec/kdump kernel FDT.
+ */
+unsigned int kexec_fdt_totalsize_ppc64(struct kimage *image)
+{
+   unsigned int fdt_size;
+   uint64_t usm_entries;
+
+   /*
+* The below estimate more than accounts for a typical kexec case where
+* the additional space is to accommodate things like kexec cmdline,
+* chosen node with properties for initrd start & end addresses and
+* a property to indicate kexec boot..
+*/
+   fdt_size = fdt_totalsize(initial_boot_params) + (2 * COMMAND_LINE_SIZE);
+   if (image->type != KEXEC_TYPE_CRASH)
+   return fdt_size;
+
+   /*
+* For kdump kernel, also account for linux,usable-memory and
+* linux,drconf-usable-memory properties. Get an approximate on the
+* number of usable memory entries and use for FDT size estimation.
+*/
+   usm_entries = ((memblock_end_of_DRAM() / drmem_lmb_size()) +
+		       (2 * (resource_size(&crashk_res) / drmem_lmb_size())));
+   fdt_size += (unsigned int)(usm_entries * sizeof(uint64_t));
+   return fdt_size;
+}
+
 /**
  * setup_new_fdt_ppc64 - Update the flattend device-tree of the kernel
  *   being loaded.




Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-04 Thread Nicholas Piggin
Excerpts from Michael Ellerman's message of February 4, 2021 9:13 pm:
> Nicholas Piggin  writes:
>> Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm:
>>> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
 Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
 > "Christopher M. Riedl"  writes:
 >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
 >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
 >> used for the first GPR is incorrect and overwrites the back chain - the
 >> Protected Zone actually starts below the current SP. In practice this is
 >> probably not an issue, but it's still incorrect so fix it.
 > 
 > Nice catch.
 > 
 > Corrupting the back chain means you can't backtrace from there, which
 > could be confusing for debugging one day.

 Yeah, we seem to have got away without noticing because the CPU will
 wake up and return out of here before it tries to unwind the stack,
 but if you tried to walk it by hand if the CPU got stuck in idle or
 something, then we'd get confused.

 > It does make me wonder why we don't just create a stack frame and use
 > the normal macros? It would use a bit more stack space, but we shouldn't
 > be short of stack space when going idle.
 > 
 > Nick, was there a particular reason for using the red zone?

 I don't recall a particular reason, I think a normal stack frame is
 probably a good idea.
>>> 
>>> I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
>>> stack frame :)
>>> 
>>
>> I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can 
>> be saved in the caller's frame so that should be okay.
> 
> TBH I didn't know/had forgotten we had STACKFRAMESIZE.
> 
> The safest is SWITCH_FRAME_SIZE, because it's calculated based on
> pt_regs (which can change size):
> 
>   DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct 
> pt_regs));
> 
> But the name is obviously terrible for your usage, and it will allocate
> a bit more space than you need, because pt_regs has more than just the GPRs.

I don't see why that's safer if we're not using pt_regs.

> 
>>> I admit I am a bit confused when I saw the similar but much smaller
>>> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore
>>> a few registers.
>>
>> Yeah if you don't need to save all nvgprs you can use caller's frame 
>> plus a few bytes in the minimum frame as volatile storage.
>>
>> STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a 
>> lot of asm uses it and hasn't necessarily been audited to make sure it's 
>> not assuming it's bigger.
> 
> Yeah it's a total mess. See this ~3.5 year old issue :/
> 
> https://github.com/linuxppc/issues/issues/113
> 
>> You could actually use STACK_FRAME_MIN_SIZE for new code, maybe we add
>> a STACK_FRAME_MIN_NVGPR_SIZE to match and use that.
> 
> Something like that makes me nervous because it's so easy to
> accidentally use one of the macros that expects a full pt_regs.
> 
> I think ideally you have just two options.
> 
> Option 1:
> 
> You use:
>   STACK_FRAME_WITH_PT_REGS = STACK_FRAME_MIN_SIZE + sizeof(struct pt_regs)
> 
> And then you can use all the macros in asm-offsets.c generated with
> STACK_PT_REGS_OFFSET.

I don't see a good reason to use pt_regs here, but in general sure this 
would be good to have and begin using.

> Option 2:
> 
> If you want to be fancy you manually allocate your frame with just
> the right amount of space, but with the size there in the code, so for
> example the idle code that wants 19 GPRs would do:
> 
>   stdu	r1, -(STACK_FRAME_MIN_SIZE + 8 * 19)(r1)
> 
> And then it's very obvious that you have a non-standard frame size and
> need to be more careful.

I would be happy with this for the idle code.
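
(Concretely: 0(r1) holds the back chain, so a red-zone save must start
below the stack pointer, e.g. std r14, -8(r1), within the 288 bytes of
ELFv2 protected zone; saving the first GPR at a non-negative offset is
what clobbered the back chain here.)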

Thanks,
Nick


Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-04 Thread Michael Ellerman
Nicholas Piggin  writes:
> Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm:
>> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
>>> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
>>> > "Christopher M. Riedl"  writes:
>>> >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
>>> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
>>> >> used for the first GPR is incorrect and overwrites the back chain - the
>>> >> Protected Zone actually starts below the current SP. In practice this is
>>> >> probably not an issue, but it's still incorrect so fix it.
>>> > 
>>> > Nice catch.
>>> > 
>>> > Corrupting the back chain means you can't backtrace from there, which
>>> > could be confusing for debugging one day.
>>>
>>> Yeah, we seem to have got away without noticing because the CPU will
>>> wake up and return out of here before it tries to unwind the stack,
>>> but if you tried to walk it by hand if the CPU got stuck in idle or
>>> something, then we'd get confused.
>>>
>>> > It does make me wonder why we don't just create a stack frame and use
>>> > the normal macros? It would use a bit more stack space, but we shouldn't
>>> > be short of stack space when going idle.
>>> > 
>>> > Nick, was there a particular reason for using the red zone?
>>>
>>> I don't recall a particular reason, I think a normal stack frame is
>>> probably a good idea.
>> 
>> I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
>> stack frame :)
>> 
>
> I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can 
> be saved in the caller's frame so that should be okay.

TBH I didn't know/had forgotten we had STACKFRAMESIZE.

The safest is SWITCH_FRAME_SIZE, because it's calculated based on
pt_regs (which can change size):

DEFINE(SWITCH_FRAME_SIZE, STACK_FRAME_OVERHEAD + sizeof(struct 
pt_regs));

But the name is obviously terrible for your usage, and it will allocate
a bit more space than you need, because pt_regs has more than just the GPRs.

>> I admit I am a bit confused when I saw the similar but much smaller
>> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore
>> a few registers.
>
> Yeah if you don't need to save all nvgprs you can use caller's frame 
> plus a few bytes in the minimum frame as volatile storage.
>
> STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a 
> lot of asm uses it and hasn't necessarily been audited to make sure it's 
> not assuming it's bigger.

Yeah it's a total mess. See this ~3.5 year old issue :/

https://github.com/linuxppc/issues/issues/113

> You could actually use STACK_FRAME_MIN_SIZE for new code, maybe we add
> a STACK_FRAME_MIN_NVGPR_SIZE to match and use that.

Something like that makes me nervous because it's so easy to
accidentally use one of the macros that expects a full pt_regs.

I think ideally you have just two options.

Option 1:

You use:
  STACK_FRAME_WITH_PT_REGS = STACK_FRAME_MIN_SIZE + sizeof(struct pt_regs)

And then you can use all the macros in asm-offsets.c generated with
STACK_PT_REGS_OFFSET.


Option 2:

If you want to be fancy you manually allocate your frame with just
the right amount of space, but with the size there in the code, so for
example the idle code that wants 19 GPRs would do:

	stdu	r1, -(STACK_FRAME_MIN_SIZE + 8 * 19)(r1)

And then it's very obvious that you have a non-standard frame size and
need to be more careful.

cheers


[PATCH 18/20] crypto: nx: nx_debugfs: Header comments should not be kernel-doc

2021-02-04 Thread Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/crypto/nx/nx_debugfs.c:34: warning: Function parameter or member 'drv' 
not described in 'nx_debugfs_init'
 drivers/crypto/nx/nx_debugfs.c:34: warning: expecting prototype for Nest 
Accelerators driver(). Prototype was for nx_debugfs_init() instead

Cc: "Breno Leitão" 
Cc: Nayna Jain 
Cc: Paulo Flabiano Smorigo 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Kent Yoder 
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/crypto/nx/nx_debugfs.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx_debugfs.c b/drivers/crypto/nx/nx_debugfs.c
index 1975bcbee9974..ee7cd88bb10a7 100644
--- a/drivers/crypto/nx/nx_debugfs.c
+++ b/drivers/crypto/nx/nx_debugfs.c
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/**
+/*
  * debugfs routines supporting the Power 7+ Nest Accelerators driver
  *
  * Copyright (C) 2011-2012 International Business Machines Inc.
-- 
2.25.1
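
The rule behind this batch of fixes, for context: scripts/kernel-doc
treats any comment opening with "/**" as a kernel-doc header and then
expects a prototype plus @param lines, so the two forms are (sketch,
parameter description invented):

	/**
	 * nx_debugfs_init() - set up debugfs entries for the driver
	 * @drv: the driver instance being instrumented
	 */

	/*
	 * Plain boilerplate, such as a file-top copyright block, must
	 * open with "/*" so the script leaves it alone.
	 */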



[PATCH 16/20] crypto: vmx: Source headers are not good kernel-doc candidates

2021-02-04 Thread Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/crypto/vmx/vmx.c:23: warning: expecting prototype for Routines 
supporting VMX instructions on the Power 8(). Prototype was for p8_init() 
instead

Cc: "Breno Leitão" 
Cc: Nayna Jain 
Cc: Paulo Flabiano Smorigo 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Henrique Cerri 
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/crypto/vmx/vmx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/vmx/vmx.c b/drivers/crypto/vmx/vmx.c
index a40d08e75fc0b..7eb713cc87c8c 100644
--- a/drivers/crypto/vmx/vmx.c
+++ b/drivers/crypto/vmx/vmx.c
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/**
+/*
  * Routines supporting VMX instructions on the Power 8
  *
  * Copyright (C) 2015 International Business Machines Inc.
-- 
2.25.1



[PATCH 17/20] crypto: nx: nx-aes-cbc: Headers comments should not be kernel-doc

2021-02-04 Thread Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 'tfm' 
not described in 'cbc_aes_nx_set_key'
 drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 
'in_key' not described in 'cbc_aes_nx_set_key'
 drivers/crypto/nx/nx-aes-cbc.c:24: warning: Function parameter or member 
'key_len' not described in 'cbc_aes_nx_set_key'
 drivers/crypto/nx/nx-aes-cbc.c:24: warning: expecting prototype for Nest 
Accelerators driver(). Prototype was for cbc_aes_nx_set_key() instead

Cc: "Breno Leitão" 
Cc: Nayna Jain 
Cc: Paulo Flabiano Smorigo 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Kent Yoder 
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/crypto/nx/nx-aes-cbc.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/nx/nx-aes-cbc.c b/drivers/crypto/nx/nx-aes-cbc.c
index 92e921eceed75..d6314ea9ae896 100644
--- a/drivers/crypto/nx/nx-aes-cbc.c
+++ b/drivers/crypto/nx/nx-aes-cbc.c
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/**
+/*
  * AES CBC routines supporting the Power 7+ Nest Accelerators driver
  *
  * Copyright (C) 2011-2012 International Business Machines Inc.
-- 
2.25.1



[PATCH 19/20] crypto: nx: Demote header comment and add description for 'nbytes'

2021-02-04 Thread Lee Jones
Fixes the following W=1 kernel build warning(s):

 drivers/crypto/nx/nx.c:31: warning: Incorrect use of kernel-doc format:  * 
nx_hcall_sync - make an H_COP_OP hcall for the passed in op structure
 drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'nx_ctx' not 
described in 'nx_hcall_sync'
 drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'op' not 
described in 'nx_hcall_sync'
 drivers/crypto/nx/nx.c:43: warning: Function parameter or member 'may_sleep' 
not described in 'nx_hcall_sync'
 drivers/crypto/nx/nx.c:43: warning: expecting prototype for Nest Accelerators 
driver(). Prototype was for nx_hcall_sync() instead
 drivers/crypto/nx/nx.c:209: warning: Function parameter or member 'nbytes' not 
described in 'trim_sg_list'

Cc: "Breno Leitão" 
Cc: Nayna Jain 
Cc: Paulo Flabiano Smorigo 
Cc: Michael Ellerman 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: Kent Yoder 
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Signed-off-by: Lee Jones 
---
 drivers/crypto/nx/nx.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/nx/nx.c b/drivers/crypto/nx/nx.c
index 0d2dc5be7f192..010be6793c9fc 100644
--- a/drivers/crypto/nx/nx.c
+++ b/drivers/crypto/nx/nx.c
@@ -1,5 +1,5 @@
 // SPDX-License-Identifier: GPL-2.0-only
-/**
+/*
  * Routines supporting the Power 7+ Nest Accelerators driver
  *
  * Copyright (C) 2011-2012 International Business Machines Inc.
@@ -200,7 +200,8 @@ struct nx_sg *nx_walk_and_build(struct nx_sg   *nx_dst,
  * @sg: sg list head
  * @end: sg lisg end
  * @delta:  is the amount we need to crop in order to bound the list.
- *
+ * @nbytes: length of data in the scatterlists or data length - whichever
+ *  is greater.
  */
 static long int trim_sg_list(struct nx_sg *sg,
 struct nx_sg *end,
-- 
2.25.1



[PATCH 00/20] Rid W=1 warnings in Crypto

2021-02-04 Thread Lee Jones
This set is part of a larger effort attempting to clean-up W=1
kernel builds, which are currently overwhelmingly riddled with
niggly little warnings.

This is set 1 of 2 sets required to fully clean Crypto.

Lee Jones (20):
  crypto: hisilicon: sec_drv: Supply missing description for
'sec_queue_empty()'s 'queue' param
  crypto: bcm: util: Repair a couple of documentation formatting issues
  crypto: chelsio: chcr_core: File headers are not good candidates for
kernel-doc
  crypto: ux500: hash: hash_core: Fix worthy kernel-doc headers and
remove others
  crypto: bcm: spu: Fix formatting and misspelling issues
  crypto: keembay: ocs-hcu: Fix incorrectly named functions/structs
  crypto: bcm: spu2: Fix a whole host of kernel-doc misdemeanours
  crypto: ux500: cryp: Demote some conformant non-kernel headers fix
another
  crypto: ux500: cryp_irq: File headers are not good kernel-doc
candidates
  crypto: chelsio: chcr_algo: Fix a couple of kernel-doc issues caused
by doc-rot
  crypto: ux500: cryp_core: Fix formatting issue and add description for
'session_id'
  crypto: atmel-ecc: Struct headers need to start with keyword 'struct'
  crypto: bcm: cipher: Provide description for 'req' and fix formatting
issues
  crypto: caam: caampkc: Provide the name of the function
  crypto: caam: caamalg_qi2: Supply a couple of 'fallback' related
descriptions
  crypto: vmx: Source headers are not good kernel-doc candidates
  crypto: nx: nx-aes-cbc: Headers comments should not be kernel-doc
  crypto: nx: nx_debugfs: Header comments should not be kernel-doc
  crypto: nx: Demote header comment and add description for 'nbytes'
  crypto: cavium: nitrox_isr: Demote non-compliant kernel-doc headers

 drivers/crypto/atmel-ecc.c|  2 +-
 drivers/crypto/bcm/cipher.c   |  7 ++--
 drivers/crypto/bcm/spu.c  | 16 -
 drivers/crypto/bcm/spu2.c | 43 +--
 drivers/crypto/bcm/util.c |  4 +--
 drivers/crypto/caam/caamalg_qi2.c |  2 ++
 drivers/crypto/caam/caampkc.c |  3 +-
 drivers/crypto/cavium/nitrox/nitrox_isr.c |  4 +--
 drivers/crypto/chelsio/chcr_algo.c|  8 ++---
 drivers/crypto/chelsio/chcr_core.c|  2 +-
 drivers/crypto/hisilicon/sec/sec_drv.c|  1 +
 drivers/crypto/keembay/ocs-hcu.c  |  6 ++--
 drivers/crypto/nx/nx-aes-cbc.c|  2 +-
 drivers/crypto/nx/nx.c|  5 +--
 drivers/crypto/nx/nx_debugfs.c|  2 +-
 drivers/crypto/ux500/cryp/cryp.c  |  5 +--
 drivers/crypto/ux500/cryp/cryp_core.c |  5 +--
 drivers/crypto/ux500/cryp/cryp_irq.c  |  2 +-
 drivers/crypto/ux500/hash/hash_core.c | 15 +++-
 drivers/crypto/vmx/vmx.c  |  2 +-
 20 files changed, 71 insertions(+), 65 deletions(-)

Cc: Alexandre Belloni 
Cc: Andreas Westin 
Cc: Atul Gupta 
Cc: Aymen Sghaier 
Cc: Ayush Sawal 
Cc: Benjamin Herrenschmidt 
Cc: Berne Hebark 
Cc: "Breno Leitão" 
Cc: Daniele Alessandrelli 
Cc: "David S. Miller" 
Cc: Declan Murphy 
Cc: "Gustavo A. R. Silva" 
Cc: Harsh Jain 
Cc: Henrique Cerri 
Cc: Herbert Xu 
Cc: "Horia Geantă" 
Cc: Jitendra Lulla 
Cc: Joakim Bech 
Cc: Jonas Linde 
Cc: Jonathan Cameron 
Cc: Kent Yoder 
Cc: linux-arm-ker...@lists.infradead.org
Cc: linux-cry...@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: Ludovic Desroches 
Cc: Manoj Malviya 
Cc: Michael Ellerman 
Cc: M R Gowda 
Cc: Nayna Jain 
Cc: Nicolas Ferre 
Cc: Niklas Hernaeus 
Cc: Paul Mackerras 
Cc: Paulo Flabiano Smorigo 
Cc: Rob Rice 
Cc: Rohit Maheshwari 
Cc: Shujuan Chen 
Cc: Takashi Iwai 
Cc: Tudor Ambarus 
Cc: Vinay Kumar Yadav 
Cc: Zaibo Xu 
-- 
2.25.1



Re: [PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Ravi Bangoria




On 2/4/21 4:17 PM, Ravi Bangoria wrote:

Don't allow Uprobe on the 2nd word of a prefixed instruction. As per
ISA 3.1, a prefixed instruction must not cross a 64-byte boundary,
so don't allow Uprobe on such a prefixed instruction either.

There are two ways a probed instruction is changed in mapped pages.
First, when Uprobe is activated, it searches for all the relevant
pages and replaces the instruction in them. In this case, if we
notice that the probe is on the 2nd word of a prefixed instruction,
error out directly. Second, when Uprobe is already active and user
maps a relevant page via mmap(), the instruction is replaced via the
mmap() code path. But because the Uprobe is invalid, the entire
mmap() operation cannot be stopped. In this case just print an error
and continue.


@mpe,

arch_uprobe_analyze_insn() can return early if
cpu_has_feature(CPU_FTR_ARCH_31) is not set. But that would
miss the rare scenario of a user running a binary with prefixed
instructions on p10 predecessors. Please let me know if I
should add cpu_has_feature(CPU_FTR_ARCH_31) or not.

- Ravi


[PATCH v2] powerpc/uprobes: Validation for prefixed instruction

2021-02-04 Thread Ravi Bangoria
Don't allow Uprobe on the 2nd word of a prefixed instruction. As per
ISA 3.1, a prefixed instruction must not cross a 64-byte boundary,
so don't allow Uprobe on such a prefixed instruction either.

There are two ways a probed instruction is changed in mapped pages.
First, when Uprobe is activated, it searches for all the relevant
pages and replaces the instruction in them. In this case, if we
notice that the probe is on the 2nd word of a prefixed instruction,
error out directly. Second, when Uprobe is already active and user
maps a relevant page via mmap(), the instruction is replaced via the
mmap() code path. But because the Uprobe is invalid, the entire
mmap() operation cannot be stopped. In this case just print an error
and continue.

Signed-off-by: Ravi Bangoria 
---
v1: http://lore.kernel.org/r/20210119091234.76317-1-ravi.bango...@linux.ibm.com
v1->v2:
  - Instead of introducing new arch hook from verify_opcode(), use
existing hook arch_uprobe_analyze_insn().
  - Add explicit check for prefixed instruction crossing 64-byte
boundary. If probe is on such instruction, throw an error.

 arch/powerpc/kernel/uprobes.c | 66 ++-
 1 file changed, 65 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
index e8a63713e655..485d19a2a31f 100644
--- a/arch/powerpc/kernel/uprobes.c
+++ b/arch/powerpc/kernel/uprobes.c
@@ -7,6 +7,7 @@
  * Adapted from the x86 port by Ananth N Mavinakayanahalli 
  */
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -28,6 +29,69 @@ bool is_trap_insn(uprobe_opcode_t *insn)
return (is_trap(*insn));
 }
 
+#ifdef CONFIG_PPC64
+static int get_instr(struct mm_struct *mm, unsigned long addr, u32 *instr)
+{
+   struct page *page;
+   struct vm_area_struct *vma;
+   void *kaddr;
+   unsigned int gup_flags = FOLL_FORCE | FOLL_SPLIT_PMD;
+
+   if (get_user_pages_remote(mm, addr, 1, gup_flags, &page, &vma, NULL) <= 0)
+   return -EINVAL;
+
+   kaddr = kmap_atomic(page);
+   *instr = *((u32 *)(kaddr + (addr & ~PAGE_MASK)));
+   kunmap_atomic(kaddr);
+   put_page(page);
+   return 0;
+}
+
+static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr)
+{
+   struct ppc_inst inst;
+   u32 prefix, suffix;
+
+   /*
+* No need to check if addr is pointing to beginning of the
+* page. Even if probe is on a suffix of page-unaligned
+* prefixed instruction, hw will raise exception and kernel
+* will send SIGBUS.
+*/
+   if (!(addr & ~PAGE_MASK))
+   return 0;
+
+   if (get_instr(mm, addr, &prefix) < 0)
+   return -EINVAL;
+   if (get_instr(mm, addr + 4, &suffix) < 0)
+   return -EINVAL;
+
+   inst = ppc_inst_prefix(prefix, suffix);
+   if (ppc_inst_prefixed(inst) && (addr & 0x3F) == 0x3C) {
+   printk_ratelimited("Cannot register a uprobe on 64 byte "
+  "unaligned prefixed instruction\n");
+   return -EINVAL;
+   }
+
+   suffix = prefix;
+   if (get_instr(mm, addr - 4, &prefix) < 0)
+   return -EINVAL;
+
+   inst = ppc_inst_prefix(prefix, suffix);
+   if (ppc_inst_prefixed(inst)) {
+   printk_ratelimited("Cannot register a uprobe on the second "
+  "word of prefixed instruction\n");
+   return -EINVAL;
+   }
+   return 0;
+}
+#else
+static int validate_prefixed_instr(struct mm_struct *mm, unsigned long addr)
+{
+   return 0;
+}
+#endif
+
 /**
  * arch_uprobe_analyze_insn
  * @mm: the probed address space.
@@ -41,7 +105,7 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe,
if (addr & 0x03)
return -EINVAL;
 
-   return 0;
+   return validate_prefixed_instr(mm, addr);
 }
 
 /*
-- 
2.26.2



Re: [PATCH 01/13] powerpc/powernv: remove get_cxl_module

2021-02-04 Thread Michael Ellerman
Christoph Hellwig  writes:
> The static inline get_cxl_module function is entirely unused since commit
> 8bf6b91a5125a ("Revert "powerpc/powernv: Add support for the cxl kernel
> api on the real phb"), so remove it.
>
> Signed-off-by: Christoph Hellwig 
> Reviewed-by: Andrew Donnellan 
> ---
>  arch/powerpc/platforms/powernv/pci-cxl.c | 22 --
>  1 file changed, 22 deletions(-)

Acked-by: Michael Ellerman 

cheers


RE: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation

2021-02-04 Thread David Laight
From: Segher Boessenkool
> Sent: 03 February 2021 21:18
...
> Power9 does:
> 
>   Load with Update Instructions (RA = 0)
> EA is placed into R0.

Does that change the value of 0?
Rather reminds me of some 1960s era systems that had the small integers
at fixed (global) addresses.
FORTRAN always passes by reference, pass 0 and the address of the global
zero location was passed, the called function could change 0 to 1 for
the entire computer!

>   Load with Update Instructions (RA = RT)
> The storage operand addressed by EA is accessed. The displacement
> field is added to the data returned by the load and placed into RT.

Shame that isn't standard - could be used to optimise some code.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v7 39/42] powerpc: move NMI entry/exit code into wrapper

2021-02-04 Thread Michael Ellerman
Nicholas Piggin  writes:
> This moves the common NMI entry and exit code into the interrupt handler
> wrappers.
>
> This changes the behaviour of soft-NMI (watchdog) and HMI interrupts, and
> also MCE interrupts on 64e, by adding missing parts of the NMI entry to
> them.
>
> Signed-off-by: Nicholas Piggin 
> ---
>  arch/powerpc/include/asm/interrupt.h | 28 ++
>  arch/powerpc/kernel/mce.c| 11 -
>  arch/powerpc/kernel/traps.c  | 35 +---
>  arch/powerpc/kernel/watchdog.c   | 10 
>  4 files changed, 38 insertions(+), 46 deletions(-)

This is unhappy when injecting SLB multi-hits:

  root@p86-2:~# echo PPC_SLB_MULTIHIT > /sys/kernel/debug/provoke-crash/DIRECT
  [  312.496026][ T1344] kernel BUG at arch/powerpc/include/asm/interrupt.h:152!
  [  312.496037][ T1344] Oops: Exception in kernel mode, sig: 5 [#1]
  [  312.496045][ T1344] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
  [  312.496053][ T1344] Modules linked in: squashfs dm_multipath scsi_dh_rdac 
scsi_dh_alua pseries_rng rng_core vmx_crypto gf128mul crc32c_vpmsum ip_tables 
x_tables autofs4
  [  312.496085][ T1344] CPU: 19 PID: 1344 Comm: bash Not tainted 
5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty #638
  [  312.496096][ T1344] NIP:  c0ef1618 LR: c0ef1600 CTR: 

  [  312.496104][ T1344] REGS: c0001eb4ba00 TRAP: 0700   Not tainted  
(5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty)
  [  312.496113][ T1344] MSR:  80021033   CR: 
48428284  XER: 
  [  312.496132][ T1344] CFAR: c0ef28b8 IRQMASK: 3
  [  312.496132][ T1344] GPR00: c0ef15e4 c0001eb4bca0 
c1a53900 0001
  [  312.496132][ T1344] GPR04: c17e7230 4000 
3ffe 6d66
  [  312.496132][ T1344] GPR08: c0001ec6fe80 0001 
ce72d380 1001
  [  312.496132][ T1344] GPR12: 0010 c0001ec6fe80 
0100235ad1d0 00013c2fb738
  [  312.496132][ T1344] GPR16: 00013c210ae0  
2200 0100235af740
  [  312.496132][ T1344] GPR20:  0001 
00013c2a3ca0 c00033c50040
  [  312.496132][ T1344] GPR24: c000100fbd80  
c1a7dc78 0001
  [  312.496132][ T1344] GPR28: c0001eb4bd60 0001 
 0001
  [  312.496229][ T1344] NIP [c0ef1618] machine_check_early+0x138/0x1f0
  [  312.496241][ T1344] LR [c0ef1600] machine_check_early+0x120/0x1f0
  [  312.496249][ T1344] Call Trace:
  [  312.496254][ T1344] [c0001eb4bca0] [c0ef15e4] 
machine_check_early+0x104/0x1f0 (unreliable)
  [  312.496267][ T1344] [c0001eb4bcf0] [c0008394] 
machine_check_early_common+0x134/0x1f0
  [  312.496279][ T1344] --- interrupt: 200 at 
lkdtm_PPC_SLB_MULTIHIT+0x128/0x138
  [  312.496290][ T1344] NIP:  c09cfea8 LR: c09cfea0 CTR: 

  [  312.496298][ T1344] REGS: c0001eb4bd60 TRAP: 0200   Not tainted  
(5.11.0-rc2-gcc-8.2.0-00123-g3fadced17474-dirty)
  [  312.496307][ T1344] MSR:  80209033   CR: 
24428284  XER: 
  [  312.496326][ T1344] CFAR: 021c DAR: c0080388 DSISR: 
0080 IRQMASK: 0
  [  312.496326][ T1344] GPR00: c09cfea0 c000100fbb80 
c1a53900 c0080388
  [  312.496326][ T1344] GPR04: 00b0 c00401023440 
8e01c53300c0 00bf50d9
  [  312.496326][ T1344] GPR08: 0fff 0021 
001c c0080488
  [  312.496326][ T1344] GPR12: 48428222 c0001ec6fe80 
0100235ad1d0 00013c2fb738
  [  312.496326][ T1344] GPR16: 00013c210ae0  
2200 0100235af740
  [  312.496326][ T1344] GPR20:  0001 
00013c2a3ca0 c00033c50040
  [  312.496326][ T1344] GPR24: c000100fbd80  
0011 c1a2ffb8
  [  312.496326][ T1344] GPR28: 04b0 c00033c5 
c1105298 c0080388
  [  312.496427][ T1344] NIP [c09cfea8] 
lkdtm_PPC_SLB_MULTIHIT+0x128/0x138
  [  312.496437][ T1344] LR [c09cfea0] 
lkdtm_PPC_SLB_MULTIHIT+0x120/0x138
  [  312.496446][ T1344] --- interrupt: 200
  [  312.496451][ T1344] [c000100fbbf0] [c09cb530] 
lkdtm_do_action+0x40/0x80
  [  312.496462][ T1344] [c000100fbc10] [c09cbdfc] 
direct_entry+0x16c/0x350
  [  312.496473][ T1344] [c000100fbcc0] [c07a7590] 
full_proxy_write+0x90/0xe0
  [  312.496484][ T1344] [c000100fbd10] [c046b090] 
vfs_write+0xf0/0x310
  [  312.496496][ T1344] [c000100fbd60] [c046b48c] 
ksys_write+0x7c/0x140
  [  312.496507][ T1344] [c000100fbdb0] [c0036340] 
system_call_exception+0x1a0/0x2e0
  [  312.496518][ T1344] [c000100fbe10] [c000d360] 
system_call_common+0xf0/0x27c
  [  312.496528][ T1344] 

Re: [PATCH] powerpc64/idle: Fix SP offsets when saving GPRs

2021-02-04 Thread Nicholas Piggin
Excerpts from Christopher M. Riedl's message of February 4, 2021 4:59 pm:
> On Sat Jan 30, 2021 at 7:44 AM CST, Nicholas Piggin wrote:
>> Excerpts from Michael Ellerman's message of January 30, 2021 9:32 pm:
>> > "Christopher M. Riedl"  writes:
>> >> The idle entry/exit code saves/restores GPRs in the stack "red zone"
>> >> (Protected Zone according to PowerPC64 ELF ABI v2). However, the offset
>> >> used for the first GPR is incorrect and overwrites the back chain - the
>> >> Protected Zone actually starts below the current SP. In practice this is
>> >> probably not an issue, but it's still incorrect so fix it.
>> > 
>> > Nice catch.
>> > 
>> > Corrupting the back chain means you can't backtrace from there, which
>> > could be confusing for debugging one day.
>>
>> Yeah, we seem to have got away without noticing because the CPU will
>> wake up and return out of here before it tries to unwind the stack,
>> but if you tried to walk it by hand if the CPU got stuck in idle or
>> something, then we'd get confused.
>>
>> > It does make me wonder why we don't just create a stack frame and use
>> > the normal macros? It would use a bit more stack space, but we shouldn't
>> > be short of stack space when going idle.
>> > 
>> > Nick, was there a particular reason for using the red zone?
>>
>> I don't recall a particular reason, I think a normal stack frame is
>> probably a good idea.
> 
> I'll send a version using STACKFRAMESIZE - I assume that's the "normal"
> stack frame :)
> 

I think STACKFRAMESIZE is STACK_FRAME_OVERHEAD + NVGPRs. LR and CR can 
be saved in the caller's frame so that should be okay.

> I admit I am a bit confused when I saw the similar but much smaller
> STACK_FRAME_OVERHEAD which is also used in _some_ cases to save/restore
> a few registers.

Yeah if you don't need to save all nvgprs you can use caller's frame 
plus a few bytes in the minimum frame as volatile storage.

STACK_FRAME_OVERHEAD should be 32 on LE, but I think the problem is a 
lot of asm uses it and hasn't necessarily been audited to make sure it's 
not assuming it's bigger. You could actually use STACK_FRAME_MIN_SIZE
for new code, maybe we add a STACK_FRAME_MIN_NVGPR_SIZE to match and
use that.

Thanks,
Nick


Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C

2021-02-04 Thread Nicholas Piggin
Excerpts from Christophe Leroy's message of February 4, 2021 6:03 pm:
> 
> 
> Le 04/02/2021 à 04:27, Nicholas Piggin a écrit :
>> Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:
>>>
>>>
>>> Le 25/02/2020 à 18:35, Nicholas Piggin a écrit :
 Implement the bulk of interrupt return logic in C. The asm return code
 must handle a few cases: restoring full GPRs, and emulating stack store.

>>>
>>>
 +notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, 
 unsigned long msr)
 +{
 +  unsigned long *ti_flagsp = &current_thread_info()->flags;
 +  unsigned long flags;
 +
 +  if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
 +  unrecoverable_exception(regs);
 +  BUG_ON(regs->msr & MSR_PR);
 +  BUG_ON(!FULL_REGS(regs));
 +
 +  local_irq_save(flags);
 +
 +  if (regs->softe == IRQS_ENABLED) {
 +  /* Returning to a kernel context with local irqs enabled. */
 +  WARN_ON_ONCE(!(regs->msr & MSR_EE));
 +again:
 +  if (IS_ENABLED(CONFIG_PREEMPT)) {
 +  /* Return to preemptible kernel context */
 +  if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
 +  if (preempt_count() == 0)
 +  preempt_schedule_irq();
 +  }
 +  }
 +
 +  trace_hardirqs_on();
 +  __hard_EE_RI_disable();
 +  if (unlikely(lazy_irq_pending())) {
 +  __hard_RI_enable();
 +  irq_soft_mask_set(IRQS_ALL_DISABLED);
 +  trace_hardirqs_off();
 +  local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
 +  /*
 +   * Can't local_irq_enable in case we are in interrupt
 +   * context. Must replay directly.
 +   */
 +  replay_soft_interrupts();
 +  irq_soft_mask_set(flags);
 +  /* Took an interrupt, may have more exit work to do. */
 +  goto again;
 +  }
 +  local_paca->irq_happened = 0;
 +  irq_soft_mask_set(IRQS_ENABLED);
 +  } else {
 +  /* Returning to a kernel context with local irqs disabled. */
 +  trace_hardirqs_on();
 +  __hard_EE_RI_disable();
 +  if (regs->msr & MSR_EE)
 +  local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
 +  }
 +
 +
 +#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
 +  local_paca->tm_scratch = regs->msr;
 +#endif
 +
 +  /*
 +   * We don't need to restore AMR on the way back to userspace for KUAP.
 +   * The value of AMR only matters while we're in the kernel.
 +   */
 +  kuap_restore_amr(regs);
>>>
>>> Is that correct to restore KUAP state here ? Shouldn't we have it at lower 
>>> level in assembly ?
>>>
>>> Isn't there a risk that someone manages to call 
>>> interrupt_exit_kernel_prepare() or the end of it in
>>> a way or another, and get the previous KUAP state restored by this way ?
>> 
>> I'm not sure if there much more risk if it's here rather than the
>> instruction being in another place in the code.
>> 
>> There's a lot of user access around the kernel too if you want to find a
>> gadget to unlock KUAP then I suppose there is a pretty large attack
>> surface.
> 
> My understanding is that user access scope is strictly limited, for instance 
> we enforce the 
> begin/end of user access to be in the same function, and we refrain from 
> calling any other function 
> inside the user access window. x86 even have 'objtool' to enforce it at build 
> time. So in theory 
> there is no way to get out of the function while user access is open.
> 
> Here with the interrupt exit function it is free beer. You have a place where 
> you re-open user 
> access and return with a simple blr. So that's open bar. If someone manages 
> to just call the 
> interrupt exit function, then user access remains open

Hmm okay maybe that's a good point.

>>> Also, it looks a bit strange to have kuap_save_amr_and_lock() done at 
>>> lowest level in assembly, and
>>> kuap_restore_amr() done in upper level. That looks unbalanced.
>> 
>> I'd like to bring the entry assembly into C.
>> 
> 
> I really think it's not a good idea. We'll get better control and readability 
> by keeping it at the 
> lowest possible level in assembly.
> 
> x86 only save and restore SMAC state in assembly.

Okay we could go the other way and move the unlock into asm then.

Thanks,
Nick



Re: [PATCH RFC v1 5/6] xen-swiotlb: convert variables to arrays

2021-02-04 Thread Christoph Hellwig
So one thing that has been on my mind for a while:  I'd really like
to kill the separate dma ops in Xen swiotlb.  If we compare xen-swiotlb
to swiotlb the main difference seems to be:

 - additional reasons to bounce I/O vs the plain DMA capable
 - the possibility to do a hypercall on arm/arm64
 - an extra translation layer before doing the phys_to_dma and vice
   versa
 - an special memory allocator

I wonder if inbetween a few jump labels or other no overhead enablement
options and possibly better use of the dma_range_map we could kill
off most of swiotlb-xen instead of maintaining all this code duplication?
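
As a sketch of the jump-label idea (all names invented, not actual
proposed code):

	#include <linux/jump_label.h>

	DEFINE_STATIC_KEY_FALSE(swiotlb_xen_active);

	static inline bool swiotlb_needs_xen_bounce(phys_addr_t phys,
						    size_t size)
	{
		/* patched-out branch: zero cost when Xen is not active */
		if (!static_branch_unlikely(&swiotlb_xen_active))
			return false;
		/* hypothetical helper standing in for Xen's extra checks */
		return xen_phys_needs_bounce(phys, size);
	}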


Re: [PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation

2021-02-04 Thread Naveen N. Rao
On 2021/02/04 01:37PM, Sandipan Das wrote:
> The Power ISA says that the fixed-point load and update
> instructions must neither use R0 for the base address (RA)
> nor have the destination (RT) and the base address (RA) as
> the same register. Similarly, for fixed-point stores and
> floating-point loads and stores, the instruction is invalid
> when R0 is used as the base address (RA).
> 
> This is applicable to the following instructions.
>   * Load Byte and Zero with Update (lbzu)
>   * Load Byte and Zero with Update Indexed (lbzux)
>   * Load Halfword and Zero with Update (lhzu)
>   * Load Halfword and Zero with Update Indexed (lhzux)
>   * Load Halfword Algebraic with Update (lhau)
>   * Load Halfword Algebraic with Update Indexed (lhaux)
>   * Load Word and Zero with Update (lwzu)
>   * Load Word and Zero with Update Indexed (lwzux)
>   * Load Word Algebraic with Update Indexed (lwaux)
>   * Load Doubleword with Update (ldu)
>   * Load Doubleword with Update Indexed (ldux)
>   * Load Floating Single with Update (lfsu)
>   * Load Floating Single with Update Indexed (lfsux)
>   * Load Floating Double with Update (lfdu)
>   * Load Floating Double with Update Indexed (lfdux)
>   * Store Byte with Update (stbu)
>   * Store Byte with Update Indexed (stbux)
>   * Store Halfword with Update (sthu)
>   * Store Halfword with Update Indexed (sthux)
>   * Store Word with Update (stwu)
>   * Store Word with Update Indexed (stwux)
>   * Store Doubleword with Update (stdu)
>   * Store Doubleword with Update Indexed (stdux)
>   * Store Floating Single with Update (stfsu)
>   * Store Floating Single with Update Indexed (stfsux)
>   * Store Floating Double with Update (stfdu)
>   * Store Floating Double with Update Indexed (stfdux)
> 
> E.g. the following behaviour is observed for an invalid
> load and update instruction having RA = RT.
> 
> While a userspace program having an instruction word like
> 0xe9ce0001, i.e. ldu r14, 0(r14), runs without getting
> receiving a SIGILL on a Power system (observed on P8 and
> P9), the outcome of executing that instruction word varies
> and its behaviour can be considered to be undefined.
> 
> Attaching an uprobe at that instruction's address results
> in emulation which currently performs the load as well as
> writes the effective address back to the base register.
> This might not match the outcome from hardware.
> 
> To remove any inconsistencies, this adds additional checks
> for the aforementioned instructions to make sure that the
> emulation infrastructure treats them as unknown. The kernel
> can then fallback to executing such instructions on hardware.
> 
> Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in 
> emulate_step()")
> Signed-off-by: Sandipan Das 
> ---
> Previous versions can be found at:
> v3: 
> https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandi...@linux.ibm.com/
> v2: 
> https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandi...@linux.ibm.com/
> v1: 
> https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandi...@linux.ibm.com/
> 
> Changes in v4:
> - Fixed grammar and switch-case alignment.
> 
> Changes in v3:
> - Consolidated the checks as suggested by Naveen.
> - Consolidated load/store changes into a single patch.
> - Included floating-point load/store and update instructions.
> 
> Changes in v2:
> - Jump to unknown_opcode instead of returning -1 for invalid
>   instruction forms.
> 
> ---
>  arch/powerpc/lib/sstep.c | 14 ++
>  1 file changed, 14 insertions(+)

For the series:
Reviewed-by: Naveen N. Rao 

- Naveen
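
The ISA rule being enforced reduces to a small predicate; a sketch
(helper invented, the patch itself folds the checks into the
load/store cases in sstep.c):

	/*
	 * rt_check is true only for fixed-point update-form loads,
	 * where RA == RT is additionally invalid; update-form stores
	 * and FP accesses only forbid RA == 0.
	 */
	static bool invalid_update_form(bool rt_check, int ra, int rt)
	{
		if (ra == 0)
			return true;
		if (rt_check && ra == rt)
			return true;
		return false;
	}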



Re: [PATCH v2 1/3] powerpc: sstep: Fix load and update emulation

2021-02-04 Thread Naveen N. Rao
On 2021/02/03 03:17PM, Segher Boessenkool wrote:
> On Wed, Feb 03, 2021 at 03:19:09PM +0530, Naveen N. Rao wrote:
> > On 2021/02/03 12:08PM, Sandipan Das wrote:
> > > The Power ISA says that the fixed-point load and update
> > > instructions must neither use R0 for the base address (RA)
> > > nor have the destination (RT) and the base address (RA) as
> > > the same register. In these cases, the instruction is
> > > invalid.
> 
> > > However, the following behaviour is observed using some
> > > invalid opcodes where RA = RT.
> > > 
> > > A userspace program using an invalid instruction word like
> > > 0xe9ce0001, i.e. "ldu r14, 0(r14)", runs and exits without
> > > getting terminated abruptly. The instruction performs the
> > > load operation but does not write the effective address to
> > > the base address register. 
> > 
> > While the processor (p8 in my test) doesn't seem to be throwing an 
> > exception, I don't think it is necessarily loading the value. Qemu 
> > throws an exception though. It's probably best to term the behavior as 
> > being undefined.
> 
> Power8 does:
> 
>   Load with Update Instructions (RA = 0)
> EA is placed into R0.
>   Load with Update Instructions (RA = RT)
> EA is placed into RT. The storage operand addressed by EA is
> accessed, but the data returned by the load is discarded.

I'm actually not seeing that. This is what I am testing with:
li  8,0xaaa
mr  6,1
std 8,64(6)
#ldu	6,64(6)
.long   0xe8c60041

And, r6 always ends up with 0xaea. It changes with the value I put into 
r6 though.

Granted, this is all up in the air, but it does look like there is more 
going on and the value isn't the EA or the value at the address.
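
Decoding 0xe8c60041 for reference: primary opcode 58 (the ld family),
RT = 6, RA = 6, DS = 16 (so a displacement of 64) and XO = 1, i.e.
ldu r6, 64(r6). Note that 0xaea is 0xaaa + 64, which happens to match
the Power9 description quoted below: the displacement added to the
loaded data and placed into RT.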

> 
> Power9 does:
> 
>   Load with Update Instructions (RA = 0)
> EA is placed into R0.
>   Load with Update Instructions (RA = RT)
> The storage operand addressed by EA is accessed. The displacement
> field is added to the data returned by the load and placed into RT.
> 
> Both UMs also say
> 
>   Invalid Forms
> In general, the POWER9 core handles invalid forms of instructions in
> the manner that is most convenient for the particular case (within
> the scope of meeting the boundedly-undefined definition described in
> the Power ISA). This document specifies the behavior for these
> cases.  However, it is not recommended that software or other system
> facilities make use of the POWER9 behavior in these cases because
> such behavior might be different in another processor that
> implements the Power ISA.
> 
> (or POWER8 instead of POWER9 of course).  Always complaining about most
> invalid forms seems wise, certainly if not all recent CPUs behave the
> same :-)

Agreed.

- Naveen



Re: [PATCH v2 0/5] shoot lazy tlbs

2021-02-04 Thread Nicholas Piggin
I'll ask Andrew to put this in -mm if no objections.

The series now doesn't touch other archs in non-trivial ways, and core code
is functionally not changed much, if at all, when the option is not selected,
so it's actually pretty simple aside from the powerpc change.

Thanks,
Nick

Excerpts from Nicholas Piggin's message of December 14, 2020 4:53 pm:
> This is another rebase, on top of mainline now (don't need the
> asm-generic tree), and without any x86 or membarrier changes.
> This makes the series far smaller and more manageable and
> without the controversial bits.
> 
> Thanks,
> Nick
> 
> Nicholas Piggin (5):
>   lazy tlb: introduce lazy mm refcount helper functions
>   lazy tlb: allow lazy tlb mm switching to be configurable
>   lazy tlb: shoot lazies, a non-refcounting lazy tlb option
>   powerpc: use lazy mm refcount helper functions
>   powerpc/64s: enable MMU_LAZY_TLB_SHOOTDOWN
> 
>  arch/Kconfig | 30 ++
>  arch/arm/mach-rpc/ecard.c|  2 +-
>  arch/powerpc/Kconfig |  1 +
>  arch/powerpc/kernel/smp.c|  2 +-
>  arch/powerpc/mm/book3s64/radix_tlb.c |  4 +-
>  fs/exec.c|  4 +-
>  include/linux/sched/mm.h | 20 +++
>  kernel/cpu.c |  2 +-
>  kernel/exit.c|  2 +-
>  kernel/fork.c| 52 
>  kernel/kthread.c | 11 ++--
>  kernel/sched/core.c  | 88 
>  kernel/sched/sched.h |  4 +-
>  13 files changed, 184 insertions(+), 38 deletions(-)
> 
> -- 
> 2.23.0
> 
> 
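
For readers skimming the diffstat: patch 1's helpers are thin wrappers
that let the lazy tlb mm refcounting be compiled out. A minimal sketch,
assuming the mmgrab_lazy_tlb()/mmdrop_lazy_tlb() names and a
CONFIG_MMU_LAZY_TLB_REFCOUNT gate (treat the exact names as illustrative):

/* include/linux/sched/mm.h (sketch) */
static inline void mmgrab_lazy_tlb(struct mm_struct *mm)
{
        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                mmgrab(mm);
}

static inline void mmdrop_lazy_tlb(struct mm_struct *mm)
{
        if (IS_ENABLED(CONFIG_MMU_LAZY_TLB_REFCOUNT))
                mmdrop(mm);
}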


[PATCH v4 2/2] powerpc: sstep: Fix darn emulation

2021-02-04 Thread Sandipan Das
Commit 8813ff49607e ("powerpc/sstep: Check instruction
validity against ISA version before emulation") introduced
a proper way to skip unknown instructions. This makes sure
that the same mechanism is used for the darn instruction when
the range selection bits have a reserved value.

Fixes: a23987ef267a ("powerpc: sstep: Add support for darn instruction")
Signed-off-by: Sandipan Das 
---
 arch/powerpc/lib/sstep.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 11f14b209d7f..683f7c20f74b 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -1916,7 +1916,7 @@ int analyse_instr(struct instruction_op *op, const struct 
pt_regs *regs,
goto compute_done;
}
 
-   return -1;
+   goto unknown_opcode;
 #ifdef __powerpc64__
case 777:   /* modsd */
if (!cpu_has_feature(CPU_FTR_ARCH_300))
-- 
2.25.1
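
For context, the replaced return sits at the end of the darn handling in
analyse_instr(). A simplified sketch of that region, with the emulation
details elided (treat this as illustrative rather than the exact code):

case 755:       /* darn */
        if (!cpu_has_feature(CPU_FTR_ARCH_300))
                goto unknown_opcode;
        switch (ra & 0x3) {
        case 0:         /* 32-bit conditioned */
        case 1:         /* 64-bit conditioned */
        case 2:         /* 64-bit raw */
                /* execute darn with the selected L value into rd */
                goto compute_done;
        }

        /* L = 3 is reserved; this was "return -1", and is now: */
        goto unknown_opcode;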



[PATCH v4 1/2] powerpc: sstep: Fix load-store and update emulation

2021-02-04 Thread Sandipan Das
The Power ISA says that the fixed-point load and update
instructions must neither use R0 for the base address (RA)
nor have the destination (RT) and the base address (RA) as
the same register. Similarly, for fixed-point stores and
floating-point loads and stores, the instruction is invalid
when R0 is used as the base address (RA).

This is applicable to the following instructions.
  * Load Byte and Zero with Update (lbzu)
  * Load Byte and Zero with Update Indexed (lbzux)
  * Load Halfword and Zero with Update (lhzu)
  * Load Halfword and Zero with Update Indexed (lhzux)
  * Load Halfword Algebraic with Update (lhau)
  * Load Halfword Algebraic with Update Indexed (lhaux)
  * Load Word and Zero with Update (lwzu)
  * Load Word and Zero with Update Indexed (lwzux)
  * Load Word Algebraic with Update Indexed (lwaux)
  * Load Doubleword with Update (ldu)
  * Load Doubleword with Update Indexed (ldux)
  * Load Floating Single with Update (lfsu)
  * Load Floating Single with Update Indexed (lfsux)
  * Load Floating Double with Update (lfdu)
  * Load Floating Double with Update Indexed (lfdux)
  * Store Byte with Update (stbu)
  * Store Byte with Update Indexed (stbux)
  * Store Halfword with Update (sthu)
  * Store Halfword with Update Indexed (sthux)
  * Store Word with Update (stwu)
  * Store Word with Update Indexed (stwux)
  * Store Doubleword with Update (stdu)
  * Store Doubleword with Update Indexed (stdux)
  * Store Floating Single with Update (stfsu)
  * Store Floating Single with Update Indexed (stfsux)
  * Store Floating Double with Update (stfdu)
  * Store Floating Double with Update Indexed (stfdux)

E.g. the following behaviour is observed for an invalid
load and update instruction having RA = RT.

A userspace program containing an instruction word like
0xe9ce0001, i.e. ldu r14, 0(r14), runs without receiving
a SIGILL on a Power system (observed on P8 and P9), but
the outcome of executing that instruction word varies and
its behaviour can be considered undefined.
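
A minimal userspace reproducer, as an illustrative sketch (per the
above, the final value of r14 is whatever the hardware decides):

#include <stdio.h>

int main(void)
{
        unsigned long buf[1] = { 0x1234 };
        register unsigned long r14 asm("r14") = (unsigned long)buf;

        /* .long 0xe9ce0001 is ldu r14, 0(r14): invalid form, RA == RT */
        asm volatile(".long 0xe9ce0001" : "+r" (r14) : : "memory");

        printf("r14 = 0x%lx\n", r14);   /* undefined: data, EA, or other */
        return 0;
}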

Attaching a uprobe at that instruction's address results
in emulation, which currently performs the load and also
writes the effective address back to the base register.
This might not match the outcome from hardware.

To remove any inconsistencies, this adds additional checks
for the aforementioned instructions to make sure that the
emulation infrastructure treats them as unknown. The kernel
can then fall back to executing such instructions on hardware.

Fixes: 0016a4cf5582 ("powerpc: Emulate most Book I instructions in 
emulate_step()")
Signed-off-by: Sandipan Das 
---
Previous versions can be found at:
v3: 
https://lore.kernel.org/linuxppc-dev/20210204071432.116439-1-sandi...@linux.ibm.com/
v2: 
https://lore.kernel.org/linuxppc-dev/20210203063841.431063-1-sandi...@linux.ibm.com/
v1: 
https://lore.kernel.org/linuxppc-dev/20201119054139.244083-1-sandi...@linux.ibm.com/

Changes in v4:
- Fixed grammar and switch-case alignment.

Changes in v3:
- Consolidated the checks as suggested by Naveen.
- Consolidated load/store changes into a single patch.
- Included floating-point load/store and update instructions.

Changes in v2:
- Jump to unknown_opcode instead of returning -1 for invalid
  instruction forms.

---
 arch/powerpc/lib/sstep.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index e96cff845ef7..11f14b209d7f 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -3017,6 +3017,20 @@ int analyse_instr(struct instruction_op *op, const 
struct pt_regs *regs,
 
}
 
+   if (OP_IS_LOAD_STORE(op->type) && (op->type & UPDATE)) {
+   switch (GETTYPE(op->type)) {
+   case LOAD:
+   if (ra == rd)
+   goto unknown_opcode;
+   fallthrough;
+   case STORE:
+   case LOAD_FP:
+   case STORE_FP:
+   if (ra == 0)
+   goto unknown_opcode;
+   }
+   }
+
 #ifdef CONFIG_VSX
if ((GETTYPE(op->type) == LOAD_VSX ||
 GETTYPE(op->type) == STORE_VSX) &&
-- 
2.25.1
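
The hardware fallback mentioned in the commit message is up to
emulate_step()'s callers; schematically, the pattern used by the probe
handlers looks like this (a sketch, not the exact kernel code):

        ret = emulate_step(regs, insn);
        if (ret > 0)
                return 1;       /* emulated; regs advanced past the insn */

        /*
         * ret <= 0: not emulated, which with this patch now includes
         * the invalid with-update forms. The caller falls back to
         * single-stepping the original instruction on hardware.
         */
        return 0;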



Re: [PATCH v3 28/32] powerpc/64s: interrupt implement exit logic in C

2021-02-04 Thread Christophe Leroy




On 04/02/2021 at 04:27, Nicholas Piggin wrote:

Excerpts from Christophe Leroy's message of February 4, 2021 2:25 am:



On 25/02/2020 at 18:35, Nicholas Piggin wrote:

Implement the bulk of interrupt return logic in C. The asm return code
must handle a few cases: restoring full GPRs, and emulating stack store.





+notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs, 
unsigned long msr)
+{
+   unsigned long *ti_flagsp = &current_thread_info()->flags;
+   unsigned long flags;
+
+   if (IS_ENABLED(CONFIG_PPC_BOOK3S) && unlikely(!(regs->msr & MSR_RI)))
+   unrecoverable_exception(regs);
+   BUG_ON(regs->msr & MSR_PR);
+   BUG_ON(!FULL_REGS(regs));
+
+   local_irq_save(flags);
+
+   if (regs->softe == IRQS_ENABLED) {
+   /* Returning to a kernel context with local irqs enabled. */
+   WARN_ON_ONCE(!(regs->msr & MSR_EE));
+again:
+   if (IS_ENABLED(CONFIG_PREEMPT)) {
+   /* Return to preemptible kernel context */
+   if (unlikely(*ti_flagsp & _TIF_NEED_RESCHED)) {
+   if (preempt_count() == 0)
+   preempt_schedule_irq();
+   }
+   }
+
+   trace_hardirqs_on();
+   __hard_EE_RI_disable();
+   if (unlikely(lazy_irq_pending())) {
+   __hard_RI_enable();
+   irq_soft_mask_set(IRQS_ALL_DISABLED);
+   trace_hardirqs_off();
+   local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
+   /*
+* Can't local_irq_enable in case we are in interrupt
+* context. Must replay directly.
+*/
+   replay_soft_interrupts();
+   irq_soft_mask_set(flags);
+   /* Took an interrupt, may have more exit work to do. */
+   goto again;
+   }
+   local_paca->irq_happened = 0;
+   irq_soft_mask_set(IRQS_ENABLED);
+   } else {
+   /* Returning to a kernel context with local irqs disabled. */
+   trace_hardirqs_on();
+   __hard_EE_RI_disable();
+   if (regs->msr & MSR_EE)
+   local_paca->irq_happened &= ~PACA_IRQ_HARD_DIS;
+   }
+
+
+#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
+   local_paca->tm_scratch = regs->msr;
+#endif
+
+   /*
+* We don't need to restore AMR on the way back to userspace for KUAP.
+* The value of AMR only matters while we're in the kernel.
+*/
+   kuap_restore_amr(regs);


Is it correct to restore the KUAP state here? Shouldn't we do it at a lower
level, in assembly?

Isn't there a risk that someone manages to call interrupt_exit_kernel_prepare(),
or jump to the end of it, one way or another, and get the previous KUAP state
restored that way?


I'm not sure there is much more risk with it here rather than with the
instruction being in another place in the code.

There's a lot of user access around the kernel too; if you want to find a
gadget to unlock KUAP then I suppose there is a pretty large attack
surface.


My understanding is that user access scope is strictly limited, for instance we enforce the 
begin/end of user access to be in the same function, and we refrain from calling any other function 
inside the user access window. x86 even has 'objtool' to enforce this at build time. So in theory 
there is no way to get out of the function while user access is open.
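
The discipline being described looks roughly like this, sketched with the
generic user access window API:

        if (!user_access_begin(uptr, sizeof(*uptr)))
                return -EFAULT;
        /* no calls to other functions inside the window */
        unsafe_put_user(val, uptr, Efault);
        user_access_end();
        return 0;

Efault:
        user_access_end();
        return -EFAULT;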


Here, with the interrupt exit function, it is free beer: there is a place that re-opens user 
access and then returns with a simple blr. So that's open bar. If someone manages to just call 
into the interrupt exit function, user access remains open.





Also, it looks a bit strange to have kuap_save_amr_and_lock() done at the lowest
level, in assembly, while kuap_restore_amr() is done at an upper level. That
looks unbalanced.


I'd like to bring the entry assembly into C.



I really think it's not a good idea. We'll get better control and readability by keeping it at the 
lowest possible level in assembly.


x86 only saves and restores SMAP state in assembly.

Christophe