[GIT PULL] Please pull powerpc/linux.git powerpc-5.3-5 tag

2019-09-05 Thread Michael Ellerman

Hi Linus,

Please pull some more powerpc fixes for 5.3:

The following changes since commit ed4289e8b48845888ee46377bd2b55884a55e60b:

  Revert "powerpc: slightly improve cache helpers" (2019-07-31 22:56:27 +1000)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.3-5

for you to fetch changes up to a8318c13e79badb92bc6640704a64cc022a6eb97:

  powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts (2019-09-04 22:31:13 +1000)

----------------------------------------------------------------
powerpc fixes for 5.3 #5

One fix for a boot hang on some Freescale machines when PREEMPT is enabled.

Two CVE fixes for bugs in our handling of FP registers and transactional memory,
both of which can result in corrupted FP state, or FP state leaking between
processes.

Thanks to:
  Chris Packham, Christophe Leroy, Gustavo Romero, Michael Neuling.

----------------------------------------------------------------
Christophe Leroy (1):
  powerpc/64e: Drop stale call to smp_processor_id() which hangs SMP startup

Gustavo Romero (2):
  powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction
  powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts


 arch/powerpc/kernel/process.c | 21 ++++-----------------
 arch/powerpc/mm/nohash/tlb.c  |  1 -
 2 files changed, 4 insertions(+), 18 deletions(-)


Re: [PATCH v2 3/4] powerpc/numa: Early request for home node associativity

2019-09-05 Thread Srikar Dronamraju
> >
> > While here, fix a problem where of_node_put could be called even when
> > of_get_cpu_node was not successful.
> 
> of_node_put() handles NULL arguments, so this should not be necessary.
> 

Ok 

> > @@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
> > reset_numa_cpu_lookup_table();
> >  
> > for_each_present_cpu(cpu)
> > -   numa_setup_cpu(cpu);
> > +   numa_setup_cpu(cpu, false);
> >  }
> 
> I'm open to other points of view here, but I would prefer two separate
> functions, something like vphn_get_nid() for runtime and
> vphn_get_nid_early() (which could be __init) for boot-time
> initialization. Propagating a somewhat unexpressive boolean flag through
> two levels of function calls in this code is unappealing...
> 

Somehow I'm not convinced that we need to duplicate the function just to
avoid passing a bool.

If propagating a boolean flag through two levels of function calls is an
issue, we could handle the logic within numa_setup_cpu() itself.

Something like this:

static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)
{
	int nid = NUMA_NO_NODE;

	if (firmware_has_feature(FW_FEATURE_VPHN)) {
		long hwid;

		if (get_hwid)
			hwid = get_hard_smp_processor_id(lcpu);
		else
			hwid = cpu_to_phys_id[lcpu];

		nid = vphn_get_nid(lcpu, hwid);
	}


Would this help?

> Regardless, I have an annoying question :-) Isn't it possible that,
> while Linux is calling vphn_get_nid() for each logical cpu in sequence,
> the platform could change a virtual processor's node assignment,
> potentially causing sibling threads to get different node assignments
> and producing an incoherent topology (which then leads to sched domain
> assertions etc)?
> 

Right, it's certainly possible for the node assignment to change while we
iterate through the siblings. Do you have any recommendations?

> If so, I think more care is needed. The algorithm should make the vphn
> call only once per cpu node, I think?

I didn't get "once per cpu node". How do we know which cpus are part of
that cpu node? Or did you mean once per cpu core? If the latter, would
something like the sketch below fit?
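
(Rough sketch, declarations omitted; cpu_first_thread_sibling(),
threads_per_core and map_cpu_to_node() as already used around this file:)

	/*
	 * Query VPHN once for the first thread of each core, then apply
	 * the node to all its siblings, so sibling threads can never end
	 * up with different node assignments mid-update.
	 */
	for_each_present_cpu(cpu) {
		if (cpu != cpu_first_thread_sibling(cpu))
			continue;		/* one query per core */

		nid = numa_setup_cpu(cpu, false);
		for (t = cpu + 1; t < cpu + threads_per_core; t++)
			map_cpu_to_node(t, nid);
	}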

-- 
Thanks and Regards
Srikar Dronamraju



Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Al Viro
On Fri, Sep 06, 2019 at 12:49:44AM +0100, Al Viro wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > +   return -EFAULT;
> > > > +   }
> > > > +   /* Copy the interoperable parts of the struct. */
> > > > +   if (__copy_to_user(dst, src, size))
> > > > +   return -EFAULT;
> > > 
> > > Why not simply clear_user() and copy_to_user()?
> > 
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
> 
> I'm asking why bother with __ and separate access_ok().
> 
> > >   if ((unsigned long)addr & 1) {
> > >   u8 v;
> > >   if (get_user(v, (__u8 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   addr++;
> > >   }
> > >   if ((unsigned long)addr & 2) {
> > >   u16 v;
> > >   if (get_user(v, (__u16 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   addr +=2;
> > >   }
> > >   if ((unsigned long)addr & 4) {
> > >   u32 v;
> > >   if (get_user(v, (__u32 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   }
> > >   
> 
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down.  Then mask the
> bytes you don't want to check out and see if there's anything
> left.
> 
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
> 
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > > an overkill in this case...
> > 
> > Why is memchr_inv() overkill?
> 
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be.  All you need is the same logics
> as in "from userland" case
>   if (!count)
>   return true;
>   offset = (unsigned long)from & 7;
>   p = (u64 *)(from - offset);
>   v = *p++;
>   if (offset) {   // unaligned
>   count += offset;
>   v &= ~aligned_byte_mask(offset); // see strnlen_user.c
>   }
>   while (count > 8) {
>   if (v)
>   return false;
>   v = *p++;
>   count -= 8;
>   }
>   if (count != 8)
>   v &= aligned_byte_mask(count);
>   return v == 0;
> 
> All there is to it...

... and __user case would be pretty much this with
if (user_access_begin(from, count)) {

user_access_end();
}
wrapped around the damn thing - again, see strnlen_user.c, with
unsafe_get_user(v, p++, efault);
instead of those
v = *p++;

Calling conventions might need some thinking - it might be
* all read, all zeroes
* non-zero found
* read failed
so we probably want to map the "all zeroes" case to 0,
"read failed" to -EFAULT and "non-zero found" to something
else.  Might be positive, might be some other -E - not
sure if E2BIG (or EFBIG) makes much sense here.  Need to
look at the users...
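
Putting those pieces together, a rough sketch of the __user side (the
error mapping is provisional per the above; aligned_byte_mask() as in
strnlen_user.c, and the rounded-up read range relies on page granularity
of readability):

static int is_all_zeroes_user(const void __user *from, size_t count)
{
	const unsigned long __user *p;
	unsigned long offset, v;

	if (!count)
		return 0;
	offset = (unsigned long)from & 7;
	p = (const unsigned long __user *)(from - offset);
	/* page size is a multiple of 8, so the rounded-up range is
	 * readable iff the original one is */
	if (!user_access_begin(p, round_up(count + offset, 8)))
		return -EFAULT;
	unsafe_get_user(v, p++, efault);
	if (offset) {			/* unaligned head: mask it out */
		count += offset;
		v &= ~aligned_byte_mask(offset);
	}
	while (count > 8) {
		if (unlikely(v))
			goto e2big;
		unsafe_get_user(v, p++, efault);
		count -= 8;
	}
	if (count != 8)			/* partial tail: keep low bytes */
		v &= aligned_byte_mask(count);
	user_access_end();
	return v ? -E2BIG : 0;
e2big:
	user_access_end();
	return -E2BIG;
efault:
	user_access_end();
	return -EFAULT;
}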


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-06, Al Viro  wrote:
> On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > > +   return -EFAULT;
> > > > +   }
> > > > +   /* Copy the interoperable parts of the struct. */
> > > > +   if (__copy_to_user(dst, src, size))
> > > > +   return -EFAULT;
> > > 
> > > Why not simply clear_user() and copy_to_user()?
> > 
> > I'm not sure I understand what you mean -- are you asking why we need to
> > do memchr_inv(src + size, 0, rest) earlier?
> 
> I'm asking why bother with __ and separate access_ok().

Ah right, it was a dumb "optimisation" -- we need to do access_ok()
anyway so that we can return -EFAULT early. I've dropped the __
usages in my working copy.

> > >   if ((unsigned long)addr & 1) {
> > >   u8 v;
> > >   if (get_user(v, (__u8 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   addr++;
> > >   }
> > >   if ((unsigned long)addr & 2) {
> > >   u16 v;
> > >   if (get_user(v, (__u16 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   addr +=2;
> > >   }
> > >   if ((unsigned long)addr & 4) {
> > >   u32 v;
> > >   if (get_user(v, (__u32 __user *)addr))
> > >   return -EFAULT;
> > >   if (v)
> > >   return -E2BIG;
> > >   }
> > >   
> 
> Actually, this is a dumb way to do it - page size on anything
> is going to be a multiple of 8, so you could just as well
> read 8 bytes from an address aligned down.  Then mask the
> bytes you don't want to check out and see if there's anything
> left.
> 
> You can have readability boundaries inside a page - it's either
> the entire page (let alone a single word) being readable, or
> it's EFAULT for all parts.
> 
> > > would be saner, and things like x86 could trivially add an
> > > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > > an overkill in this case...
> > 
> > Why is memchr_inv() overkill?
> 
> Look at its implementation; you only care if there are
> non-zeroes, you don't give a damn where in the buffer
> the first one would be.  All you need is the same logics
> as in "from userland" case
>   if (!count)
>   return true;
>   offset = (unsigned long)from & 7
>   p = (u64 *)(from - offset);
>   v = *p++;
>   if (offset) {   // unaligned
>   count += offset;
>   v &= ~aligned_byte_mask(offset); // see strnlen_user.c
>   }
>   while (count > 8) {
>   if (v)
>   return false;
>   v = *p++;
>   count -= 8;
>   }
>   if (count != 8)
>   v &= aligned_byte_mask(count);
>   return v == 0;
> 
> All there is to it...

Alright, will do (for some reason I hadn't made the connection that
memchr_inv() is doing effectively the same word-by-word comparison, but
also detecting where the first non-zero byte is).

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Al Viro
On Fri, Sep 06, 2019 at 09:00:03AM +1000, Aleksa Sarai wrote:
> > > + return -EFAULT;
> > > + }
> > > + /* Copy the interoperable parts of the struct. */
> > > + if (__copy_to_user(dst, src, size))
> > > + return -EFAULT;
> > 
> > Why not simply clear_user() and copy_to_user()?
> 
> I'm not sure I understand what you mean -- are you asking why we need to
> do memchr_inv(src + size, 0, rest) earlier?

I'm asking why bother with __ and separate access_ok().

> > if ((unsigned long)addr & 1) {
> > u8 v;
> > if (get_user(v, (__u8 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > addr++;
> > }
> > if ((unsigned long)addr & 2) {
> > u16 v;
> > if (get_user(v, (__u16 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > addr +=2;
> > }
> > if ((unsigned long)addr & 4) {
> > u32 v;
> > if (get_user(v, (__u32 __user *)addr))
> > return -EFAULT;
> > if (v)
> > return -E2BIG;
> > }
> > 

Actually, this is a dumb way to do it - page size on anything
is going to be a multiple of 8, so you could just as well
read 8 bytes from an address aligned down.  Then mask the
bytes you don't want to check out and see if there's anything
left.

You can have readability boundaries inside a page - it's either
the entire page (let alone a single word) being readable, or
it's EFAULT for all parts.

> > would be saner, and things like x86 could trivially add an
> > asm variant - it's not hard.  Incidentally, memchr_inv() is
> > an overkill in this case...
> 
> Why is memchr_inv() overkill?

Look at its implementation; you only care if there are
non-zeroes, you don't give a damn where in the buffer
the first one would be.  All you need is the same logics
as in "from userland" case
	if (!count)
		return true;
	offset = (unsigned long)from & 7;
	p = (u64 *)(from - offset);
	v = *p++;
	if (offset) {	// unaligned
		count += offset;
		v &= ~aligned_byte_mask(offset); // see strnlen_user.c
	}
	while (count > 8) {
		if (v)
			return false;
		v = *p++;
		count -= 8;
	}
	if (count != 8)
		v &= aligned_byte_mask(count);
	return v == 0;

All there is to it...


Re: [PATCH] KVM: PPC: Book3S HV: add smp_mb() in kvmppc_set_host_ipi()

2019-09-05 Thread Michael Roth
Quoting Michael Ellerman (2019-09-04 22:04:48)
> Hi Mike,
> 
> Thanks for the patch & great change log, just a few comments.

Hi Michael,

Thank you for the suggestions. I will roll them into v2 unless otherwise
noted below.

> 
> Michael Roth  writes:
> > On a 2-socket Witherspoon system with 128 cores and 1TB of memory
>                 ^
>                 Power9 (not everyone knows what a Witherspoon is)
> 
> > running the following guest configs:
> >
> >   guest A:
> > - 224GB of memory
> > - 56 VCPUs (sockets=1,cores=28,threads=2), where:
> >   VCPUs 0-1 are pinned to CPUs 0-3,
> >   VCPUs 2-3 are pinned to CPUs 4-7,
> >   ...
> >   VCPUs 54-55 are pinned to CPUs 108-111
> >
> >   guest B:
> > - 4GB of memory
> > - 4 VCPUs (sockets=1,cores=4,threads=1)
> >
> > with the following workloads (with KSM and THP enabled in all):
> >
> >   guest A:
> > stress --cpu 40 --io 20 --vm 20 --vm-bytes 512M
> >
> >   guest B:
> > stress --cpu 4 --io 4 --vm 4 --vm-bytes 512M
> >
> >   host:
> > stress --cpu 4 --io 4 --vm 2 --vm-bytes 256M
> >
> > the below soft-lockup traces were observed after an hour or so and
> > persisted until the host was reset (this was found to be reliably
> > reproducible for this configuration, for kernels 4.15, 4.18, 5.0,
> > and 5.3-rc5):
> >
> >   [ 1253.183290] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1253.183319] rcu: 124-: (5250 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=1941
> >   [ 1256.287426] watchdog: BUG: soft lockup - CPU#105 stuck for 23s! [CPU 
> > 52/KVM:19709]
> >   [ 1264.075773] watchdog: BUG: soft lockup - CPU#24 stuck for 23s! 
> > [worker:19913]
> >   [ 1264.079769] watchdog: BUG: soft lockup - CPU#31 stuck for 23s! 
> > [worker:20331]
> >   [ 1264.095770] watchdog: BUG: soft lockup - CPU#45 stuck for 23s! 
> > [worker:20338]
> >   [ 1264.131773] watchdog: BUG: soft lockup - CPU#64 stuck for 23s! 
> > [avocado:19525]
> >   [ 1280.408480] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1316.198012] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1316.198032] rcu: 124-: (21003 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=8243
> >   [ 1340.411024] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1379.212609] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1379.212629] rcu: 124-: (36756 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=14714
> >   [ 1404.413615] watchdog: BUG: soft lockup - CPU#124 stuck for 22s! 
> > [ksmd:791]
> >   [ 1442.227095] rcu: INFO: rcu_sched self-detected stall on CPU
> >   [ 1442.227115] rcu: 124-: (52509 ticks this GP) 
> > idle=10a/1/0x4002 softirq=5408/5408 fqs=21403
> >   [ 1455.111787] INFO: task worker:19907 blocked for more than 120 seconds.
> >   [ 1455.111822]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111833] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.111884] INFO: task worker:19908 blocked for more than 120 seconds.
> >   [ 1455.111905]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111925] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.111966] INFO: task worker:20328 blocked for more than 120 seconds.
> >   [ 1455.111986]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.111998] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112048] INFO: task worker:20330 blocked for more than 120 seconds.
> >   [ 1455.112068]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112097] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112138] INFO: task worker:20332 blocked for more than 120 seconds.
> >   [ 1455.112159]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112179] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112210] INFO: task worker:20333 blocked for more than 120 seconds.
> >   [ 1455.112231]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112242] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112282] INFO: task worker:20335 blocked for more than 120 seconds.
> >   [ 1455.112303]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> >   [ 1455.112332] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" 
> > disables this message.
> >   [ 1455.112372] INFO: task worker:20336 blocked for more than 120 seconds.
> >   [ 1455.112392]   Tainted: G L5.3.0-rc5-mdr-vanilla+ #1
> 
> There should be stack traces here, did they get lost or you omitted them
> for brevity?

Both, I suppose :) This kernel was booted with 'quiet' option 

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Al Viro  wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +   const char zeros[BUFFER_SIZE] = {};
> > +   while (s > 0) {
> > +   size_t n = min(s, sizeof(zeros));
> > +
> > +   if (__copy_to_user(p, zeros, n))
> > +   return -EFAULT;
> > +
> > +   p += n;
> > +   s -= n;
> > +   }
> > +   return 0;
> > +}
> 
> That's called clear_user().

Already switched, I didn't know about clear_user() -- I assumed it
would've been called bzero_user() or memzero_user() and didn't find it
when looking.
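
(For the record, the replacement is a one-liner; clear_user() returns
the number of bytes it could not zero:)

	/* was: __memzero_user(dst + size, rest) */
	if (clear_user(dst + size, rest))
		return -EFAULT;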

> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > +   const void *src, size_t ksize)
> > +{
> > +   size_t size = min(ksize, usize);
> > +   size_t rest = abs(ksize - usize);
> > +
> > +   if (unlikely(usize > PAGE_SIZE))
> > +   return -EFAULT;
> 
> Why?
> 
> > +   } else if (usize > ksize) {
> > +   if (__memzero_user(dst + size, rest))
> > +   return -EFAULT;
> > +   }
> > +   /* Copy the interoperable parts of the struct. */
> > +   if (__copy_to_user(dst, src, size))
> > +   return -EFAULT;
> 
> Why not simply clear_user() and copy_to_user()?

I'm not sure I understand what you mean -- are you asking why we need to
do memchr_inv(src + size, 0, rest) earlier?

> 
> > +int copy_struct_from_user(void *dst, size_t ksize,
> > + const void __user *src, size_t usize)
> > +{
> > +   size_t size = min(ksize, usize);
> > +   size_t rest = abs(ksize - usize);
> 
> Cute, but... you would be just as well without that 'rest' thing.

I would argue it's harder to mess up using "rest" compared to getting
"ksize - usize" and "usize - ksize" mixed up (and it's a bit more
readable).

> > +
> > +   if (unlikely(usize > PAGE_SIZE))
> > +   return -EFAULT;
> 
> Again, why?

As discussed in a sister thread, I will leave this in the callers
(though I would argue callers should always do some kind of sanity check
like this).
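
(Concretely, the bound just moves to each call site, roughly as clone3
already does -- sketch, with made-up caller variables kargs/uargs:)

	if (unlikely(usize > PAGE_SIZE))	/* caller-chosen policy */
		return -E2BIG;
	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
	if (err)
		return err;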

> 
> > +   if (unlikely(!access_ok(src, usize)))
> > +   return -EFAULT;
> 
> Why not simply copy_from_user() here?
> 
> > +   /* Deal with trailing bytes. */
> > +   if (usize < ksize)
> > +   memset(dst + size, 0, rest);
> > +   else if (usize > ksize) {
> > +   const void __user *addr = src + size;
> > +   char buffer[BUFFER_SIZE] = {};
> > +
> > +   while (rest > 0) {
> > +   size_t bufsize = min(rest, sizeof(buffer));
> > +
> > +   if (__copy_from_user(buffer, addr, bufsize))
> > +   return -EFAULT;
> > +   if (memchr_inv(buffer, 0, bufsize))
> > +   return -E2BIG;
> 
> Frankly, that looks like a candidate for is_all_zeroes_user().
> With the loop like above serving as a dumb default.  And on
> badly alighed address it _will_ be dumb.  Probably too much
> so - something like
>   if ((unsigned long)addr & 1) {
>   u8 v;
>   if (get_user(v, (__u8 __user *)addr))
>   return -EFAULT;
>   if (v)
>   return -E2BIG;
>   addr++;
>   }
>   if ((unsigned long)addr & 2) {
>   u16 v;
>   if (get_user(v, (__u16 __user *)addr))
>   return -EFAULT;
>   if (v)
>   return -E2BIG;
>   addr +=2;
>   }
>   if ((unsigned long)addr & 4) {
>   u32 v;
>   if (get_user(v, (__u32 __user *)addr))
>   return -EFAULT;
>   if (v)
>   return -E2BIG;
>   }
>   
> would be saner, and things like x86 could trivially add an
> asm variant - it's not hard.  Incidentally, memchr_inv() is
> an overkill in this case...

Why is memchr_inv() overkill?

But yes, breaking this out to an asm-generic is_all_zeroes_user()
wouldn't hurt -- and I'll put a cleaned-up version of the alignment
handling there too. Should I drop it in asm-generic/uaccess.h, or
somewhere else?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Al Viro
On Fri, Sep 06, 2019 at 05:56:18AM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Al Viro  wrote:
> > On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> > 
> > > Because every caller of that function right now has that limit set
> > > anyway iirc. So we can either remove it from here and place it back for
> > > the individual callers or leave it in the helper.
> > > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > > bound on the size (for a long time probably) or are you disagreeing with
> > > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > > bpf, and clone3 and in a few other places.
> > 
> > For a primitive that can be safely used with any size (OK, any within
> > the usual 2Gb limit)?  Why push the random policy into the place where
> > it doesn't belong?
> > 
> > Seriously, what's the point?  If they want to have a large chunk of
> > userland memory zeroed or checked for non-zeroes - why would that
> > be a problem?
> 
> Thinking about it some more, there isn't really any r/w amplification --
> so there isn't much to gain by passing giant structs. Though, if we are
> going to permit 2GB buffers, isn't that also an argument to use
> memchr_inv()? :P

I'm not sure I understand the last bit.  If you look at what copy_from_user()
does on misaligned source/destination, especially on architectures that
really, really do not like unaligned access...

Case in point: alpha (and it's not unusual in that respect).  What it boils
down to is
	copy bytes until the destination is aligned
	if source and destination are both aligned
		copy word by word
	else
		read word by word, storing the mix of two adjacent words
	copy the rest byte by byte

The unpleasant case (to and from having different remainders modulo 8) is
basically

	if (count >= 8) {
		u64 *aligned = (u64 *)(from & ~7);
		u64 *dest = (u64 *)to;
		int bitshift = (from & 7) * 8;
		u64 prev, next;

		prev = aligned[0];
		do {
			next = aligned[1];
			prev <<= bitshift;
			prev |= next >> (64 - bitshift);
			*dest++ = prev;
			aligned++;
			prev = next;
			from += 8;
			to += 8;
			count -= 8;
		} while (count >= 8);
	}

Now, mix that with "... and do memchr_inv() on the copy to find if we'd
copied any non-zeroes, nevermind where" and it starts looking really
ridiculous.

We should just read the fscking source, aligned down to word boundary
and check each word being read.  The first and the last ones - masked.
All there is to it.  On almost all architectures that'll work well
enough; s390 might want something more elaborate (there even word-by-word
copies are costly, but I'd suggest talking to them for details).

Something like bool all_zeroes_user(const void __user *p, size_t count)
would probably be a sane API...
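
A call site in copy_struct_from_user() would then collapse to something
like this (sketch only; a bool return can't distinguish a fault from a
non-zero byte, so see the calling-convention musings elsewhere in the
thread):

	if (usize > ksize) {
		/* all trailing user bytes must be zero */
		if (!all_zeroes_user(src + ksize, usize - ksize))
			return -E2BIG;
	}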


Re: [PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Nicholas Piggin
Christophe Leroy's on September 6, 2019 2:29 am:
> 
> 
> On 05/09/2019 at 14:35, Nicholas Piggin wrote:
>> -.Lsyscall_error_cont:
>> -ld  r7,_NIP(r1)
>>   BEGIN_FTR_SECTION
>>  stdcx.  r0,0,r1 /* to clear the reservation */
>>   END_FTR_SECTION_IFCLR(CPU_FTR_STCX_CHECKS_ADDRESS)
>> -andi.   r6,r8,MSR_PR
>> -ld  r4,_LINK(r1)
>>   
>> -kuap_check_amr r10, r11
>> +mtspr   SPRN_SRR0,r4
>> +mtspr   SPRN_SRR1,r5
> 
> That looks dangerous. Once you have modified SRR0 and SRR1, your 
> exception becomes unrecoverable, so this should be done as close as 
> possible to the rfi.

Worse than that even, when we set MSR[RI]=0 the exception becomes
unrecoverable, and that happens in the C code!

> Here you seem to do many things in between,
> including restoring some registers from the stack.

The code it replaces has a pretty large non-recoverable window as
well, interrupt entry has large ones particularly for KVM. It's an
unfortunate quirk of the architecture but it's not worth getting
too worried about or sacrificing too much performance for.
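
Schematically, the window in question looks like this (illustrative
only, not the actual exit path):

	mtspr	SPRN_SRR0, r4	/* from here until rfid, any exception  */
	mtspr	SPRN_SRR1, r5	/* or SLB/TLB miss clobbers SRR0/SRR1   */
	/* ... restore GPRs etc.: keep this window as short as possible */
	rfid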

I do have some later patches which improves this a lot, but that will
wait until after the interrupt return is also changed to C.

Thanks,
Nick



[PATCH] net/ibmvnic: free reset work of removed device from queue

2019-09-05 Thread Juliet Kim
Commit 36f1031c51a2 ("ibmvnic: Do not process reset during or after
device removal") made the change to exit reset if the driver has been
removed, but did not free the adapter's queued reset work items.

Ensure all reset work items are freed when breaking out of the loop early.

Fixes: 36f1031c51a2 ("ibmvnic: Do not process reset during or after device removal")

Signed-off-by: Juliet Kim 
---
 drivers/net/ethernet/ibm/ibmvnic.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index fa4bb940665c..6644cabc8e75 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -1985,7 +1985,10 @@ static void __ibmvnic_reset(struct work_struct *work)
while (rwi) {
	if (adapter->state == VNIC_REMOVING ||
-	    adapter->state == VNIC_REMOVED)
-		goto out;
+	    adapter->state == VNIC_REMOVED) {
+		kfree(rwi);
+		rc = EBUSY;
+		break;
+	}
 
if (adapter->force_reset_recovery) {
adapter->force_reset_recovery = false;
@@ -2011,7 +2014,7 @@ static void __ibmvnic_reset(struct work_struct *work)
netdev_dbg(adapter->netdev, "Reset failed\n");
free_all_rwi(adapter);
}
-out:
+
adapter->resetting = false;
if (we_lock_rtnl)
rtnl_unlock();
-- 
2.16.4



Re: [PATCH v2 3/4] powerpc/numa: Early request for home node associativity

2019-09-05 Thread Nathan Lynch
Hi Srikar,

Srikar Dronamraju  writes:
> Currently the kernel detects if it's running on a shared lpar platform
> and requests home node associativity before the scheduler sched_domains
> are setup. However between the time NUMA setup is initialized and the
> request for home node associativity, workqueue initializes its per node
> cpumask. The per node workqueue possible cpumask may turn invalid
> after home node associativity resulting in weird situations like
> workqueue possible cpumask being a subset of workqueue online cpumask.
>
> This can be fixed by requesting home node associativity earlier just
> before NUMA setup. However at the NUMA setup time, kernel may not be in
> a position to detect if its running on a shared lpar platform. So
> request for home node associativity and if the request fails, fallback
> on the device tree property.
>
> While here, fix a problem where of_node_put could be called even when
> of_get_cpu_node was not successful.

of_node_put() handles NULL arguments, so this should not be necessary.

> +static int vphn_get_nid(unsigned long cpu, bool get_hwid)

[...]

> +static int numa_setup_cpu(unsigned long lcpu, bool get_hwid)

[...]

> @@ -528,7 +561,7 @@ static int ppc_numa_cpu_prepare(unsigned int cpu)
>  {
>   int nid;
>  
> - nid = numa_setup_cpu(cpu);
> + nid = numa_setup_cpu(cpu, true);
>   verify_cpu_node_mapping(cpu, nid);
>   return 0;
>  }
> @@ -875,7 +908,7 @@ void __init mem_topology_setup(void)
>   reset_numa_cpu_lookup_table();
>  
>   for_each_present_cpu(cpu)
> - numa_setup_cpu(cpu);
> + numa_setup_cpu(cpu, false);
>  }

I'm open to other points of view here, but I would prefer two separate
functions, something like vphn_get_nid() for runtime and
vphn_get_nid_early() (which could be __init) for boot-time
initialization. Propagating a somewhat unexpressive boolean flag through
two levels of function calls in this code is unappealing...

Regardless, I have an annoying question :-) Isn't it possible that,
while Linux is calling vphn_get_nid() for each logical cpu in sequence,
the platform could change a virtual processor's node assignment,
potentially causing sibling threads to get different node assignments
and producing an incoherent topology (which then leads to sched domain
assertions etc)?

If so, I think more care is needed. The algorithm should make the vphn
call only once per cpu node, I think?


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Al Viro  wrote:
> On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:
> 
> > Because every caller of that function right now has that limit set
> > anyway iirc. So we can either remove it from here and place it back for
> > the individual callers or leave it in the helper.
> > Also, I'm really asking, why not? Is it unreasonable to have an upper
> > bound on the size (for a long time probably) or are you disagreeing with
> > PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> > bpf, and clone3 and in a few other places.
> 
> For a primitive that can be safely used with any size (OK, any within
> the usual 2Gb limit)?  Why push the random policy into the place where
> it doesn't belong?
> 
> Seriously, what's the point?  If they want to have a large chunk of
> userland memory zeroed or checked for non-zeroes - why would that
> be a problem?

Thinking about it some more, there isn't really any r/w amplification --
so there isn't much to gain by passing giant structs. Though, if we are
going to permit 2GB buffers, isn't that also an argument to use
memchr_inv()? :P

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH



signature.asc
Description: PGP signature


[PATCH v2 1/1] powerpc/pci: Fix pcibios_setup_device() ordering

2019-09-05 Thread Shawn Anastasio
Move PCI device setup from pcibios_add_device() and pcibios_fixup_bus() to
pcibios_bus_add_device(). This ensures that platform-specific DMA and IOMMU
setup occurs after the device has been registered in sysfs, which is a
requirement for IOMMU group assignment to work

This fixes IOMMU group assignment for hotplugged devices on pseries, where
the existing behavior results in IOMMU assignment before registration.

Thanks to Lukas Wunner  for the suggestion.

Signed-off-by: Shawn Anastasio 
---
 arch/powerpc/kernel/pci-common.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

diff --git a/arch/powerpc/kernel/pci-common.c b/arch/powerpc/kernel/pci-common.c
index f627e15bb43c..d119c77efb69 100644
--- a/arch/powerpc/kernel/pci-common.c
+++ b/arch/powerpc/kernel/pci-common.c
@@ -261,12 +261,6 @@ int pcibios_sriov_disable(struct pci_dev *pdev)
 
 #endif /* CONFIG_PCI_IOV */
 
-void pcibios_bus_add_device(struct pci_dev *pdev)
-{
-   if (ppc_md.pcibios_bus_add_device)
-   ppc_md.pcibios_bus_add_device(pdev);
-}
-
 static resource_size_t pcibios_io_size(const struct pci_controller *hose)
 {
 #ifdef CONFIG_PPC64
@@ -987,15 +981,17 @@ static void pcibios_setup_device(struct pci_dev *dev)
ppc_md.pci_irq_fixup(dev);
 }
 
-int pcibios_add_device(struct pci_dev *dev)
+void pcibios_bus_add_device(struct pci_dev *pdev)
 {
-   /*
-* We can only call pcibios_setup_device() after bus setup is complete,
-* since some of the platform specific DMA setup code depends on it.
-*/
-   if (dev->bus->is_added)
-   pcibios_setup_device(dev);
+   /* Perform platform-specific device setup */
+   pcibios_setup_device(pdev);
+
+   if (ppc_md.pcibios_bus_add_device)
+   ppc_md.pcibios_bus_add_device(pdev);
+}
 
+int pcibios_add_device(struct pci_dev *dev)
+{
 #ifdef CONFIG_PCI_IOV
if (ppc_md.pcibios_fixup_sriov)
ppc_md.pcibios_fixup_sriov(dev);
@@ -1037,9 +1033,6 @@ void pcibios_fixup_bus(struct pci_bus *bus)
 
/* Now fixup the bus bus */
pcibios_setup_bus_self(bus);
-
-   /* Now fixup devices on that bus */
-   pcibios_setup_bus_devices(bus);
 }
 EXPORT_SYMBOL(pcibios_fixup_bus);
 
-- 
2.20.1



[PATCH v2 0/1] Fix IOMMU setup for hotplugged devices on pseries

2019-09-05 Thread Shawn Anastasio
Changes from v1:
  - Remove pcibios_fixup_dev()
  - Remove pcibios_setup_bus_device() call in pcibios_fixup_bus() since
all device setup will now be handled in pcibios_bus_add_device()

On pseries QEMU guests, IOMMU setup for hotplugged PCI devices is currently
broken for all but the first device on a given bus. The culprit is an ordering
issue in the pseries hotplug path (via pci_rescan_bus()) which results in IOMMU
group assigment occuring before device registration in sysfs. This triggers
the following check in arch/powerpc/kernel/iommu.c:

/*
 * The sysfs entries should be populated before
 * binding IOMMU group. If sysfs entries isn't
 * ready, we simply bail.
 */
if (!device_is_registered(dev))
return -ENOENT;

This fails for hotplugged devices since the pcibios_add_device() call in the
pseries hotplug path (in pci_device_add()) occurs before device_add().
Since the IOMMU groups are set up in pcibios_add_device(), this means that a
sysfs entry will not yet be present and it will fail.

There is a special case that allows the first hotplugged device on a bus to
succeed, though. The powerpc pcibios_add_device() implementation will skip
initializing the device if bus setup is not yet complete.
Later, the pci core will call pcibios_fixup_bus() which will perform setup
for the first (and only) device on the bus and since it has already been
registered in sysfs, the IOMMU setup will succeed.

The current solution is to move all device setup to pcibios_bus_add_device()
which will occur after all devices have been registered.

Shawn Anastasio (1):
  powerpc/pci: Fix pcibios_setup_device() ordering

 arch/powerpc/kernel/pci-common.c | 25 +++++++++----------------
 1 file changed, 9 insertions(+), 16 deletions(-)

-- 
2.20.1



[PATCH v2] soc: fsl: dpio: Add support for QBMan ring bulk enqueue.

2019-09-05 Thread Youri Querry
The QBMan frame descriptor enqueuing is changed from array mode (a
single frame enqueue at a time) to bulk ring mode.

This new mode allows the enqueuing of multiple frames in one operation.
The original interface is kept, but now uses the bulk enqueue of one frame.
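
A caller-side sketch of the new bulk interface (xmit_burst() is a
made-up example built around the signature added below):

static int xmit_burst(struct dpaa2_io *d, u32 fqid,
		      const struct dpaa2_fd *fds, int nb)
{
	int done = 0;

	while (done < nb) {
		int ret = dpaa2_io_service_enqueue_multiple_fq(d, fqid,
						fds + done, nb - done);
		if (ret < 0)
			return ret;	/* -ENODEV: no DPIO service */
		if (!ret)
			cpu_relax();	/* 0 means EQCR busy: retry */
		done += ret;
	}
	return done;
}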

Signed-off-by: Youri Querry 
---
 drivers/soc/fsl/dpio/dpio-service.c |  69 +++-
 drivers/soc/fsl/dpio/qbman-portal.c | 772 
 drivers/soc/fsl/dpio/qbman-portal.h | 175 +++-
 3 files changed, 935 insertions(+), 81 deletions(-)

diff --git a/drivers/soc/fsl/dpio/dpio-service.c b/drivers/soc/fsl/dpio/dpio-service.c
index b9539ef..4eb53ee 100644
--- a/drivers/soc/fsl/dpio/dpio-service.c
+++ b/drivers/soc/fsl/dpio/dpio-service.c
@@ -1,7 +1,7 @@
 // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
 /*
  * Copyright 2014-2016 Freescale Semiconductor Inc.
- * Copyright 2016 NXP
+ * Copyright 2016-2019 NXP
  *
  */
 #include 
@@ -435,6 +435,69 @@ int dpaa2_io_service_enqueue_fq(struct dpaa2_io *d,
 EXPORT_SYMBOL(dpaa2_io_service_enqueue_fq);
 
 /**
+ * dpaa2_io_service_enqueue_multiple_fq() - Enqueue multiple frames
+ * to a frame queue using one fqid.
+ * @d: the given DPIO service.
+ * @fqid: the given frame queue id.
+ * @fd: the list of frame descriptors enqueued.
+ * @nb: number of frames to be enqueued
+ *
+ * Return the number of enqueued frames (0 if EQCR is busy)
+ * or -ENODEV if there is no dpio service.
+ */
+int dpaa2_io_service_enqueue_multiple_fq(struct dpaa2_io *d,
+   u32 fqid,
+   const struct dpaa2_fd *fd,
+   int nb)
+{
+   struct qbman_eq_desc ed;
+
+   d = service_select(d);
+   if (!d)
+   return -ENODEV;
+
+	qbman_eq_desc_clear(&ed);
+	qbman_eq_desc_set_no_orp(&ed, 0);
+	qbman_eq_desc_set_fq(&ed, fqid);
+
+	return qbman_swp_enqueue_multiple(d->swp, &ed, fd, 0, nb);
+}
+EXPORT_SYMBOL(dpaa2_io_service_enqueue_multiple_fq);
+
+/**
+ * dpaa2_io_service_enqueue_multiple_desc_fq() - Enqueue multiple frames
+ * to different frame queue using a list of fqids.
+ * @d: the given DPIO service.
+ * @fqid: the given list of frame queue ids.
+ * @fd: the list of frame descriptors enqueued.
+ * @nb: number of frames to be enqueued
+ *
+ * Return the number of enqueued frames (0 if EQCR is busy)
+ * or -ENODEV if there is no dpio service.
+ */
+int dpaa2_io_service_enqueue_multiple_desc_fq(struct dpaa2_io *d,
+   u32 *fqid,
+   const struct dpaa2_fd *fd,
+   int nb)
+{
+   int i;
+   struct qbman_eq_desc_min ed[32];
+
+   d = service_select(d);
+   if (!d)
+   return -ENODEV;
+
+   for (i = 0; i < nb; i++) {
+		qbman_eq_desc_min_clear(&ed[i]);
+		qbman_eq_desc_set_no_orp_min(&ed[i], 0);
+		qbman_eq_desc_set_min_fq(&ed[i], fqid[i]);
+	}
+
+	return qbman_swp_enqueue_multiple_desc(d->swp, &ed[0], fd, nb);
+}
+EXPORT_SYMBOL(dpaa2_io_service_enqueue_multiple_desc_fq);
+
+/**
  * dpaa2_io_service_enqueue_qd() - Enqueue a frame to a QD.
  * @d: the given DPIO service.
  * @qdid: the given queuing destination id.
@@ -528,7 +591,7 @@ EXPORT_SYMBOL_GPL(dpaa2_io_service_acquire);
 
 /**
  * dpaa2_io_store_create() - Create the dma memory storage for dequeue result.
- * @max_frames: the maximum number of dequeued result for frames, must be <= 16.
+ * @max_frames: the maximum number of dequeued result for frames, must be <= 32.
  * @dev: the device to allow mapping/unmapping the DMAable region.
  *
  * The size of the storage is "max_frames*sizeof(struct dpaa2_dq)".
@@ -543,7 +606,7 @@ struct dpaa2_io_store *dpaa2_io_store_create(unsigned int max_frames,
struct dpaa2_io_store *ret;
size_t size;
 
-   if (!max_frames || (max_frames > 16))
+   if (!max_frames || (max_frames > 32))
return NULL;
 
ret = kmalloc(sizeof(*ret), GFP_KERNEL);
diff --git a/drivers/soc/fsl/dpio/qbman-portal.c b/drivers/soc/fsl/dpio/qbman-portal.c
index c66f5b7..0ed2c8f 100644
--- a/drivers/soc/fsl/dpio/qbman-portal.c
+++ b/drivers/soc/fsl/dpio/qbman-portal.c
@@ -1,13 +1,14 @@
 // SPDX-License-Identifier: (GPL-2.0+ OR BSD-3-Clause)
 /*
  * Copyright (C) 2014-2016 Freescale Semiconductor, Inc.
- * Copyright 2016 NXP
+ * Copyright 2016-2019 NXP
  *
  */
 
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include "qbman-portal.h"
@@ -28,6 +29,7 @@
 
 /* CINH register offsets */
 #define QBMAN_CINH_SWP_EQCR_PI  0x800
+#define QBMAN_CINH_SWP_EQCR_CI 0x840
 #define QBMAN_CINH_SWP_EQAR0x8c0
 #define QBMAN_CINH_SWP_CR_RT0x900
 #define QBMAN_CINH_SWP_VDQCR_RT 0x940
@@ -51,6 +53,8 @@
 #define QBMAN_CENA_SWP_CR  0x600
 #define QBMAN_CENA_SWP_RR(vb)  (0x700 + ((u32)(vb) >> 1))
 #define QBMAN_CENA_SWP_VDQCR   0x780
+#define QBMAN_CENA_SWP_EQCR_CI 0x840
+#define QBMAN_CENA_SWP_EQCR_CI_MEMBACK 

Re: [PATCH 0/2] Fix IOMMU setup for hotplugged devices on pseries

2019-09-05 Thread Shawn Anastasio

On 9/5/19 4:38 AM, Lukas Wunner wrote:

On Wed, Sep 04, 2019 at 11:22:13PM -0500, Shawn Anastasio wrote:

If anybody has more insight or a better way to fix this, please let me know.


Have you considered moving the invocation of pcibios_setup_device()
to pcibios_bus_add_device()?

The latter is called from pci_bus_add_device() in drivers/pci/bus.c.
At this point device_add() has been called, so the device exists in
sysfs.

Basically when adding a PCI device, the order is:

* pci_device_add() populates struct pci_dev, calls device_add(),
  binding the device to a driver is prevented
* after pci_device_add() has been called for all discovered devices,
  resources are allocated
* pci_bus_add_device() is called for each device,
  calls pcibios_bus_add_device() and binds the device to a driver


Thank you, this is exactly what I was looking for! Just tested and
this seems to work perfectly. I'll go ahead and submit a v2 that
does this instead.

Thanks again,
Shawn


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Al Viro
On Thu, Sep 05, 2019 at 08:23:03PM +0200, Christian Brauner wrote:

> Because every caller of that function right now has that limit set
> anyway iirc. So we can either remove it from here and place it back for
> the individual callers or leave it in the helper.
> Also, I'm really asking, why not? Is it unreasonable to have an upper
> bound on the size (for a long time probably) or are you disagreeing with
> PAGE_SIZE being used? PAGE_SIZE limit is currently used by sched, perf,
> bpf, and clone3 and in a few other places.

For a primitive that can be safely used with any size (OK, any within
the usual 2Gb limit)?  Why push the random policy into the place where
it doesn't belong?

Seriously, what's the point?  If they want to have a large chunk of
userland memory zeroed or checked for non-zeroes - why would that
be a problem?


[PATCH 3/3] powerpc: Use ftrace_graph_ret_addr() when unwinding

2019-09-05 Thread Naveen N. Rao
With support for HAVE_FUNCTION_GRAPH_RET_ADDR_PTR,
ftrace_graph_ret_addr() provides more robust unwinding when function
graph is in use. Update show_stack() to use the same.

With dump_stack() added to sysrq_sysctl_handler(), before this patch:
  root@(none):/sys/kernel/debug/tracing# cat /proc/sys/kernel/sysrq
  CPU: 0 PID: 218 Comm: cat Not tainted 5.3.0-rc7-00868-g8453ad4a078c-dirty #20
  Call Trace:
  [c000d1e13c30] [c006ab98] return_to_handler+0x0/0x40 (dump_stack+0xe8/0x164) (unreliable)
  [c000d1e13c80] [c0145680] sysrq_sysctl_handler+0x48/0xb8
  [c000d1e13cd0] [c006ab98] return_to_handler+0x0/0x40 (proc_sys_call_handler+0x274/0x2a0)
  [c000d1e13d60] [c006ab98] return_to_handler+0x0/0x40 (return_to_handler+0x0/0x40)
  [c000d1e13d80] [c006ab98] return_to_handler+0x0/0x40 (__vfs_read+0x3c/0x70)
  [c000d1e13dd0] [c006ab98] return_to_handler+0x0/0x40 (vfs_read+0xb8/0x1b0)
  [c000d1e13e20] [c006ab98] return_to_handler+0x0/0x40 (ksys_read+0x7c/0x140)

After this patch:
  Call Trace:
  [c000d1e33c30] [c006ab58] return_to_handler+0x0/0x40 (dump_stack+0xe8/0x164) (unreliable)
  [c000d1e33c80] [c0145680] sysrq_sysctl_handler+0x48/0xb8
  [c000d1e33cd0] [c006ab58] return_to_handler+0x0/0x40 (proc_sys_call_handler+0x274/0x2a0)
  [c000d1e33d60] [c006ab58] return_to_handler+0x0/0x40 (__vfs_read+0x3c/0x70)
  [c000d1e33d80] [c006ab58] return_to_handler+0x0/0x40 (vfs_read+0xb8/0x1b0)
  [c000d1e33dd0] [c006ab58] return_to_handler+0x0/0x40 (ksys_read+0x7c/0x140)
  [c000d1e33e20] [c006ab58] return_to_handler+0x0/0x40 (system_call+0x5c/0x68)

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/kernel/process.c | 19 ++++++-------------
 1 file changed, 6 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 24621e7e5033..f289bdd2b562 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2047,10 +2047,8 @@ void show_stack(struct task_struct *tsk, unsigned long *stack)
int count = 0;
int firstframe = 1;
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   struct ftrace_ret_stack *ret_stack;
-   extern void return_to_handler(void);
-   unsigned long rth = (unsigned long)return_to_handler;
-   int curr_frame = 0;
+   unsigned long ret_addr;
+   int ftrace_idx = 0;
 #endif
 
if (tsk == NULL)
@@ -2079,15 +2077,10 @@ void show_stack(struct task_struct *tsk, unsigned long *stack)
if (!firstframe || ip != lr) {
printk("["REG"] ["REG"] %pS", sp, ip, (void *)ip);
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
-   if ((ip == rth) && curr_frame >= 0) {
-   ret_stack = ftrace_graph_get_ret_stack(current,
- curr_frame++);
-   if (ret_stack)
-   pr_cont(" (%pS)",
-   (void *)ret_stack->ret);
-   else
-   curr_frame = -1;
-   }
+			ret_addr = ftrace_graph_ret_addr(current,
+						&ftrace_idx, ip, stack);
+   if (ret_addr != ip)
+   pr_cont(" (%pS)", (void *)ret_addr);
 #endif
if (firstframe)
pr_cont(" (unreliable)");
-- 
2.23.0



[PATCH 2/3] powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR

2019-09-05 Thread Naveen N. Rao
This associates entries in the ftrace_ret_stack with corresponding stack
frames, enabling more robust stack unwinding. Also update the only user
of ftrace_graph_ret_addr() to pass the stack pointer.

Signed-off-by: Naveen N. Rao 
---
 arch/powerpc/include/asm/asm-prototypes.h  | 3 ++-
 arch/powerpc/include/asm/ftrace.h  | 2 ++
 arch/powerpc/kernel/stacktrace.c   | 2 +-
 arch/powerpc/kernel/trace/ftrace.c | 5 +++--
 arch/powerpc/kernel/trace/ftrace_32.S  | 1 +
 arch/powerpc/kernel/trace/ftrace_64_mprofile.S | 1 +
 arch/powerpc/kernel/trace/ftrace_64_pg.S   | 1 +
 7 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/asm-prototypes.h b/arch/powerpc/include/asm/asm-prototypes.h
index 49196d35e3bb..8561498e653c 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -134,7 +134,8 @@ extern int __ucmpdi2(u64, u64);
 
 /* tracing */
 void _mcount(void);
-unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip);
+unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
+   unsigned long sp);
 
 void pnv_power9_force_smt4_catch(void);
 void pnv_power9_force_smt4_release(void);
diff --git a/arch/powerpc/include/asm/ftrace.h b/arch/powerpc/include/asm/ftrace.h
index 3dfb80b86561..f54a08a2cd70 100644
--- a/arch/powerpc/include/asm/ftrace.h
+++ b/arch/powerpc/include/asm/ftrace.h
@@ -8,6 +8,8 @@
 #define MCOUNT_ADDR((unsigned long)(_mcount))
 #define MCOUNT_INSN_SIZE   4 /* sizeof mcount call */
 
+#define HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
+
 #ifdef __ASSEMBLY__
 
 /* Based off of objdump optput from glibc */
diff --git a/arch/powerpc/kernel/stacktrace.c b/arch/powerpc/kernel/stacktrace.c
index 1e2276963f6d..e2a46cfed5fd 100644
--- a/arch/powerpc/kernel/stacktrace.c
+++ b/arch/powerpc/kernel/stacktrace.c
@@ -182,7 +182,7 @@ static int __save_stack_trace_tsk_reliable(struct task_struct *tsk,
 * FIXME: IMHO these tests do not belong in
 * arch-dependent code, they are generic.
 */
-	ip = ftrace_graph_ret_addr(tsk, &graph_idx, ip, NULL);
+	ip = ftrace_graph_ret_addr(tsk, &graph_idx, ip, stack);
 #ifdef CONFIG_KPROBES
/*
 * Mark stacktraces with kretprobed functions on them
diff --git a/arch/powerpc/kernel/trace/ftrace.c b/arch/powerpc/kernel/trace/ftrace.c
index be1ca98fce5c..7ea0ca044b65 100644
--- a/arch/powerpc/kernel/trace/ftrace.c
+++ b/arch/powerpc/kernel/trace/ftrace.c
@@ -944,7 +944,8 @@ int ftrace_disable_ftrace_graph_caller(void)
  * Hook the return address and push it in the stack of return addrs
  * in current thread info. Return the address we want to divert to.
  */
-unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip)
+unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip,
+   unsigned long sp)
 {
unsigned long return_hooker;
 
@@ -956,7 +957,7 @@ unsigned long prepare_ftrace_return(unsigned long parent, unsigned long ip)
 
return_hooker = ppc_function_entry(return_to_handler);
 
-   if (!function_graph_enter(parent, ip, 0, NULL))
+   if (!function_graph_enter(parent, ip, 0, (unsigned long *)sp))
parent = return_hooker;
 out:
return parent;
diff --git a/arch/powerpc/kernel/trace/ftrace_32.S b/arch/powerpc/kernel/trace/ftrace_32.S
index 183f608efb81..e023ae59c429 100644
--- a/arch/powerpc/kernel/trace/ftrace_32.S
+++ b/arch/powerpc/kernel/trace/ftrace_32.S
@@ -50,6 +50,7 @@ _GLOBAL(ftrace_stub)
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 _GLOBAL(ftrace_graph_caller)
+	addi	r5, r1, 48
 	/* load r4 with local address */
 	lwz	r4, 44(r1)
 	subi	r4, r4, MCOUNT_INSN_SIZE
diff --git a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
index 74acbf16a666..f9fd5f743eba 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_mprofile.S
@@ -294,6 +294,7 @@ _GLOBAL(ftrace_graph_caller)
std r2, 24(r1)
ld  r2, PACATOC(r13)/* get kernel TOC in r2 */
 
+	addi	r5, r1, 112
mfctr   r4  /* ftrace_caller has moved local addr here */
std r4, 40(r1)
mflrr3  /* ftrace_caller has restored LR from stack */
diff --git a/arch/powerpc/kernel/trace/ftrace_64_pg.S b/arch/powerpc/kernel/trace/ftrace_64_pg.S
index e41a7d13c99c..6708e24db0ab 100644
--- a/arch/powerpc/kernel/trace/ftrace_64_pg.S
+++ b/arch/powerpc/kernel/trace/ftrace_64_pg.S
@@ -41,6 +41,7 @@ _GLOBAL(ftrace_stub)
 
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
 _GLOBAL(ftrace_graph_caller)
+	addi	r5, r1, 112
 	/* load r4 with local address */
 	ld	r4, 128(r1)
subi

[PATCH 1/3] ftrace: Look up the address of return_to_handler() using helpers

2019-09-05 Thread Naveen N. Rao
This ensures that we use the right address on architectures that use
function descriptors.
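
On ABIs with function descriptors (e.g. ppc64 ELFv1), a function symbol
resolves to a descriptor rather than to the code itself, so comparing a
saved return address against the raw symbol never matches. Illustratively
(the real layout lives in the arch headers):

struct func_desc {
	unsigned long addr;	/* the actual text/entry address */
	unsigned long toc;	/* TOC pointer loaded for the callee */
	unsigned long env;	/* environment pointer, unused by C */
};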

Signed-off-by: Naveen N. Rao 
---
 kernel/trace/fgraph.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/trace/fgraph.c b/kernel/trace/fgraph.c
index 8dfd5021b933..7950a0356042 100644
--- a/kernel/trace/fgraph.c
+++ b/kernel/trace/fgraph.c
@@ -276,7 +276,7 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
int index = task->curr_ret_stack;
int i;
 
-	if (ret != (unsigned long)return_to_handler)
+	if (ret != (unsigned long)dereference_kernel_function_descriptor(return_to_handler))
 		return ret;
 
if (index < 0)
@@ -294,7 +294,7 @@ unsigned long ftrace_graph_ret_addr(struct task_struct *task, int *idx,
 {
int task_idx;
 
-	if (ret != (unsigned long)return_to_handler)
+	if (ret != (unsigned long)dereference_kernel_function_descriptor(return_to_handler))
 		return ret;
 
task_idx = task->curr_ret_stack;
-- 
2.23.0



[PATCH 0/3] powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR

2019-09-05 Thread Naveen N. Rao
Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR for more robust stack unwinding 
when function graph tracer is in use. Convert powerpc show_stack() to 
use ftrace_graph_ret_addr() for better stack unwinding.

- Naveen

Naveen N. Rao (3):
  ftrace: Look up the address of return_to_handler() using helpers
  powerpc/ftrace: Enable HAVE_FUNCTION_GRAPH_RET_ADDR_PTR
  powerpc: Use ftrace_graph_ret_addr() when unwinding

 arch/powerpc/include/asm/asm-prototypes.h |  3 ++-
 arch/powerpc/include/asm/ftrace.h |  2 ++
 arch/powerpc/kernel/process.c | 19 ++++++-------------
 arch/powerpc/kernel/stacktrace.c  |  2 +-
 arch/powerpc/kernel/trace/ftrace.c|  5 +++--
 arch/powerpc/kernel/trace/ftrace_32.S |  1 +
 .../powerpc/kernel/trace/ftrace_64_mprofile.S |  1 +
 arch/powerpc/kernel/trace/ftrace_64_pg.S  |  1 +
 kernel/trace/fgraph.c |  4 ++--
 9 files changed, 19 insertions(+), 19 deletions(-)

-- 
2.23.0



Re: [PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Christophe Leroy




On 05/09/2019 at 19:46, Michal Suchánek wrote:

On Thu, 5 Sep 2019 15:25:31 +
Christophe Leroy  wrote:


On 09/05/2019 12:35 PM, Nicholas Piggin wrote:

System call entry and particularly exit code is beyond the limit of what
is reasonable to implement in asm.

This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.

Null syscall test is about 5% faster after this patch, because the exit
work is handled under local_irq_disable, and the hard mask and pending
interrupt replay is handled after that, which avoids games with MSR.

Signed-off-by: Nicholas Piggin 


Cannot apply it on latest powerpc/merge. On what does that apply ?


The v2 series had 4 patches so presumably the previous 3 are missing
here.



Yes indeed, the first two patches are already merged, so applying patch
3 of that series allows this patch to apply.


Christophe


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Al Viro
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}

That's called clear_user().

> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;

Why?

> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;

Why not simply clear_user() and copy_to_user()?

> +int copy_struct_from_user(void *dst, size_t ksize,
> +   const void __user *src, size_t usize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);

Cute, but... you would be just as well without that 'rest' thing.

> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;

Again, why?

> + if (unlikely(!access_ok(src, usize)))
> + return -EFAULT;

Why not simply copy_from_user() here?

> + /* Deal with trailing bytes. */
> + if (usize < ksize)
> + memset(dst + size, 0, rest);
> + else if (usize > ksize) {
> + const void __user *addr = src + size;
> + char buffer[BUFFER_SIZE] = {};
> +
> + while (rest > 0) {
> + size_t bufsize = min(rest, sizeof(buffer));
> +
> + if (__copy_from_user(buffer, addr, bufsize))
> + return -EFAULT;
> + if (memchr_inv(buffer, 0, bufsize))
> + return -E2BIG;

Frankly, that looks like a candidate for is_all_zeroes_user().
With the loop like above serving as a dumb default.  And on
badly alighed address it _will_ be dumb.  Probably too much
so - something like
if ((unsigned long)addr & 1) {
u8 v;
if (get_user(v, (__u8 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr++;
}
if ((unsigned long)addr & 2) {
u16 v;
if (get_user(v, (__u16 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
addr +=2;
}
if ((unsigned long)addr & 4) {
u32 v;
if (get_user(v, (__u32 __user *)addr))
return -EFAULT;
if (v)
return -E2BIG;
}

would be saner, and things like x86 could trivially add an
asm variant - it's not hard.  Incidentally, memchr_inv() is
an overkill in this case...


Re: [PATCH 0/2] Fix IOMMU setup for hotplugged devices on pseries

2019-09-05 Thread Shawn Anastasio

On 9/5/19 4:08 AM, Alexey Kardashevskiy wrote:

I just tried hotplugging 3 virtio-net devices into a guest system with a
v5.2 kernel and it seems to be working (i.e. BARs mapped, a driver is bound):



root@le-dbg:~# lspci -v | egrep -i '(virtio|Memory)'
00:00.0 Ethernet controller: Red Hat, Inc Virtio network device
        Memory at 20008004 (32-bit, non-prefetchable) [size=4K]
        Memory at 2100 (64-bit, prefetchable) [size=16K]
        Kernel driver in use: virtio-pci
00:01.0 Ethernet controller: Red Hat, Inc Virtio network device
        Memory at 200080041000 (32-bit, non-prefetchable) [size=4K]
        Memory at 21004000 (64-bit, prefetchable) [size=16K]
        Kernel driver in use: virtio-pci
00:02.0 Ethernet controller: Red Hat, Inc Virtio network device
        Memory at 200080042000 (32-bit, non-prefetchable) [size=4K]
        Memory at 21008000 (64-bit, prefetchable) [size=16K]
        Kernel driver in use: virtio-pci

Can you explain in detail what you are doing exactly and what is failing 
and what qemu/guest kernel/guest distro is used? Thanks,


Sure. I'm on host kernel 5.2.8, guest on 5.3-rc7 (also tested on 5.1.16)
and I'm hotplugging ivshmem devices to a separate spapr-pci-host-bridge
defined as follows:

-device spapr-pci-host-bridge,index=1,id=pci.1

Device hotplug and BAR assignment does work, but IOMMU group assignment
seems to fail. This is evidenced by the kernel log which shows the
following message for the first device but not the second:

[  136.849448] pci 0001:00:00.0: Adding to iommu group 1

Trying to bind the second device to vfio-pci as a result of this
fails:

[  471.691948] vfio-pci: probe of 0001:00:01.0 failed with error -22

I traced that failure to a call to iommu_group_get() which returns
NULL for the second device. I then traced that back to the ordering
issue I described.

For your second and third virtio-net devices, was the
"Adding to iommu group N" kernel message printed?


Re: [PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Michal Suchánek
On Thu, 5 Sep 2019 15:25:31 +
Christophe Leroy  wrote:

> On 09/05/2019 12:35 PM, Nicholas Piggin wrote:
> > System call entry and particularly exit code is beyond the limit of what
> > is reasonable to implement in asm.
> > 
> > This conversion moves all conditional branches out of the asm code,
> > except for the case that all GPRs should be restored at exit.
> > 
> > Null syscall test is about 5% faster after this patch, because the exit
> > work is handled under local_irq_disable, and the hard mask and pending
> > interrupt replay is handled after that, which avoids games with MSR.
> > 
> > Signed-off-by: Nicholas Piggin   
> 
> Cannot apply it on latest powerpc/merge. On what does that apply ?

The v2 series had 4 patches so presumably the previous 3 are missing
here.

Thanks

Michal


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Gerald Schaefer
On Thu, 5 Sep 2019 14:48:14 +0530
Anshuman Khandual  wrote:

> > [...]  
> >> +
> >> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
> >> +static void pud_clear_tests(pud_t *pudp)
> >> +{
> >> +  memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
> >> +  pud_clear(pudp);
> >> +  WARN_ON(!pud_none(READ_ONCE(*pudp)));
> >> +}  
> > 
> > For pgd/p4d/pud_clear(), we only clear if the page table level is present
> > and not folded. The memset() here overwrites the table type bits, so
> > pud_clear() will not clear anything on s390 and the pud_none() check will
> > fail.
> > Would it be possible to OR a (larger) random value into the table, so that
> > the lower 12 bits would be preserved?  
> 
> So the suggestion is instead of doing memset() on entry with RANDOM_NZVALUE,
> it should OR a large random value preserving lower 12 bits. Hmm, this should
> still do the trick for other platforms, they just need non zero value. So on
> s390, the lower 12 bits on the page table entry already has valid value while
> entering this function which would make sure that pud_clear() really does
> clear the entry ?

Yes, in theory the table entry on s390 would have the type set in the last
4 bits, so preserving those would be enough. If it does not conflict with
others, I would still suggest preserving all 12 bits since those would contain
arch-specific flags in general, just to be sure. For s390, the pte/pmd tests
would also work with the memset, but for consistency I think the same logic
should be used in all pxd_clear_tests.
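
To put the suggestion in code form - just a sketch, with
RANDOM_ORVALUE as a made-up mask that leaves the lower 12 bits of the
existing entry untouched:

	#define RANDOM_ORVALUE	GENMASK(BITS_PER_LONG - 1, 12)

	static void pud_clear_tests(pud_t *pudp)
	{
		pud_t pud = READ_ONCE(*pudp);

		/* OR in garbage, but keep the arch bits in the low 12
		 * bits (e.g. the table type on s390) intact */
		pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
		WRITE_ONCE(*pudp, pud);
		pud_clear(pudp);
		WARN_ON(!pud_none(READ_ONCE(*pudp)));
	}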

However, there is another issue on s390 which will make this only work
for pud_clear_tests(), and not for the p4d/pgd_tests. The problem is that
mm_alloc() will only give you a 3-level page table initially on s390.
This means that pudp == p4dp == pgdp, and so the p4d/pgd_tests will
both see the pud level (of course this also affects other tests).

Not sure yet how to fix this, i.e. how to initialize/update the page table
to 5 levels. We can handle 5 level page tables, and it would be good if
all levels could be tested, but using mm_alloc() to establish the page
tables might not work on s390. One option could be to provide an arch-hook
or weak function to allocate/initialize the mm.
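(For instance a __weak default that s390 could then override - the
function name here is entirely made up:)

	struct mm_struct * __weak pgtable_debug_alloc_mm(void)
	{
		return mm_alloc();
	}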

IIUC, the (dummy) mm is really only needed to provide an mm->pgd as starting
point, right?

Regards,
Gerald



Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Peter Zijlstra  wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra  wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +
> > > > +   while (rest > 0) {
> > > > +   size_t bufsize = min(rest, sizeof(buffer));
> > > > +
> > > > +   if (__copy_from_user(buffer, addr, bufsize))
> > > > +   return -EFAULT;
> > > > +   if (memchr_inv(buffer, 0, bufsize))
> > > > +   return -E2BIG;
> > > > +
> > > > +   addr += bufsize;
> > > > +   rest -= bufsize;
> > > > +   }
> > > 
> > > The perf implementation uses get_user(); but if that is too slow, surely
> > > we can do something with uaccess_try() here?
> > 
> > Is there a non-x86-specific way to do that (unless I'm mistaken only x86
> > has uaccess_try() or the other *_try() wrappers)? The main "performance
> > improvement" (if you can even call it that) is that we use memchr_inv()
> > which finds non-matching characters more efficiently than just doing a
> > loop.
> 
> Oh, you're right, that's x86 only :/

Though, I just had an idea -- am I wrong to think that the following
would work just as well (without the need for an intermediate buffer)?

   if (memchr_inv((const char __force *) src + size, 0, rest))
 return -E2BIG;

Or is this type of thing very much frowned upon? What if it was a
separate memchr_inv_user() instead -- I feel as though there's not a
strong argument for needing to use a buffer when we're single-passing
the __user buffer and doing a basic boolean check.

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Christophe Leroy




Le 05/09/2019 à 14:35, Nicholas Piggin a écrit :

System call entry and particularly exit code is beyond the limit of what
is reasonable to implement in asm.

This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.

Null syscall test is about 5% faster after this patch, because the exit
work is handled under local_irq_disable, and the hard mask and pending
interrupt replay is handled after that, which avoids games with MSR.

Signed-off-by: Nicholas Piggin 

v3:
- Fix !KUAP build [mpe]
- Fix BookE build/boot [mpe]
- Don't trace irqs with MSR[RI]=0
- Don't allow syscall_exit_prepare to be ftraced, because function
   graph tracing which traces exits barfs after the IRQ state is
   prepared for kernel exit.
- Fix BE syscall table to use normal function descriptors now that they
   are called from C.
- Comment syscall_exit_prepare.
---
  arch/powerpc/include/asm/asm-prototypes.h |  11 -
  .../powerpc/include/asm/book3s/64/kup-radix.h |  14 +-
  arch/powerpc/include/asm/cputime.h|  24 ++
  arch/powerpc/include/asm/hw_irq.h |   4 +
  arch/powerpc/include/asm/ptrace.h |   3 +
  arch/powerpc/include/asm/signal.h |   3 +
  arch/powerpc/include/asm/switch_to.h  |   5 +
  arch/powerpc/include/asm/time.h   |   3 +
  arch/powerpc/kernel/Makefile  |   3 +-
  arch/powerpc/kernel/entry_64.S| 337 +++---
  arch/powerpc/kernel/signal.h  |   2 -
  arch/powerpc/kernel/syscall_64.c  | 195 ++
  arch/powerpc/kernel/systbl.S  |   9 +-
  13 files changed, 300 insertions(+), 313 deletions(-)
  create mode 100644 arch/powerpc/kernel/syscall_64.c



[...]


diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index 5a3e0b5c9ad1..15bc2a872a76 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -63,12 +63,6 @@ exception_marker:
  
  	.globl system_call_common

  system_call_common:
-#ifdef CONFIG_PPC_TRANSACTIONAL_MEM
-BEGIN_FTR_SECTION
-   extrdi. r10, r12, 1, (63-MSR_TS_T_LG) /* transaction active? */
-   bne .Ltabort_syscall
-END_FTR_SECTION_IFSET(CPU_FTR_TM)
-#endif
mr  r10,r1
ld  r1,PACAKSAVE(r13)
std r10,0(r1)
@@ -76,350 +70,111 @@ END_FTR_SECTION_IFSET(CPU_FTR_TM)
std r12,_MSR(r1)
std r0,GPR0(r1)
std r10,GPR1(r1)
+   std r2,GPR2(r1)
  #ifdef CONFIG_PPC_FSL_BOOK3E
  START_BTB_FLUSH_SECTION
BTB_FLUSH(r10)
  END_BTB_FLUSH_SECTION
  #endif
-   ACCOUNT_CPU_USER_ENTRY(r13, r10, r11)
-   std r2,GPR2(r1)
+   ld  r2,PACATOC(r13)
+   mfcrr12
+   li  r11,0
+   /* Can we avoid saving r3-r8 in common case? */
std r3,GPR3(r1)
-   mfcrr2
std r4,GPR4(r1)
std r5,GPR5(r1)
std r6,GPR6(r1)
std r7,GPR7(r1)
std r8,GPR8(r1)
-   li  r11,0
+   /* Zero r9-r12, this should only be required when restoring all GPRs */
std r11,GPR9(r1)
std r11,GPR10(r1)
std r11,GPR11(r1)
std r11,GPR12(r1)
-   std r11,_XER(r1)
-   std r11,_CTR(r1)
std r9,GPR13(r1)
SAVE_NVGPRS(r1)
+   std r11,_XER(r1)
+   std r11,_CTR(r1)
mflrr10
+
/*
 * This clears CR0.SO (bit 28), which is the error indication on
 * return from this system call.
 */
-   rldimi  r2,r11,28,(63-28)
+   rldimi  r12,r11,28,(63-28)
li  r11,0xc00
std r10,_LINK(r1)
std r11,_TRAP(r1)
+   std r12,_CCR(r1)
std r3,ORIG_GPR3(r1)
-   std r2,_CCR(r1)
-   ld  r2,PACATOC(r13)
-   addir9,r1,STACK_FRAME_OVERHEAD
+   addir10,r1,STACK_FRAME_OVERHEAD
ld  r11,exception_marker@toc(r2)
-   std r11,-16(r9) /* "regshere" marker */
-
-   kuap_check_amr r10, r11
-
-#if defined(CONFIG_VIRT_CPU_ACCOUNTING_NATIVE) && defined(CONFIG_PPC_SPLPAR)
-BEGIN_FW_FTR_SECTION
-   /* see if there are any DTL entries to process */
-   ld  r10,PACALPPACAPTR(r13)  /* get ptr to VPA */
-   ld  r11,PACA_DTL_RIDX(r13)  /* get log read index */
-   addir10,r10,LPPACA_DTLIDX
-   LDX_BE  r10,0,r10   /* get log write index */
-   cmpdr11,r10
-   beq+33f
-   bl  accumulate_stolen_time
-   REST_GPR(0,r1)
-   REST_4GPRS(3,r1)
-   REST_2GPRS(7,r1)
-   addir9,r1,STACK_FRAME_OVERHEAD
-33:
-END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
-#endif /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE && CONFIG_PPC_SPLPAR */
-
-   /*
-* A syscall should always be called with interrupts enabled
-* so we just unconditionally hard-enable here. When some kind
-* of irq tracing is used, 

Re: [PATCH v4 02/16] powerpc/pseries: Introduce option to build secure virtual machines

2019-09-05 Thread Thiago Jung Bauermann


Hi Michael,

Michael Ellerman  writes:

> Thiago Jung Bauermann  writes:
>> Michael Ellerman  writes:
>>> On Tue, 2019-08-20 at 02:13:12 UTC, Thiago Jung Bauermann wrote:
 Introduce CONFIG_PPC_SVM to control support for secure guests and include
 Ultravisor-related helpers when it is selected

 Signed-off-by: Thiago Jung Bauermann 
>>>
>>> Patch 2-14 & 16 applied to powerpc next, thanks.
>>>
>>> https://git.kernel.org/powerpc/c/136bc0397ae21dbf63ca02e5775ad353a479cd2f
>>
>> Thank you very much!
>
> No worries. I meant to say, there were some minor differences between
> your patch 15 adding the documentation and Claudio's version. If you
> want those differences applied please send me an incremental patch.

Thanks for pointing it out. There's no need. Claudio's version is the
canonical one. The differences are because I had a slightly older
version at the time I posted my patches.

--
Thiago Jung Bauermann
IBM Linux Technology Center


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Peter Zijlstra  wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra  wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +/**
> > > > + * copy_struct_to_user: copy a struct to user space
> > > > + * @dst:   Destination address, in user space.
> > > > + * @usize: Size of @dst struct.
> > > > + * @src:   Source address, in kernel space.
> > > > + * @ksize: Size of @src struct.
> > > > + *
> > > > + * Copies a struct from kernel space to user space, in a way that 
> > > > guarantees
> > > > + * backwards-compatibility for struct syscall arguments (as long as 
> > > > future
> > > > + * struct extensions are made such that all new fields are *appended* 
> > > > to the
> > > > + * old struct, and zeroed-out new fields have the same meaning as the 
> > > > old
> > > > + * struct).
> > > > + *
> > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by 
> > > > user space.
> > > > + * The recommended usage is something like the following:
> > > > + *
> > > > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > + *   {
> > > > + *  int err;
> > > > + *  struct foo karg = {};
> > > > + *
> > > > + *  // do something with karg
> > > > + *
> > > > + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > + *  if (err)
> > > > + *return err;
> > > > + *
> > > > + *  // ...
> > > > + *   }
> > > > + *
> > > > + * There are three cases to consider:
> > > > + *  * If @usize == @ksize, then it's copied verbatim.
> > > > + *  * If @usize < @ksize, then kernel space is "returning" a newer 
> > > > struct to an
> > > > + *older user space. In order to avoid user space getting incomplete
> > > > + *information (new fields might be important), all trailing bytes 
> > > > in @src
> > > > + *(@ksize - @usize) must be zerored
> > > 
> > > s/zerored/zero/, right?
> > 
> > It should've been "zeroed".
> 
> That reads wrong to me; that way it reads like this function must take
> that action and zero out the 'rest'; which is just wrong.
> 
> This function must verify those bytes are zero, not make them zero.

Right, in my head I was thinking "must have been zeroed" which isn't
what it says. I'll switch to "zero".

> > > >  , otherwise -EFBIG is returned.
> > > 
> > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> > 
> > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > struct passed to user-space. I would personally have preferred EMSGSIZE
> > instead of EFBIG, but felt using the existing error codes would be less
> > confusing.
> 
> Sadly a recent commit:
> 
>   1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify 
> sched_read_attr() ABI logic and code")
> 
> Made the situation even 'worse'.

I hadn't seen this patch before, and I have a few questions taking a
look at it:

 * An error code for a particular behaviour was changed (EFBIG ->
   E2BIG). Is this not a userspace breakage (I know Linus went ballistic
   about something similar a while ago[1]), or did I misunderstand what
   the issue was in [1]?
   * At the risk of bike-shedding -- if we are changing it, wouldn't
 -EMSGSIZE be more appropriate? To be fair, picking errno values has
 always been more of an art than a science, but to my ears "Argument
 list too long" doesn't make too much sense in the context of
 "returning" a struct back to userspace (and the cause of the error
 is that the argument passed by user space *isn't big enough*). If
 there was an E2SMALL that would also work. ;)

 * Do you want me to write a patch based on that, to switch it to
   copy_struct_to_user()?

 * That patch removes the "are there non-zero bytes in the tail that
   userspace won't know about" check (which I have included in mine). I
   understand that this caused issues specifically with sched_getattr(2)
   due to the default value not being zero -- how should we rectify that
   (given that we'd hopefully want to port everyone who uses that
   interface to copy_struct_{to,from}_user())?

 * Given that the [uk]attr->size construct is pretty important to the
   usability of the sched and perf interfaces, should we require (or
   encourage) it for all struct-extension syscall setups?

> > > > +   if (unlikely(!access_ok(src, usize)))
> > > > +   return -EFAULT;
> > > > +
> > > > +   /* Deal with trailing bytes. */
> > > > +   if (usize < ksize)
> > > > +   memset(dst + size, 0, rest);
> > > > +   else if (usize > ksize) {
> > > > +   const void __user *addr = src + size;
> > > > +   char buffer[BUFFER_SIZE] = {};
> > > 
> > > Isn't that too big for on-stack?
> > 
> > Is a 64-byte buffer too big? I picked the 

Re: [PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Christophe Leroy




On 09/05/2019 12:35 PM, Nicholas Piggin wrote:

System call entry and particularly exit code is beyond the limit of what
is reasonable to implement in asm.

This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.

Null syscall test is about 5% faster after this patch, because the exit
work is handled under local_irq_disable, and the hard mask and pending
interrupt replay is handled after that, which avoids games with MSR.

Signed-off-by: Nicholas Piggin 


Cannot apply it on latest powerpc/merge. On what does that apply ?

# git show --oneline
62d4496a1ecc (HEAD -> int64, hubppc/merge, merge) Automatic merge of 
branches 'master', 'next' and 'fixes' into merge


# git am -3 
/root/Downloads/v3-powerpc-64-system-call-implement-the-bulk-of-the-logic-in-C.patch 


Applying: powerpc/64: system call implement the bulk of the logic in C
Using index info to reconstruct a base tree...
M   arch/powerpc/include/asm/asm-prototypes.h
M   arch/powerpc/include/asm/ptrace.h
M   arch/powerpc/include/asm/time.h
M   arch/powerpc/kernel/Makefile
M   arch/powerpc/kernel/entry_64.S
Falling back to patching base and 3-way merge...
Auto-merging arch/powerpc/kernel/entry_64.S
CONFLICT (content): Merge conflict in arch/powerpc/kernel/entry_64.S
Auto-merging arch/powerpc/kernel/Makefile
CONFLICT (content): Merge conflict in arch/powerpc/kernel/Makefile
Auto-merging arch/powerpc/include/asm/time.h
Auto-merging arch/powerpc/include/asm/ptrace.h
Auto-merging arch/powerpc/include/asm/asm-prototypes.h
error: Failed to merge in the changes.
Patch failed at 0001 powerpc/64: system call implement the bulk of the 
logic in C

The copy of the patch that failed is found in: .git/rebase-apply/patch
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Christophe



v3:
- Fix !KUAP build [mpe]
- Fix BookE build/boot [mpe]
- Don't trace irqs with MSR[RI]=0
- Don't allow syscall_exit_prepare to be ftraced, because function
   graph tracing which traces exits barfs after the IRQ state is
   prepared for kernel exit.
- Fix BE syscall table to use normal function descriptors now that they
   are called from C.
- Comment syscall_exit_prepare.
---
  arch/powerpc/include/asm/asm-prototypes.h |  11 -
  .../powerpc/include/asm/book3s/64/kup-radix.h |  14 +-
  arch/powerpc/include/asm/cputime.h|  24 ++
  arch/powerpc/include/asm/hw_irq.h |   4 +
  arch/powerpc/include/asm/ptrace.h |   3 +
  arch/powerpc/include/asm/signal.h |   3 +
  arch/powerpc/include/asm/switch_to.h  |   5 +
  arch/powerpc/include/asm/time.h   |   3 +
  arch/powerpc/kernel/Makefile  |   3 +-
  arch/powerpc/kernel/entry_64.S| 337 +++---
  arch/powerpc/kernel/signal.h  |   2 -
  arch/powerpc/kernel/syscall_64.c  | 195 ++
  arch/powerpc/kernel/systbl.S  |   9 +-
  13 files changed, 300 insertions(+), 313 deletions(-)
  create mode 100644 arch/powerpc/kernel/syscall_64.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ec1c97a8e8cb..f00ef8924a99 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -92,14 +92,6 @@ long sys_switch_endian(void);
  notrace unsigned int __check_irq_replay(void);
  void notrace restore_interrupts(void);
  
-/* ptrace */

-long do_syscall_trace_enter(struct pt_regs *regs);
-void do_syscall_trace_leave(struct pt_regs *regs);
-
-/* process */
-void restore_math(struct pt_regs *regs);
-void restore_tm_state(struct pt_regs *regs);
-
  /* prom_init (OpenFirmware) */
  unsigned long __init prom_init(unsigned long r3, unsigned long r4,
   unsigned long pp,
@@ -110,9 +102,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
  void __init early_setup(unsigned long dt_ptr);
  void early_setup_secondary(void);
  
-/* time */

-void accumulate_stolen_time(void);
-
  /* misc runtime */
  extern u64 __bswapdi2(u64);
  extern s64 __lshrdi3(s64, int);
diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index f254de956d6a..07058edc5970 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -3,6 +3,7 @@
  #define _ASM_POWERPC_BOOK3S_64_KUP_RADIX_H
  
  #include 

+#include 
  
  #define AMR_KUAP_BLOCK_READ	UL(0x4000)

  #define AMR_KUAP_BLOCK_WRITE  UL(0x8000)
@@ -56,7 +57,14 @@
  
  #ifdef CONFIG_PPC_KUAP
  
-#include 

+#include 
+#include 
+
+static inline void kuap_check_amr(void)
+{
+   if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG) && 
mmu_has_feature(MMU_FTR_RADIX_KUAP))
+   

Re: lockdep warning while booting POWER9 PowerNV

2019-09-05 Thread Qian Cai
On Thu, 2019-09-05 at 13:55 +1000, Michael Ellerman wrote:
> Bart Van Assche  writes:
> > On 8/30/19 2:13 PM, Qian Cai wrote:
> > > https://raw.githubusercontent.com/cailca/linux-mm/master/powerpc.config
> > > 
> > > Once in a while, booting an IBM POWER9 PowerNV system (8335-GTH) would
> > > generate
> > > a warning in lockdep_register_key() at,
> > > 
> > > if (WARN_ON_ONCE(static_obj(key)))
> > > 
> > > because
> > > 
> > > key = 0xc19ad118
> > > &_stext = 0xc000
> > > &_end = 0xc49d
> > > 
> > > i.e., it will cause static_obj() returns 1.
> > 
> > (back from a trip)
> > 
> > Hi Qian,
> > 
> > Does this mean that on POWER9 it can happen that a dynamically allocated 
> > object has an address that falls between &_stext and &_end?
> 
> I thought that was true on all arches due to initmem, but seems not.
> 
> I guess we have the same problem as s390 and we need to define
> arch_is_kernel_initmem_freed().

Actually, it is in the .bss section. The commit 2d4f567103ff ("KVM: PPC:
Introduce kvm_tmp framework") adds kvm_tmp[] into the .bss section and then frees
the rest of the unused space back to the page allocator.

kernel_init
  kvm_guest_init
kvm_free_tmp
  free_reserved_area
free_unref_page
  free_unref_page_prepare

Later, alloc_workqueue() happens to allocate some pages from there, and triggers
the warning. Not sure what the best way to solve this is.
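
One option, mirroring s390's arch_is_kernel_initmem_freed(), might be
an arch hook so that lockdep's static_obj() can tell the range was
handed back - very roughly (untested; kvm_tmp_freed_start/end are
hypothetical exports of the region kvm_free_tmp() releases):

	/* sketch, e.g. in arch/powerpc/include/asm/sections.h */
	#define arch_is_kernel_initmem_freed arch_is_kernel_initmem_freed
	static inline int arch_is_kernel_initmem_freed(unsigned long addr)
	{
		return addr >= kvm_tmp_freed_start &&
		       addr <  kvm_tmp_freed_end;
	}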


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Christian Brauner  wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> > 
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> > 
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
> >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> >  always rejects differently-sized struct arguments.
> > 
> > Suggested-by: Rasmus Villemoes 
> > Signed-off-by: Aleksa Sarai 

[...]

> > +   if (unlikely(!access_ok(src, usize)))
> > +   return -EFAULT;
> > +
> > +   /* Deal with trailing bytes. */
> > +   if (usize < ksize)
> > +   memset(dst + size, 0, rest);
[...]
> That's a change in behavior for clone3() and sched at least, no? Unless
> - which I guess you might have done - you have moved the "error out when
> the struct is too small" part before the call to copy_struct_from_user()
> for them.

Yes, I've put the minimum size check to the callers in all of the
cases (in the case of clone3() I've #define'd a CLONE_ARGS_SIZE_VER0 to
match the others -- see patch 2 of the series).
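
In other words callers keep an explicit minimum-size check in front,
along these lines (sketch only; see patch 2 for the real
CLONE_ARGS_SIZE_VER0 definition):

	if (unlikely(usize < CLONE_ARGS_SIZE_VER0))
		return -EINVAL;

	err = copy_struct_from_user(&kargs, sizeof(kargs), uargs, usize);
	if (err)
		return err;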

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





[PATCH v3] powerpc/64: system call implement the bulk of the logic in C

2019-09-05 Thread Nicholas Piggin
System call entry and particularly exit code is beyond the limit of what
is reasonable to implement in asm.

This conversion moves all conditional branches out of the asm code,
except for the case that all GPRs should be restored at exit.

Null syscall test is about 5% faster after this patch, because the exit
work is handled under local_irq_disable, and the hard mask and pending
interrupt replay is handled after that, which avoids games with MSR.

Signed-off-by: Nicholas Piggin 

v3:
- Fix !KUAP build [mpe]
- Fix BookE build/boot [mpe]
- Don't trace irqs with MSR[RI]=0
- Don't allow syscall_exit_prepare to be ftraced, because function
  graph tracing which traces exits barfs after the IRQ state is
  prepared for kernel exit.
- Fix BE syscall table to use normal function descriptors now that they
  are called from C.
- Comment syscall_exit_prepare.
---
 arch/powerpc/include/asm/asm-prototypes.h |  11 -
 .../powerpc/include/asm/book3s/64/kup-radix.h |  14 +-
 arch/powerpc/include/asm/cputime.h|  24 ++
 arch/powerpc/include/asm/hw_irq.h |   4 +
 arch/powerpc/include/asm/ptrace.h |   3 +
 arch/powerpc/include/asm/signal.h |   3 +
 arch/powerpc/include/asm/switch_to.h  |   5 +
 arch/powerpc/include/asm/time.h   |   3 +
 arch/powerpc/kernel/Makefile  |   3 +-
 arch/powerpc/kernel/entry_64.S| 337 +++---
 arch/powerpc/kernel/signal.h  |   2 -
 arch/powerpc/kernel/syscall_64.c  | 195 ++
 arch/powerpc/kernel/systbl.S  |   9 +-
 13 files changed, 300 insertions(+), 313 deletions(-)
 create mode 100644 arch/powerpc/kernel/syscall_64.c

diff --git a/arch/powerpc/include/asm/asm-prototypes.h 
b/arch/powerpc/include/asm/asm-prototypes.h
index ec1c97a8e8cb..f00ef8924a99 100644
--- a/arch/powerpc/include/asm/asm-prototypes.h
+++ b/arch/powerpc/include/asm/asm-prototypes.h
@@ -92,14 +92,6 @@ long sys_switch_endian(void);
 notrace unsigned int __check_irq_replay(void);
 void notrace restore_interrupts(void);
 
-/* ptrace */
-long do_syscall_trace_enter(struct pt_regs *regs);
-void do_syscall_trace_leave(struct pt_regs *regs);
-
-/* process */
-void restore_math(struct pt_regs *regs);
-void restore_tm_state(struct pt_regs *regs);
-
 /* prom_init (OpenFirmware) */
 unsigned long __init prom_init(unsigned long r3, unsigned long r4,
   unsigned long pp,
@@ -110,9 +102,6 @@ unsigned long __init prom_init(unsigned long r3, unsigned 
long r4,
 void __init early_setup(unsigned long dt_ptr);
 void early_setup_secondary(void);
 
-/* time */
-void accumulate_stolen_time(void);
-
 /* misc runtime */
 extern u64 __bswapdi2(u64);
 extern s64 __lshrdi3(s64, int);
diff --git a/arch/powerpc/include/asm/book3s/64/kup-radix.h 
b/arch/powerpc/include/asm/book3s/64/kup-radix.h
index f254de956d6a..07058edc5970 100644
--- a/arch/powerpc/include/asm/book3s/64/kup-radix.h
+++ b/arch/powerpc/include/asm/book3s/64/kup-radix.h
@@ -3,6 +3,7 @@
 #define _ASM_POWERPC_BOOK3S_64_KUP_RADIX_H
 
 #include 
+#include 
 
 #define AMR_KUAP_BLOCK_READUL(0x4000)
 #define AMR_KUAP_BLOCK_WRITE   UL(0x8000)
@@ -56,7 +57,14 @@
 
 #ifdef CONFIG_PPC_KUAP
 
-#include 
+#include 
+#include 
+
+static inline void kuap_check_amr(void)
+{
+   if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG) && 
mmu_has_feature(MMU_FTR_RADIX_KUAP))
+   WARN_ON_ONCE(mfspr(SPRN_AMR) != AMR_KUAP_BLOCKED);
+}
 
 /*
  * We support individually allowing read or write, but we don't support nesting
@@ -101,6 +109,10 @@ static inline bool bad_kuap_fault(struct pt_regs *regs, 
bool is_write)
(regs->kuap & (is_write ? AMR_KUAP_BLOCK_WRITE : 
AMR_KUAP_BLOCK_READ)),
"Bug: %s fault blocked by AMR!", is_write ? "Write" : 
"Read");
 }
+#else /* CONFIG_PPC_KUAP */
+static inline void kuap_check_amr(void)
+{
+}
 #endif /* CONFIG_PPC_KUAP */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/cputime.h 
b/arch/powerpc/include/asm/cputime.h
index 2431b4ada2fa..c43614cffaac 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -60,6 +60,30 @@ static inline void arch_vtime_task_switch(struct task_struct 
*prev)
 }
 #endif
 
+static inline void account_cpu_user_entry(void)
+{
+   unsigned long tb = mftb();
+   struct cpu_accounting_data *acct = get_accounting(current);
+
+   acct->utime += (tb - acct->starttime_user);
+   acct->starttime = tb;
+}
+static inline void account_cpu_user_exit(void)
+{
+   unsigned long tb = mftb();
+   struct cpu_accounting_data *acct = get_accounting(current);
+
+   acct->stime += (tb - acct->starttime);
+   acct->starttime_user = tb;
+}
+
 #endif /* __KERNEL__ */
+#else /* CONFIG_VIRT_CPU_ACCOUNTING_NATIVE */
+static inline void account_cpu_user_entry(void)
+{
+}
+static inline void account_cpu_user_exit(void)
+{
+}
 #endif /* 

Re: [PATCH v5 2/2] powerpc: Add support to initialize ima policy rules

2019-09-05 Thread Nayna




On 09/02/2019 07:52 AM, Michael Ellerman wrote:

Hi Nayna,


Hi Michael,



Some more comments below.

Nayna Jain  writes:

POWER secure boot relies on the kernel IMA security subsystem to
perform the OS kernel image signature verification.

Again this is just a design choice we've made, it's not specified
anywhere or anything like that. And it only applies to bare metal secure
boot, at least so far. AIUI.


Yes. I will make it consistent and use "PowerNV" throughout.


Since each secure
boot mode has different IMA policy requirements, dynamic definition of
the policy rules based on the runtime secure boot mode of the system is
required. On systems that support secure boot, but have it disabled,
only measurement policy rules of the kernel image and modules are
defined.

It's probably worth mentioning that we intend to use this in our
Linux-based boot loader, which uses kexec, and that's one of the reasons
why we're particularly interested in defining the rules for kexec?


Yes. Agreed. I will update patch description to add this.




This patch defines the arch-specific implementation to retrieve the
secure boot mode of the system and accordingly configures the IMA policy
rules.

This patch provides arch-specific IMA policies if PPC_SECURE_BOOT
config is enabled.

Signed-off-by: Nayna Jain 
---
  arch/powerpc/Kconfig   |  2 ++
  arch/powerpc/kernel/Makefile   |  2 +-
  arch/powerpc/kernel/ima_arch.c | 50 ++
  include/linux/ima.h|  3 +-
  4 files changed, 55 insertions(+), 2 deletions(-)
  create mode 100644 arch/powerpc/kernel/ima_arch.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index c902a39124dc..42109682b727 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -917,6 +917,8 @@ config PPC_SECURE_BOOT
bool
default n
depends on PPC64
+   depends on IMA
+   depends on IMA_ARCH_POLICY
help
  Linux on POWER with firmware secure boot enabled needs to define
  security policies to extend secure boot to the OS. This config
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index d310ebb4e526..520b1c814197 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -157,7 +157,7 @@ endif
  obj-$(CONFIG_EPAPR_PARAVIRT)  += epapr_paravirt.o epapr_hcalls.o
  obj-$(CONFIG_KVM_GUEST)   += kvm.o kvm_emul.o
  
-obj-$(CONFIG_PPC_SECURE_BOOT)	+= secboot.o

+obj-$(CONFIG_PPC_SECURE_BOOT)  += secboot.o ima_arch.o
  
  # Disable GCOV, KCOV & sanitizers in odd or sensitive code

  GCOV_PROFILE_prom_init.o := n
diff --git a/arch/powerpc/kernel/ima_arch.c b/arch/powerpc/kernel/ima_arch.c
new file mode 100644
index ..ac90fac83338
--- /dev/null
+++ b/arch/powerpc/kernel/ima_arch.c
@@ -0,0 +1,50 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ * ima_arch.c
+ *  - initialize ima policies for PowerPC Secure Boot
+ */
+
+#include 
+#include 
+
+bool arch_ima_get_secureboot(void)
+{
+   return get_powerpc_secureboot();
+}
+
+/*
+ * File signature verification is not needed, include only measurements
+ */
+static const char *const default_arch_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK",
+   "measure func=MODULE_CHECK",
+   NULL
+};

The rules above seem fairly self explanatory.


+
+/* Both file signature verification and measurements are needed */
+static const char *const sb_arch_rules[] = {
+   "measure func=KEXEC_KERNEL_CHECK template=ima-modsig",
+   "appraise func=KEXEC_KERNEL_CHECK appraise_type=imasig|modsig",
+#if IS_ENABLED(CONFIG_MODULE_SIG)
+   "measure func=MODULE_CHECK",
+#else
+   "measure func=MODULE_CHECK template=ima-modsig",
+   "appraise func=MODULE_CHECK appraise_type=imasig|modsig",
+#endif

But these ones are not so obvious, at least to me who knows very little
about IMA.

Can you add a one line comment to each of the ones in here saying what
it does and why we want it?


Sure.




+   NULL
+};
+
+/*
+ * On PowerPC, file measurements are to be added to the IMA measurement list
+ * irrespective of the secure boot state of the system.

Why? Just because we think it's useful? Would be good to provide some
further justification.


Sure. I will clarify this in the next version.

Thanks & Regards,
    - Nayna


Re: [PATCH v2 2/3] powerpc/tm: Fix restoring FP/VMX facility incorrectly on interrupts

2019-09-05 Thread Michael Ellerman
On Wed, 2019-09-04 at 04:55:28 UTC, gromero wrote:
> From: Gustavo Romero 
> 
> When in userspace and MSR FP=0 the hardware FP state is unrelated to
> the current process. This is extended for transactions where if tbegin
> is run with FP=0, the hardware checkpoint FP state will also be
> unrelated to the current process. Due to this, we need to ensure this
> hardware checkpoint is updated with the correct state before we enable
> FP for this process.
...
> 
> This fixes CVE-2019-15031.
> 
> Fixes: a7771176b439 ("powerpc: Don't enable FP/Altivec if not checkpointed")
> Cc: sta...@vger.kernel.org # 4.15+
> Signed-off-by: Gustavo Romero 
> Signed-off-by: Michael Neuling 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/a8318c13e79badb92bc6640704a64cc022a6eb97

cheers


Re: [PATCH v2 1/3] powerpc/tm: Fix FP/VMX unavailable exceptions inside a transaction

2019-09-05 Thread Michael Ellerman
On Wed, 2019-09-04 at 04:55:27 UTC, gromero wrote:
> From: Gustavo Romero 
> 
> When we take an FP unavailable exception in a transaction we have to
> account for the hardware FP TM checkpointed registers being
> incorrect. In this case for this process we know the current and
> checkpointed FP registers must be the same (since FP wasn't used
> inside the transaction) hence in the thread_struct we copy the current
> FP registers to the checkpointed ones.
...
> 
> This fixes CVE-2019-15030.
> 
> Fixes: f48e91e87e67 ("powerpc/tm: Fix FP and VMX register corruption")
> Cc: sta...@vger.kernel.org # 4.12+
> Signed-off-by: Gustavo Romero 
> Signed-off-by: Michael Neuling 

Applied to powerpc fixes, thanks.

https://git.kernel.org/powerpc/c/8205d5d98ef7f155de211f5e2eb6ca03d95a5a60

cheers


Re: [PATCH] powerpc/64: Fix stacktrace on BE when function_graph is enabled

2019-09-05 Thread Naveen N. Rao

Michael Ellerman wrote:

"Naveen N. Rao"  writes:

Michael Ellerman wrote:

Currently if we oops or warn while function_graph is active the stack
trace looks like:
  .trace_graph_return+0xac/0x100
  .ftrace_return_to_handler+0x98/0x140
  .return_to_handler+0x20/0x40
  .return_to_handler+0x0/0x40
  .return_to_handler+0x0/0x40
  .return_to_handler+0x0/0x40
  .return_to_handler+0x0/0x40
  .return_to_handler+0x0/0x40
  .return_to_handler+0x0/0x40
  .cpu_startup_entry+0x34/0x40
  .start_secondary+0x680/0x6f0
  start_secondary_prolog+0x10/0x14

Notice the multiple entries that just show .return_to_handler.

There is logic in show_stack() to detect this case and print the
traced function, but we inadvertently broke it in commit
7d56c65a6ff9 ("powerpc/ftrace: Remove mod_return_to_handler") (2014),
because that commit accidentally removed the dereference of rth which
gets the text address from the function descriptor. Hence this is only
broken on big endian (or technically ELFv1).

Fix it by using the proper accessor, which is ppc_function_entry().
Result is we get a stack trace such as:

  .trace_graph_return+0x134/0x160
  .ftrace_return_to_handler+0x94/0x140
  .return_to_handler+0x20/0x40
  .return_to_handler+0x0/0x40 (.shared_cede_loop+0x48/0x130)
  .return_to_handler+0x0/0x40 (.cpuidle_enter_state+0xa0/0x690)
  .return_to_handler+0x0/0x40 (.cpuidle_enter+0x44/0x70)
  .return_to_handler+0x0/0x40 (.call_cpuidle+0x68/0xc0)
  .return_to_handler+0x0/0x40 (.do_idle+0x37c/0x400)
  .return_to_handler+0x0/0x40 (.cpu_startup_entry+0x30/0x50)
  .rest_init+0x224/0x348

Fixes: 7d56c65a6ff9 ("powerpc/ftrace: Remove mod_return_to_handler")
Signed-off-by: Michael Ellerman 
---
 arch/powerpc/kernel/process.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index 8fc4de0d22b4..1601d7cfe45e 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -2048,7 +2048,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
*stack)
 #ifdef CONFIG_FUNCTION_GRAPH_TRACER
struct ftrace_ret_stack *ret_stack;
extern void return_to_handler(void);
-   unsigned long rth = (unsigned long)return_to_handler;
+   unsigned long rth = ppc_function_entry(return_to_handler);


Thanks! This looks good to me. A small suggestion though -- can we use 
dereference_kernel_function_descriptor() instead? It will be a nop for 
ABIv2, which would be nice, but not really a major deal.


ppc_function_entry() isn't a nop on ABIv2, *if* the function has a local
entry point.

As it happens return_to_handler doesn't have a local entry point, so it
is currently a nop.


What I meant was that we still go read the first two instructions to 
identify if there is a GEP with ppc_function_entry(). But, 
dereference_kernel_function_descriptor() would be compiled out.




But if return_to_handler did have a local entry then
ppc_function_entry() would do the right thing here because we use
ppc_function_entry() in prepare_ftrace_return().

At least I think that's true :)


That's a good point :)
However, I think we should never have return_to_handler() with a GEP/LEP 
since it is not a regular function.


We should switch use of ppc_function_entry() in prepare_ftrace_return() 
to dereference_kernel_function_descriptor(). I will send a patch for 
that.
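
For the show_stack() case that would presumably amount to (untested):

	unsigned long rth = (unsigned long)
		dereference_kernel_function_descriptor((void *)return_to_handler);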



- Naveen



Re: [PATCH v5 1/2] powerpc: detect the secure boot mode of the system

2019-09-05 Thread Nayna




On 09/02/2019 07:52 AM, Michael Ellerman wrote:

Hi Nayna,


Hi Michael,



Sorry I've taken so long to get to this series, there's just too many
patches that need reviewing :/


No problem. I understand. Thanks for reviewing.



Nayna Jain  writes:

Secure boot on POWER defines different IMA policies based on the secure
boot state of the system.

The terminology throughout is a bit vague, we have POWER, PowerPC, Linux
on POWER etc.

What this patch is talking about is a particular implemention of secure
boot on some OpenPOWER machines running bare metal - am I right?

So saying "Secure boot on POWER defines different IMA policies" is a bit
broad I think. Really we've just decided that a way to implement secure
boot is to use IMA policies.


I think the idea was to convey that the same design can be reused or 
extended as needed.
But I agree that for now it is only OpenPOWER machines running on 
bare metal; I will fix the wording to use "PowerNV" consistently.







This patch defines a function to detect the secure boot state of the
system.

The PPC_SECURE_BOOT config represents the base enablement of secureboot
on POWER.

Signed-off-by: Nayna Jain 
---
  arch/powerpc/Kconfig   | 11 +
  arch/powerpc/include/asm/secboot.h | 27 
  arch/powerpc/kernel/Makefile   |  2 +
  arch/powerpc/kernel/secboot.c  | 71 ++
  4 files changed, 111 insertions(+)
  create mode 100644 arch/powerpc/include/asm/secboot.h
  create mode 100644 arch/powerpc/kernel/secboot.c

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 77f6ebf97113..c902a39124dc 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -912,6 +912,17 @@ config PPC_MEM_KEYS
  
  	  If unsure, say y.
  
+config PPC_SECURE_BOOT

+   prompt "Enable PowerPC Secure Boot"

How about "Enable secure boot support"


Yes. Sounds better.




+   bool
+   default n

The default is 'n', so you don't need that default line.


Sure.





+   depends on PPC64

Should it just depend on POWERNV for now? AFAIK there's nothing in here
that's necessarily going to be shared with the guest secure boot code, is
there?


Yes. sounds good.





+   help
+ Linux on POWER with firmware secure boot enabled needs to define
+ security policies to extend secure boot to the OS. This config
+ allows user to enable OS Secure Boot on PowerPC systems that
+ have firmware secure boot support.

Again POWER vs PowerPC.

I think something like:

"Enable support for secure boot on some systems that have firmware
support for it. If in doubt say N."


Sure.





diff --git a/arch/powerpc/include/asm/secboot.h 
b/arch/powerpc/include/asm/secboot.h

secure_boot.h would be fine.


Sure.




new file mode 100644
index ..e726261bb00b
--- /dev/null
+++ b/arch/powerpc/include/asm/secboot.h
@@ -0,0 +1,27 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * PowerPC secure boot definitions
+ *
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 

I prefer to not have email addresses in copyright headers, as they just
bit rot. Your email is in the git log.


Sure.





+ *
+ */
+#ifndef POWERPC_SECBOOT_H
+#define POWERPC_SECBOOT_H

We usually do _ASM_POWERPC_SECBOOT_H (or _ASM_POWERPC_SECURE_BOOT_H).


Sure.




+#ifdef CONFIG_PPC_SECURE_BOOT
+extern struct device_node *is_powerpc_secvar_supported(void);
+extern bool get_powerpc_secureboot(void);

You don't need 'extern' for functions in headers.


Yes. will fix.




+#else
+static inline struct device_node *is_powerpc_secvar_supported(void)
+{
+   return NULL;
+}
+
+static inline bool get_powerpc_secureboot(void)
+{
+   return false;
+}
+
+#endif
+#endif
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index ea0c69236789..d310ebb4e526 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -157,6 +157,8 @@ endif
  obj-$(CONFIG_EPAPR_PARAVIRT)  += epapr_paravirt.o epapr_hcalls.o
  obj-$(CONFIG_KVM_GUEST)   += kvm.o kvm_emul.o
  
+obj-$(CONFIG_PPC_SECURE_BOOT)	+= secboot.o

+
  # Disable GCOV, KCOV & sanitizers in odd or sensitive code
  GCOV_PROFILE_prom_init.o := n
  KCOV_INSTRUMENT_prom_init.o := n
diff --git a/arch/powerpc/kernel/secboot.c b/arch/powerpc/kernel/secboot.c
new file mode 100644
index ..5ea0d52d64ef
--- /dev/null
+++ b/arch/powerpc/kernel/secboot.c
@@ -0,0 +1,71 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2019 IBM Corporation
+ * Author: Nayna Jain 
+ *
+ * secboot.c
+ *  - util function to get powerpc secboot state

That's not really necessary.


Sure.




+ */
+#include 
+#include 
+#include 
+
+struct device_node *is_powerpc_secvar_supported(void)

This is a pretty weird signature. The "is_" implies it will return a
bool, but then it actually returns a device node *.


Yes. Agree. Will fix.




+{
+   struct device_node *np;
+   int status;
+
+   np = 

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Christian Brauner  wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> > 
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> > 
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
> >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> >  always rejects differently-sized struct arguments.
> > 
> > Suggested-by: Rasmus Villemoes 
> > Signed-off-by: Aleksa Sarai 
> 
> I would probably split this out into a separate patchset. It can very
> well go in before openat2(). Thoughts?

Yeah, I'll split this and the related patches out -- though I will admit
I'm not sure how you're supposed to deal with multiple independent
patchsets that depend on each other. How will folks reviewing openat2(2)
know to include the lib/struct_user.c changes?

Also, whose tree should it go through?

-- 
Aleksa Sarai
Senior Software Engineer (Containers)
SUSE Linux GmbH





Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Rasmus Villemoes
On 05/09/2019 13.05, Christian Brauner wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:

>> +if (unlikely(!access_ok(dst, usize)))
>> +return -EFAULT;
>> +
>> +/* Deal with trailing bytes. */
>> +if (usize < ksize) {
>> +if (memchr_inv(src + size, 0, rest))
>> +return -EFBIG;
>> +} else if (usize > ksize) {
>> +if (__memzero_user(dst + size, rest))
>> +return -EFAULT;
> 
> Is zeroing that memory really our job? Seems to me we should just check
> it is zeroed.

Of course it is, otherwise you'd require userspace to clear the output
buffer it gives us, which in the majority of cases is wasted work. It's
much easier to reason about if we just say "the kernel populates [uaddr,
uaddr + usize)".

It's completely symmetric to copy_struct_from_user doing a memset() of
the tail of the kernel buffer in case of ksize>usize - you wouldn't want
to require the kernel callers to pass a zeroed buffer to
copy_struct_from_user() - it's just that when we memset(__user*),
there's an error check to do.

Rasmus


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> > On 2019-09-05, Peter Zijlstra  wrote:
> > > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > > +/**
> > > > + * copy_struct_to_user: copy a struct to user space
> > > > + * @dst:   Destination address, in user space.
> > > > + * @usize: Size of @dst struct.
> > > > + * @src:   Source address, in kernel space.
> > > > + * @ksize: Size of @src struct.
> > > > + *
> > > > + * Copies a struct from kernel space to user space, in a way that 
> > > > guarantees
> > > > + * backwards-compatibility for struct syscall arguments (as long as 
> > > > future
> > > > + * struct extensions are made such that all new fields are *appended* 
> > > > to the
> > > > + * old struct, and zeroed-out new fields have the same meaning as the 
> > > > old
> > > > + * struct).
> > > > + *
> > > > + * @ksize is just sizeof(*dst), and @usize should've been passed by 
> > > > user space.
> > > > + * The recommended usage is something like the following:
> > > > + *
> > > > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > > + *   {
> > > > + *  int err;
> > > > + *  struct foo karg = {};
> > > > + *
> > > > + *  // do something with karg
> > > > + *
> > > > + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > > + *  if (err)
> > > > + *return err;
> > > > + *
> > > > + *  // ...
> > > > + *   }
> > > > + *
> > > > + * There are three cases to consider:
> > > > + *  * If @usize == @ksize, then it's copied verbatim.
> > > > + *  * If @usize < @ksize, then kernel space is "returning" a newer 
> > > > struct to an
> > > > + *older user space. In order to avoid user space getting incomplete
> > > > + *information (new fields might be important), all trailing bytes 
> > > > in @src
> > > > + *(@ksize - @usize) must be zerored
> > > 
> > > s/zerored/zero/, right?
> > 
> > It should've been "zeroed".
> 
> That reads wrong to me; that way it reads like this function must take
> that action and zero out the 'rest'; which is just wrong.
> 
> This function must verify those bytes are zero, not make them zero.
> 
> > > >  , otherwise -EFBIG is returned.
> > > 
> > > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> > 
> > This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> > a "too big" struct passed to the kernel, and EFBIG for a "too big"
> > struct passed to user-space. I would personally have preferred EMSGSIZE
> > instead of EFBIG, but felt using the existing error codes would be less
> > confusing.
> 
> Sadly a recent commit:
> 
>   1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify 
> sched_read_attr() ABI logic and code")
> 
> Made the situation even 'worse'.

And thinking more about things; I'm not convinced the above patch is
actually right.

Do we really want to simply truncate all the attributes of the task?

And should we not at least set sched_flags when there are non-default
clamp values applied?

See; that is I think the primary bug that had chrt failing; we tried to
publish the default clamp values as !0.


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to, from}_user helpers

2019-09-05 Thread Gabriel Paubert
On Thu, Sep 05, 2019 at 11:09:35AM +0200, Andreas Schwab wrote:
> On Sep 05 2019, Aleksa Sarai  wrote:
> 
> > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > new file mode 100644
> > index ..7301ab1bbe98
> > --- /dev/null
> > +++ b/lib/struct_user.c
> > @@ -0,0 +1,182 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (C) 2019 SUSE LLC
> > + * Copyright (C) 2019 Aleksa Sarai 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define BUFFER_SIZE 64
> > +
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > already
> > + * checked access_ok(p, size).
> > + */
> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +   const char zeros[BUFFER_SIZE] = {};
> 
> Perhaps make that static?

On SMP? It should at least be per cpu, and I'm not even sure
with preemption.

Gabriel

> 
> Andreas.
> 
> -- 
> Andreas Schwab, sch...@linux-m68k.org
> GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
> "And now for something completely different."


Re: [PATCH v5 14/31] powernv/fadump: define register/un-register callback functions

2019-09-05 Thread Hari Bathini



On 05/09/19 12:53 PM, Michael Ellerman wrote:
> Hari Bathini  writes:
>> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
>> b/arch/powerpc/platforms/powernv/opal-fadump.c
>> index e466f7e..91fb909 100644
>> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
>> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
>> @@ -52,6 +68,8 @@ static ulong opal_fadump_init_mem_struct(struct fw_dump 
>> *fadump_conf)
>>  opal_fdm->fadumphdr_addr = (opal_fdm->rgn[0].dest +
>>  fadump_conf->boot_memory_size);
>>  
>> +opal_fadump_update_config(fadump_conf, opal_fdm);
>> +
>>  return addr;
>>  }
>>  
>> @@ -97,12 +115,69 @@ static int opal_fadump_setup_metadata(struct fw_dump 
>> *fadump_conf)
>>  
>>  static int opal_fadump_register(struct fw_dump *fadump_conf)
>>  {
>> -return -EIO;
>> +int i, err = -EIO;
>> +s64 rc;
> 
> Some compiler versions are warning about this being used uninitialised:
> 
> arch/powerpc/platforms/powernv/opal-fadump.c:316:3: error: 'rc' may be used 
> uninitialized in this function [-Werror=uninitialized]
> 
> http://kisskb.ellerman.id.au/kisskb/buildresult/13943984/
> 
> Which does seem like a legitimate warning.

fixed...



Re: [PATCH] powerpc/64: Fix stacktrace on BE when function_graph is enabled

2019-09-05 Thread Michael Ellerman
"Naveen N. Rao"  writes:
> Michael Ellerman wrote:
>> Currently if we oops or warn while function_graph is active the stack
>> trace looks like:
>>   .trace_graph_return+0xac/0x100
>>   .ftrace_return_to_handler+0x98/0x140
>>   .return_to_handler+0x20/0x40
>>   .return_to_handler+0x0/0x40
>>   .return_to_handler+0x0/0x40
>>   .return_to_handler+0x0/0x40
>>   .return_to_handler+0x0/0x40
>>   .return_to_handler+0x0/0x40
>>   .return_to_handler+0x0/0x40
>>   .cpu_startup_entry+0x34/0x40
>>   .start_secondary+0x680/0x6f0
>>   start_secondary_prolog+0x10/0x14
>> 
>> Notice the multiple entries that just show .return_to_handler.
>> 
>> There is logic in show_stack() to detect this case and print the
>> traced function, but we inadvertently broke it in commit
>> 7d56c65a6ff9 ("powerpc/ftrace: Remove mod_return_to_handler") (2014),
>> because that commit accidentally removed the dereference of rth which
>> gets the text address from the function descriptor. Hence this is only
>> broken on big endian (or technically ELFv1).
>> 
>> Fix it by using the proper accessor, which is ppc_function_entry().
>> Result is we get a stack trace such as:
>> 
>>   .trace_graph_return+0x134/0x160
>>   .ftrace_return_to_handler+0x94/0x140
>>   .return_to_handler+0x20/0x40
>>   .return_to_handler+0x0/0x40 (.shared_cede_loop+0x48/0x130)
>>   .return_to_handler+0x0/0x40 (.cpuidle_enter_state+0xa0/0x690)
>>   .return_to_handler+0x0/0x40 (.cpuidle_enter+0x44/0x70)
>>   .return_to_handler+0x0/0x40 (.call_cpuidle+0x68/0xc0)
>>   .return_to_handler+0x0/0x40 (.do_idle+0x37c/0x400)
>>   .return_to_handler+0x0/0x40 (.cpu_startup_entry+0x30/0x50)
>>   .rest_init+0x224/0x348
>> 
>> Fixes: 7d56c65a6ff9 ("powerpc/ftrace: Remove mod_return_to_handler")
>> Signed-off-by: Michael Ellerman 
>> ---
>>  arch/powerpc/kernel/process.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
>> index 8fc4de0d22b4..1601d7cfe45e 100644
>> --- a/arch/powerpc/kernel/process.c
>> +++ b/arch/powerpc/kernel/process.c
>> @@ -2048,7 +2048,7 @@ void show_stack(struct task_struct *tsk, unsigned long 
>> *stack)
>>  #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>  struct ftrace_ret_stack *ret_stack;
>>  extern void return_to_handler(void);
>> -unsigned long rth = (unsigned long)return_to_handler;
>> +unsigned long rth = ppc_function_entry(return_to_handler);
>
> Thanks! This looks good to me. A small suggestion though -- can we use 
> dereference_kernel_function_descriptor() instead? It will be a nop for 
> ABIv2, which would be nice, but not really a major deal.

ppc_function_entry() isn't a nop on ABIv2, *if* the function has a local
entry point.

As it happens return_to_handler doesn't have a local entry point, so it
is currently a nop.

But if return_to_handler did have a local entry then
ppc_function_entry() would do the right thing here because we use
ppc_function_entry() in prepare_ftrace_return().

At least I think that's true :)

> In either case:
> Reviewed-by: Naveen N. Rao 

cheers


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Rasmus Villemoes  wrote:
> On 04/09/2019 22.19, Aleksa Sarai wrote:
> > A common pattern for syscall extensions is increasing the size of a
> > struct passed from userspace, such that the zero-value of the new fields
> > result in the old kernel behaviour (allowing for a mix of userspace and
> > kernel vintages to operate on one another in most cases). This is done
> > in both directions -- hence two helpers -- though it's more common to
> > have to copy user space structs into kernel space.
> > 
> > Previously there was no common lib/ function that implemented
> > the necessary extension-checking semantics (and different syscalls
> > implemented them slightly differently or incompletely[1]). A future
> > patch replaces all of the common uses of this pattern to use the new
> > copy_struct_{to,from}_user() helpers.
> > 
> > [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
> >  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
> >  always rejects differently-sized struct arguments.
> > 
> > Suggested-by: Rasmus Villemoes 
> > Signed-off-by: Aleksa Sarai 
> > ---
> > diff --git a/lib/struct_user.c b/lib/struct_user.c
> > new file mode 100644
> > index ..7301ab1bbe98
> > --- /dev/null
> > +++ b/lib/struct_user.c
> > @@ -0,0 +1,182 @@
> > +// SPDX-License-Identifier: GPL-2.0-or-later
> > +/*
> > + * Copyright (C) 2019 SUSE LLC
> > + * Copyright (C) 2019 Aleksa Sarai 
> > + */
> > +
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +#include 
> > +
> > +#define BUFFER_SIZE 64
> > +
> > +/*
> > + * "memset(p, 0, size)" but for user space buffers. Caller must have 
> > already
> > + * checked access_ok(p, size).
> > + */
> 
> Isn't this __clear_user() exactly (perhaps except for the return value)?
> Perhaps not every arch has that?

I didn't know about clear_user() -- I will switch to it.

> > +static int __memzero_user(void __user *p, size_t s)
> > +{
> > +   const char zeros[BUFFER_SIZE] = {};
> > +   while (s > 0) {
> > +   size_t n = min(s, sizeof(zeros));
> > +
> > +   if (__copy_to_user(p, zeros, n))
> > +   return -EFAULT;
> > +
> > +   p += n;
> > +   s -= n;
> > +   }
> > +   return 0;
> > +}
> > +
> > +/**
> > + * copy_struct_to_user: copy a struct to user space
> > + * @dst:   Destination address, in user space.
> > + * @usize: Size of @dst struct.
> > + * @src:   Source address, in kernel space.
> > + * @ksize: Size of @src struct.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in 
> > @src.
> > + *  * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > +   const void *src, size_t ksize)
> > +{
> > +   size_t size = min(ksize, usize);
> > +   size_t rest = abs(ksize - usize);
> 
> Eh, I'd avoid abs() here due to the funkiness of the implicit type
> conversions - ksize-usize has type size_t, then that's coerced to an int
> (or a long maybe?), the abs is applied which return an int/long (or
> unsigned versions?). Something like "rest = max(ksize, usize) - size;"
> is more obviously correct and doesn't fall into any
> narrowing/widening/sign extending traps.

Yeah, I originally used "max(ksize, usize) - size" for that reason but
was worried it looked too funky (and some quick tests showed that abs()
gives the right results in most cases -- though I just realised it would
probably not give the right results around SIZE_MAX). I'll switch back.
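
A standalone illustration of the trap (plain userspace C, not from the
patch; the values are only examples):

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
		size_t ksize = 8, usize = 16;

		/* unsigned subtraction wraps: SIZE_MAX - 7 on LP64 */
		size_t wrapped = ksize - usize;

		/* abs() operates on int: on typical LP64 targets the
		 * wrapped value truncates to -8 first, so this "works"
		 * only by accident of the narrowing */
		int narrowed = abs((int)(ksize - usize));

		/* the unambiguous form */
		size_t rest = (ksize > usize ? ksize : usize)
			    - (ksize < usize ? ksize : usize);

		printf("%zu %d %zu\n", wrapped, narrowed, rest);
		return 0;
	}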

> > +   if (unlikely(usize > PAGE_SIZE))
> > +   return -EFAULT;
> 
> Please don't. That is a restriction on all future extensions - once a
> kernel is shipped with a syscall using this helper with that arbitrary
> restriction in place, that syscall is forever prevented from extending
> its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
> it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
> weren't enough for everybody?
>
> This is only for future compatibility, and if someone runs an app
> compiled against 7.3 headers on a 5.4 kernel, they probably don't care
> about performance, but they would like their app to run.

I'm not sure I agree that the limit is in place *forever* -- it's
generally not a break in compatibility to convert an error into a
success (though, there are counterexamples such as mknod(2) -- but that
was a very specific case).

You're right that it would mean that some very new code won't run on
very ancient kernels (assuming we ever pass around structs that
massive), but there should be a reasonable trade-off here IMHO.

If we allow very large sizes, a program could probably DoS the kernel by
allocating a moderately-large block of memory and then spawning a bunch
of threads that all cause the kernel to re-check that the same 1GB block
of memory is zeroed. I haven't tried, but it seems like it's 

Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 07:26:22PM +1000, Aleksa Sarai wrote:
> On 2019-09-05, Peter Zijlstra  wrote:
> > On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > > +/**
> > > + * copy_struct_to_user: copy a struct to user space
> > > + * @dst:   Destination address, in user space.
> > > + * @usize: Size of @dst struct.
> > > + * @src:   Source address, in kernel space.
> > > + * @ksize: Size of @src struct.
> > > + *
> > > + * Copies a struct from kernel space to user space, in a way that 
> > > guarantees
> > > + * backwards-compatibility for struct syscall arguments (as long as 
> > > future
> > > + * struct extensions are made such that all new fields are *appended* to 
> > > the
> > > + * old struct, and zeroed-out new fields have the same meaning as the old
> > > + * struct).
> > > + *
> > > + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> > > space.
> > > + * The recommended usage is something like the following:
> > > + *
> > > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > > + *   {
> > > + *  int err;
> > > + *  struct foo karg = {};
> > > + *
> > > + *  // do something with karg
> > > + *
> > > + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > > + *  if (err)
> > > + *return err;
> > > + *
> > > + *  // ...
> > > + *   }
> > > + *
> > > + * There are three cases to consider:
> > > + *  * If @usize == @ksize, then it's copied verbatim.
> > > + *  * If @usize < @ksize, then kernel space is "returning" a newer 
> > > struct to an
> > > + *older user space. In order to avoid user space getting incomplete
> > > + *information (new fields might be important), all trailing bytes in 
> > > @src
> > > + *(@ksize - @usize) must be zerored
> > 
> > s/zerored/zero/, right?
> 
> It should've been "zeroed".

That reads wrong to me; that way it reads like this function must take
that action and zero out the 'rest'; which is just wrong.

This function must verify those bytes are zero, not make them zero.

> > >  , otherwise -EFBIG is returned.
> > 
> > 'Funny' that, copy_struct_from_user() below seems to use E2BIG.
> 
> This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
> a "too big" struct passed to the kernel, and EFBIG for a "too big"
> struct passed to user-space. I would personally have preferred EMSGSIZE
> instead of EFBIG, but felt using the existing error codes would be less
> confusing.

Sadly a recent commit:

  1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and robustify 
sched_read_attr() ABI logic and code")

Made the situation even 'worse'.


> > > + if (unlikely(!access_ok(src, usize)))
> > > + return -EFAULT;
> > > +
> > > + /* Deal with trailing bytes. */
> > > + if (usize < ksize)
> > > + memset(dst + size, 0, rest);
> > > + else if (usize > ksize) {
> > > + const void __user *addr = src + size;
> > > + char buffer[BUFFER_SIZE] = {};
> > 
> > Isn't that too big for on-stack?
> 
> Is a 64-byte buffer too big? I picked the number "at random" to be the
> size of a cache line, but I could shrink it down to 32 bytes if the size
> is an issue (I wanted to avoid needless allocations -- hence it being
> on-stack).

Ah, my ctags gave me a definition of BUFFER_SIZE that was 512. I suppose
64 should be OK.

> > > +
> > > + while (rest > 0) {
> > > + size_t bufsize = min(rest, sizeof(buffer));
> > > +
> > > + if (__copy_from_user(buffer, addr, bufsize))
> > > + return -EFAULT;
> > > + if (memchr_inv(buffer, 0, bufsize))
> > > + return -E2BIG;
> > > +
> > > + addr += bufsize;
> > > + rest -= bufsize;
> > > + }
> > 
> > The perf implementation uses get_user(); but if that is too slow, surely
> > we can do something with uaccess_try() here?
> 
> Is there a non-x86-specific way to do that (unless I'm mistaken only x86
> has uaccess_try() or the other *_try() wrappers)? The main "performance
> improvement" (if you can even call it that) is that we use memchr_inv()
> which finds non-matching characters more efficiently than just doing a
> loop.

Oh, you're right, that's x86 only :/


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Aleksa Sarai
On 2019-09-05, Peter Zijlstra  wrote:
> On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> > +/**
> > + * copy_struct_to_user: copy a struct to user space
> > + * @dst:   Destination address, in user space.
> > + * @usize: Size of @dst struct.
> > + * @src:   Source address, in kernel space.
> > + * @ksize: Size of @src struct.
> > + *
> > + * Copies a struct from kernel space to user space, in a way that 
> > guarantees
> > + * backwards-compatibility for struct syscall arguments (as long as future
> > + * struct extensions are made such that all new fields are *appended* to 
> > the
> > + * old struct, and zeroed-out new fields have the same meaning as the old
> > + * struct).
> > + *
> > + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> > space.
> > + * The recommended usage is something like the following:
> > + *
> > + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> > + *   {
> > + *  int err;
> > + *  struct foo karg = {};
> > + *
> > + *  // do something with karg
> > + *
> > + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> > + *  if (err)
> > + *return err;
> > + *
> > + *  // ...
> > + *   }
> > + *
> > + * There are three cases to consider:
> > + *  * If @usize == @ksize, then it's copied verbatim.
> > + *  * If @usize < @ksize, then kernel space is "returning" a newer struct 
> > to an
> > + *older user space. In order to avoid user space getting incomplete
> > + *information (new fields might be important), all trailing bytes in 
> > @src
> > + *(@ksize - @usize) must be zerored
> 
> s/zerored/zero/, right?

It should've been "zeroed".

> >  , otherwise -EFBIG is returned.
> 
> 'Funny' that, copy_struct_from_user() below seems to use E2BIG.

This is a copy of the semantics that sched_[sg]etattr(2) uses -- E2BIG for
a "too big" struct passed to the kernel, and EFBIG for a "too big"
struct passed to user-space. I would personally have preferred EMSGSIZE
instead of EFBIG, but felt using the existing error codes would be less
confusing.

> 
> > + *  * If @usize > @ksize, then the kernel is "returning" an older struct 
> > to a
> > + *newer user space. The trailing bytes in @dst (@usize - @ksize) will 
> > be
> > + *zero-filled.
> > + *
> > + * Returns (in all cases, some data may have been copied):
> > + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in 
> > @src.
> > + *  * -EFAULT: access to user space failed.
> > + */
> > +int copy_struct_to_user(void __user *dst, size_t usize,
> > +   const void *src, size_t ksize)
> > +{
> > +   size_t size = min(ksize, usize);
> > +   size_t rest = abs(ksize - usize);
> > +
> > +   if (unlikely(usize > PAGE_SIZE))
> > +   return -EFAULT;
> 
> Not documented above. Implementation consistent with *from*, but see
> below.

Will update the kernel-doc.

> > +   if (unlikely(!access_ok(dst, usize)))
> > +   return -EFAULT;
> > +
> > +   /* Deal with trailing bytes. */
> > +   if (usize < ksize) {
> > +   if (memchr_inv(src + size, 0, rest))
> > +   return -EFBIG;
> > +   } else if (usize > ksize) {
> > +   if (__memzero_user(dst + size, rest))
> > +   return -EFAULT;
> > +   }
> > +   /* Copy the interoperable parts of the struct. */
> > +   if (__copy_to_user(dst, src, size))
> > +   return -EFAULT;
> > +   return 0;
> > +}
> > +EXPORT_SYMBOL(copy_struct_to_user);
> > +
> > +/**
> > + * copy_struct_from_user: copy a struct from user space
> > + * @dst:   Destination address, in kernel space. This buffer must be @ksize
> > + * bytes long.
> > + * @ksize: Size of @dst struct.
> > + * @src:   Source address, in user space.
> > + * @usize: (Alleged) size of @src struct.
> > + *
> > + * Copies a struct from user space to kernel space, in a way that 
> > guarantees
> > + * backwards-compatibility for struct syscall arguments (as long as future
> > + * struct extensions are made such that all new fields are *appended* to 
> > the
> > + * old struct, and zeroed-out new fields have the same meaning as the old
> > + * struct).
> > + *
> > + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> > space.
> > + * The recommended usage is something like the following:
> > + *
> > + *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, 
> > usize)
> > + *   {
> > + *  int err;
> > + *  struct foo karg = {};
> > + *
> > + *  err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> > + *  if (err)
> > + *return err;
> > + *
> > + *  // ...
> > + *   }
> > + *
> > + * There are three cases to consider:
> > + *  * If @usize == @ksize, then it's copied verbatim.
> > + *  * If @usize < @ksize, then the user space has passed an old struct to a
> > + *newer kernel. The rest of the trailing bytes in @dst (@ksize - 
> > @usize)
> > + * 

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Anshuman Khandual


On 09/05/2019 01:46 AM, Gerald Schaefer wrote:
> On Tue,  3 Sep 2019 13:31:46 +0530
> Anshuman Khandual  wrote:
> 
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> The test page table and the memory pages backing its entries at various
>> levels are all allocated from system memory with the required alignments.
>> If memory pages with the required size and alignment cannot be allocated,
>> the dependent individual tests are skipped.
> 
> This looks very useful, thanks. Of course, s390 is quite special and does
> not work nicely with this patch (yet), mostly because of our dynamic page
> table levels/folding. Still need to figure out what can be fixed in the arch

Hmm.

> code and what would need to be changed in the test module. See below for some
> generic comments/questions.

Sure.

> 
> At least one real bug in the s390 code was already revealed by this, which
> is very nice. In pmd/pud_bad(), we also check large pmds/puds for sanity,
> instead of reporting them as bad, which is apparently not what is expected.

Hmm, so it has already started being useful :)

> 
> [...]
>> +/*
>> + * Basic operations
>> + *
>> + * mkold(entry) = An old and not a young entry
>> + * mkyoung(entry)   = A young and not an old entry
>> + * mkdirty(entry)   = A dirty and not a clean entry
>> + * mkclean(entry)   = A clean and not a dirty entry
>> + * mkwrite(entry)   = A write and not a write protected entry
>> + * wrprotect(entry) = A write protected and not a write entry
>> + * pxx_bad(entry)   = A mapped and non-table entry
>> + * pxx_same(entry1, entry2) = Both entries hold the exact same value
>> + */
>> +#define VADDR_TEST  (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
> 
> Why is P4D_SIZE missing in the VADDR_TEST calculation?

This was a random possible virtual address to generate a representative
page table structure for the test. As there is a default (PGDIR_SIZE) for
P4D_SIZE on platforms which really do not have a P4D level, it should be okay
to add P4D_SIZE in the above calculation.

> 
> [...]
>> +
>> +#if !defined(__PAGETABLE_PMD_FOLDED) && !defined(__ARCH_HAS_4LEVEL_HACK)
>> +static void pud_clear_tests(pud_t *pudp)
>> +{
>> +memset(pudp, RANDOM_NZVALUE, sizeof(pud_t));
>> +pud_clear(pudp);
>> +WARN_ON(!pud_none(READ_ONCE(*pudp)));
>> +}
> 
> For pgd/p4d/pud_clear(), we only clear if the page table level is present
> and not folded. The memset() here overwrites the table type bits, so
> pud_clear() will not clear anything on s390 and the pud_none() check will
> fail.
> Would it be possible to OR a (larger) random value into the table, so that
> the lower 12 bits would be preserved?

So the suggestion is that instead of doing memset() on the entry with
RANDOM_NZVALUE, it should OR in a large random value while preserving the
lower 12 bits. This should still do the trick for other platforms, which
just need a non-zero value. So on s390, the lower 12 bits of the page table
entry already hold a valid value when entering this function, which would
make sure that pud_clear() really does clear the entry?
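
Something like the following, perhaps (a sketch; RANDOM_ORVALUE is a
hypothetical constant with the low 12 bits clear, so the arch's table
type bits survive):

	static void pud_clear_tests(pud_t *pudp)
	{
		pud_t pud = READ_ONCE(*pudp);

		/* perturb the high bits, keep the low 12 (type) bits */
		pud = __pud(pud_val(pud) | RANDOM_ORVALUE);
		WRITE_ONCE(*pudp, pud);
		pud_clear(pudp);
		WARN_ON(!pud_none(READ_ONCE(*pudp)));
	}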

> 
>> +
>> +static void pud_populate_tests(struct mm_struct *mm, pud_t *pudp, pmd_t 
>> *pmdp)
>> +{
>> +/*
>> + * This entry points to next level page table page.
>> + * Hence this must not qualify as pud_bad().
>> + */
>> +pmd_clear(pmdp);
>> +pud_clear(pudp);
>> +pud_populate(mm, pudp, pmdp);
>> +WARN_ON(pud_bad(READ_ONCE(*pudp)));
>> +}
> 
> This will populate the pud with a pmd pointer that does not point to the
> beginning of the pmd table, but to the second entry (because of how
> VADDR_TEST is constructed). This will result in failing pud_bad() check
> on s390. Not sure why/how it works on other archs, but would it be possible
> to align pmdp down to the beginning of the pmd table (and similar for the
> other pxd_populate_tests)?

Right, that was a problem. Will fix it by using the saved entries used for
freeing the page table pages at the end, which always point to the beginning
of a page table page.

> 
> [...]
>> +
>> +p4d_free(mm, saved_p4dp);
>> +pud_free(mm, saved_pudp);
>> +pmd_free(mm, saved_pmdp);
>> +pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));
> 
> pgtable_t is arch-specific, and on s390 it is not a struct page pointer,
> but a pte pointer. So this will go wrong, also on all other archs (if any)
> where pgtable_t is not struct page.
> Would it be possible to use pte_free_kernel() instead, and just pass
> saved_ptep directly?

It makes sense, will change.
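
i.e., something along these lines:

	-	pte_free(mm, (pgtable_t) virt_to_page(saved_ptep));
	+	pte_free_kernel(mm, saved_ptep);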


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to, from}_user helpers

2019-09-05 Thread Andreas Schwab
On Sep 05 2019, Aleksa Sarai  wrote:

> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index ..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */
> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};

Perhaps make that static?
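
i.e. something like this, so the zero block is not rebuilt on the stack
for every call:

	static const char zeros[BUFFER_SIZE];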

Andreas.

-- 
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 7578 EB47 D4E5 4D69 2510  2552 DF73 E780 A9DA AEC1
"And now for something completely different."


Re: [PATCH 0/2] Fix IOMMU setup for hotplugged devices on pseries

2019-09-05 Thread Alexey Kardashevskiy




On 05/09/2019 14:22, Shawn Anastasio wrote:

On pseries QEMU guests, IOMMU setup for hotplugged PCI devices is currently
broken for all but the first device on a given bus. The culprit is an ordering
issue in the pseries hotplug path (via pci_rescan_bus()) which results in IOMMU
group assignment occurring before device registration in sysfs. This triggers
the following check in arch/powerpc/kernel/iommu.c:

/*
  * The sysfs entries should be populated before
  * binding IOMMU group. If sysfs entries isn't
  * ready, we simply bail.
  */
if (!device_is_registered(dev))
return -ENOENT;

This fails for hotplugged devices since the pcibios_add_device() call in the
pseries hotplug path (in pci_device_add()) occurs before device_add().
Since the IOMMU groups are set up in pcibios_add_device(), this means that a
sysfs entry will not yet be present and it will fail.


I just tried hotplugging 3 virtio-net devices into a guest system with a
v5.2 kernel and it seems to work (i.e. BARs mapped, a driver is bound):



root@le-dbg:~# lspci -v | egrep -i '(virtio|Memory)'
00:00.0 Ethernet controller: Red Hat, Inc Virtio network device
Memory at 20008004 (32-bit, non-prefetchable) [size=4K]
Memory at 2100 (64-bit, prefetchable) [size=16K]
Kernel driver in use: virtio-pci
00:01.0 Ethernet controller: Red Hat, Inc Virtio network device
Memory at 200080041000 (32-bit, non-prefetchable) [size=4K]
Memory at 21004000 (64-bit, prefetchable) [size=16K]
Kernel driver in use: virtio-pci
00:02.0 Ethernet controller: Red Hat, Inc Virtio network device
Memory at 200080042000 (32-bit, non-prefetchable) [size=4K]
Memory at 21008000 (64-bit, prefetchable) [size=16K]
Kernel driver in use: virtio-pci

Can you explain in detail what exactly you are doing, what is failing,
and which qemu/guest kernel/guest distro you are using? Thanks,





There is a special case that allows the first hotplugged device on a bus to
succeed, though. The powerpc pcibios_add_device() implementation will skip
initializing the device if bus setup is not yet complete.
Later, the pci core will call pcibios_fixup_bus() which will perform setup
for the first (and only) device on the bus and since it has already been
registered in sysfs, the IOMMU setup will succeed.

My current solution is to introduce another pcibios function, pcibios_fixup_dev(),
which is called after device_add() in pci_device_add(). Then in the powerpc code,
pcibios_setup_device() is moved from pcibios_add_device() to this new function,
which runs after sysfs registration, so IOMMU assignment will succeed.
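
A minimal sketch of the proposed ordering (the hook name is from this
series; the weak default and the surrounding detail are illustrative,
not the actual patch):

	/* drivers/pci/probe.c */
	void __weak pcibios_fixup_dev(struct pci_dev *dev)
	{
		/* default: nothing; architectures may override */
	}

	void pci_device_add(struct pci_dev *dev, struct pci_bus *bus)
	{
		/* ... early device setup ... */
		pcibios_add_device(dev);	/* too early for IOMMU setup */
		/* ... */
		device_add(&dev->dev);		/* sysfs entry now exists */
		pcibios_fixup_dev(dev);		/* IOMMU group setup can succeed */
	}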

I added a new pcibios function rather than moving the pcibios_add_device() call
to after the device_add() call in pci_device_add() because there are other
architectures that use it and it wasn't immediately clear to me whether moving
it would break them.

If anybody has more insight or a better way to fix this, please let me know.

Shawn Anastasio (2):
   PCI: Introduce pcibios_fixup_dev()
   powerpc/pci: Fix IOMMU setup for hotplugged devices on pseries

  arch/powerpc/kernel/pci-common.c | 13 ++---
  drivers/pci/probe.c  | 14 ++
  include/linux/pci.h  |  1 +
  3 files changed, 21 insertions(+), 7 deletions(-)



--
Alexey


Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Kirill A. Shutemov
On Thu, Sep 05, 2019 at 01:48:27PM +0530, Anshuman Khandual wrote:
> >> +#define VADDR_TEST  (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
> > 
> > What is special about this address? How do you know if it is not occupied
> > yet?
> 
> We are creating the page table from scratch after allocating an mm_struct
> for a given random virtual address 'VADDR_TEST'. Hence nothing is occupied
> just yet. There is nothing special about this address; it just tries to
> ensure that the page table entries are created with some offset from the
> beginning of the respective page table page at all levels. The idea is to
> have a more representative page table structure for the test.

Why P4D_SIZE is missing?

Are you sure it will not land in the kernel address space on any arch?

I think a more robust way to deal with this would be to use
get_unmapped_area() instead of a fixed address.
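
For example (a sketch; this assumes the test runs with a task mm whose
mmap layout is set up, since get_unmapped_area() works on current->mm):

	unsigned long vaddr = get_unmapped_area(NULL, 0, PAGE_SIZE, 0, 0);

	if (IS_ERR_VALUE(vaddr))
		return;		/* skip the test: no usable address */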

> This makes sense for runtime cases but there is a problem here.
> 
> On arm64, pgd_populate() which takes (pud_t *) as last argument instead of
> (p4d_t *) will fail to build when not wrapped in !__PAGETABLE_P4D_FOLDED
> on certain configurations.
> 
> ./arch/arm64/include/asm/pgalloc.h:81:75: note:
> expected ‘pud_t *’ {aka ‘struct  *’}
> but argument is of type ‘pgd_t *’ {aka ‘struct  *’}
> static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgdp, pud_t 
> *pudp)
>~~~^~~~
> Wondering if this is something to be fixed on arm64 or it's a more general
> problem. Will look into this further.

I think you need to wrap this in #ifndef __ARCH_HAS_5LEVEL_HACK.
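
i.e., something like this (a sketch, mirroring how the pmd-folded case
is already guarded):

	#ifndef __ARCH_HAS_5LEVEL_HACK
		pgd_populate_tests(mm, pgdp, saved_p4dp);
	#endif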

> >> +  pmd_populate_tests(mm, pmdp, (pgtable_t) page);
> > 
> > This is not correct for architectures that defines pgtable_t as pte_t
> > pointer, not struct page pointer.
> 
> Right, a grep on the source confirms that.
> 
> These platforms define pgtable_t as struct page *
> 
> arch/alpha/include/asm/page.h:typedef struct page *pgtable_t;
> arch/arm/include/asm/page.h:typedef struct page *pgtable_t;
> arch/arm64/include/asm/page.h:typedef struct page *pgtable_t;
> arch/csky/include/asm/page.h:typedef struct page *pgtable_t;
> arch/hexagon/include/asm/page.h:typedef struct page *pgtable_t;
> arch/ia64/include/asm/page.h:  typedef struct page *pgtable_t;
> arch/ia64/include/asm/page.h:typedef struct page *pgtable_t;
> arch/m68k/include/asm/page.h:typedef struct page *pgtable_t;
> arch/microblaze/include/asm/page.h:typedef struct page *pgtable_t;
> arch/mips/include/asm/page.h:typedef struct page *pgtable_t;
> arch/nds32/include/asm/page.h:typedef struct page *pgtable_t;
> arch/nios2/include/asm/page.h:typedef struct page *pgtable_t;
> arch/openrisc/include/asm/page.h:typedef struct page *pgtable_t;
> arch/parisc/include/asm/page.h:typedef struct page *pgtable_t;
> arch/riscv/include/asm/page.h:typedef struct page *pgtable_t;
> arch/sh/include/asm/page.h:typedef struct page *pgtable_t;
> arch/sparc/include/asm/page_32.h:typedef struct page *pgtable_t;
> arch/um/include/asm/page.h:typedef struct page *pgtable_t;
> arch/unicore32/include/asm/page.h:typedef struct page *pgtable_t;
> arch/x86/include/asm/pgtable_types.h:typedef struct page *pgtable_t;
> arch/xtensa/include/asm/page.h:typedef struct page *pgtable_t;
> 
> These platforms define pgtable_t as pte_t *
> 
> arch/arc/include/asm/page.h:typedef pte_t * pgtable_t;
> arch/powerpc/include/asm/mmu.h:typedef pte_t *pgtable_t;
> arch/s390/include/asm/page.h:typedef pte_t *pgtable_t;
> arch/sparc/include/asm/page_64.h:typedef pte_t *pgtable_t;
> 
> Should we need have two pmd_populate_tests() definitions with
> different arguments (struct page pointer or pte_t pointer) and then
> call either one after detecting the given platform ?

Use pte_alloc_one() instead of alloc_mapped_page() to allocate the page
table.

> >> +  pud_populate_tests(mm, pudp, pmdp);
> >> +  p4d_populate_tests(mm, p4dp, pudp);
> >> +  pgd_populate_tests(mm, pgdp, p4dp);
> > 
> > This is wrong. All p?dp points to the second entry in page table entry.
> > This is not valid pointer for page table and triggers p?d_bad() on x86.
> 
> Yeah, these are second entries because of the way we create the page table.
> But I guess it's applicable only to the second argument in all the above
> cases, because the first argument can be any valid entry at the previous
> page table level.

Yes:

@@ -397,9 +396,9 @@ static int __init arch_pgtable_tests_init(void)
pgd_clear_tests(pgdp);
 
pmd_populate_tests(mm, pmdp, (pgtable_t) page);
-   pud_populate_tests(mm, pudp, pmdp);
-   p4d_populate_tests(mm, p4dp, pudp);
-   pgd_populate_tests(mm, pgdp, p4dp);
+   pud_populate_tests(mm, pudp, saved_pmdp);
+   p4d_populate_tests(mm, p4dp, saved_pudp);
+   pgd_populate_tests(mm, pgdp, saved_p4dp);
 
p4d_free(mm, saved_p4dp);
pud_free(mm, saved_pudp);

-- 
 Kirill A. Shutemov


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Rasmus Villemoes
On 04/09/2019 22.19, Aleksa Sarai wrote:
> A common pattern for syscall extensions is increasing the size of a
> struct passed from userspace, such that the zero-value of the new fields
> results in the old kernel behaviour (allowing for a mix of userspace and
> kernel vintages to operate on one another in most cases). This is done
> in both directions -- hence two helpers -- though it's more common to
> have to copy user space structs into kernel space.
> 
> Previously there was no common lib/ function that implemented
> the necessary extension-checking semantics (and different syscalls
> implemented them slightly differently or incompletely[1]). A future
> patch replaces all of the common uses of this pattern to use the new
> copy_struct_{to,from}_user() helpers.
> 
> [1]: For instance {sched_setattr,perf_event_open,clone3}(2) all do
>  similar checks to copy_struct_from_user() while rt_sigprocmask(2)
>  always rejects differently-sized struct arguments.
> 
> Suggested-by: Rasmus Villemoes 
> Signed-off-by: Aleksa Sarai 
> ---
> diff --git a/lib/struct_user.c b/lib/struct_user.c
> new file mode 100644
> index ..7301ab1bbe98
> --- /dev/null
> +++ b/lib/struct_user.c
> @@ -0,0 +1,182 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) 2019 SUSE LLC
> + * Copyright (C) 2019 Aleksa Sarai 
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define BUFFER_SIZE 64
> +
> +/*
> + * "memset(p, 0, size)" but for user space buffers. Caller must have already
> + * checked access_ok(p, size).
> + */

Isn't this __clear_user() exactly (perhaps except for the return value)?
Perhaps not every arch has that?

> +static int __memzero_user(void __user *p, size_t s)
> +{
> + const char zeros[BUFFER_SIZE] = {};
> + while (s > 0) {
> + size_t n = min(s, sizeof(zeros));
> +
> + if (__copy_to_user(p, zeros, n))
> + return -EFAULT;
> +
> + p += n;
> + s -= n;
> + }
> + return 0;
> +}
> +
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst:   Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src:   Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Returns (in all cases, some data may have been copied):
> + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in 
> @src.
> + *  * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);

Eh, I'd avoid abs() here due to the funkiness of the implicit type
conversions - ksize-usize has type size_t, then that's coerced to an int
(or a long maybe?), the abs is applied which return an int/long (or
unsigned versions?). Something like "rest = max(ksize, usize) - size;"
is more obviously correct and doesn't fall into any
narrowing/widening/sign extending traps.

> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;

Please don't. That is a restriction on all future extensions - once a
kernel is shipped with a syscall using this helper with that arbitrary
restriction in place, that syscall is forever prevented from extending
its arg struct beyond PAGE_SIZE (which is arch-dependent anyway). Sure,
it's hard to imagine, but who'd have thought 32 O_* or CLONE_* bits
weren't enough for everybody?

This is only for future compatibility, and if someone runs an app
compiled against 7.3 headers on a 5.4 kernel, they probably don't care
about performance, but they would like their app to run.

[If we ever create such a large ABI struct that doesn't fit on stack,
we'd have to extend our API a little to create a dup_struct_from_user()
that does the kmalloc() for us and then calls copy_struct_from_user() -
but we might want that long before we hit PAGE_SIZE structs].

> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;

I think that could simply be __clear_user().

> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;

I think I understand why you put this last instead of handling the
buffer in the "natural" order. However,
I'm wondering whether we should actually do this copy before checking
that the extra kernel bytes are 0 - the user will still be told that
there was some extra information via the -EFBIG/-E2BIG return, but maybe
in some cases the part he understands is good enough. But I also guess
we have to look to existing users to see 

Re: [PATCH 1/1] mm/pgtable/debug: Add test validating architecture page table helpers

2019-09-05 Thread Anshuman Khandual
On 09/04/2019 07:49 PM, Kirill A. Shutemov wrote:
> On Tue, Sep 03, 2019 at 01:31:46PM +0530, Anshuman Khandual wrote:
>> This adds a test module which will validate architecture page table helpers
>> and accessors regarding compliance with generic MM semantics expectations.
>> This will help various architectures in validating changes to the existing
>> page table helpers or addition of new ones.
>>
>> The test page table and the memory pages backing its entries at various
>> levels are all allocated from system memory with the required alignments.
>> If memory pages with the required size and alignment cannot be allocated,
>> the dependent individual tests are skipped.
> 
> See my comments below.
> 
>>
>> Cc: Andrew Morton 
>> Cc: Vlastimil Babka 
>> Cc: Greg Kroah-Hartman 
>> Cc: Thomas Gleixner 
>> Cc: Mike Rapoport 
>> Cc: Jason Gunthorpe 
>> Cc: Dan Williams 
>> Cc: Peter Zijlstra 
>> Cc: Michal Hocko 
>> Cc: Mark Rutland 
>> Cc: Mark Brown 
>> Cc: Steven Price 
>> Cc: Ard Biesheuvel 
>> Cc: Masahiro Yamada 
>> Cc: Kees Cook 
>> Cc: Tetsuo Handa 
>> Cc: Matthew Wilcox 
>> Cc: Sri Krishna chowdary 
>> Cc: Dave Hansen 
>> Cc: Russell King - ARM Linux 
>> Cc: Michael Ellerman 
>> Cc: Paul Mackerras 
>> Cc: Martin Schwidefsky 
>> Cc: Heiko Carstens 
>> Cc: "David S. Miller" 
>> Cc: Vineet Gupta 
>> Cc: James Hogan 
>> Cc: Paul Burton 
>> Cc: Ralf Baechle 
>> Cc: linux-snps-...@lists.infradead.org
>> Cc: linux-m...@vger.kernel.org
>> Cc: linux-arm-ker...@lists.infradead.org
>> Cc: linux-i...@vger.kernel.org
>> Cc: linuxppc-dev@lists.ozlabs.org
>> Cc: linux-s...@vger.kernel.org
>> Cc: linux...@vger.kernel.org
>> Cc: sparcli...@vger.kernel.org
>> Cc: x...@kernel.org
>> Cc: linux-ker...@vger.kernel.org
>>
>> Suggested-by: Catalin Marinas 
>> Signed-off-by: Anshuman Khandual 
>> ---
>>  mm/Kconfig.debug   |  14 ++
>>  mm/Makefile|   1 +
>>  mm/arch_pgtable_test.c | 425 +
>>  3 files changed, 440 insertions(+)
>>  create mode 100644 mm/arch_pgtable_test.c
>>
>> diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug
>> index 327b3ebf23bf..ce9c397f7b07 100644
>> --- a/mm/Kconfig.debug
>> +++ b/mm/Kconfig.debug
>> @@ -117,3 +117,17 @@ config DEBUG_RODATA_TEST
>>  depends on STRICT_KERNEL_RWX
>>  ---help---
>>This option enables a testcase for the setting rodata read-only.
>> +
>> +config DEBUG_ARCH_PGTABLE_TEST
>> +bool "Test arch page table helpers for semantics compliance"
>> +depends on MMU
>> +depends on DEBUG_KERNEL
>> +help
>> +  This option provides a kernel module which can be used to test
>> +  architecture page table helper functions on various platforms,
>> +  verifying that they comply with expected generic MM semantics.
>> +  This will help architecture code make sure that any changes or
>> +  new additions to these helpers still conform to the expected
>> +  generic MM semantics.
>> +
>> +  If unsure, say N.
>> diff --git a/mm/Makefile b/mm/Makefile
>> index d996846697ef..bb572c5aa8c5 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -86,6 +86,7 @@ obj-$(CONFIG_HWPOISON_INJECT) += hwpoison-inject.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK) += kmemleak.o
>>  obj-$(CONFIG_DEBUG_KMEMLEAK_TEST) += kmemleak-test.o
>>  obj-$(CONFIG_DEBUG_RODATA_TEST) += rodata_test.o
>> +obj-$(CONFIG_DEBUG_ARCH_PGTABLE_TEST) += arch_pgtable_test.o
>>  obj-$(CONFIG_PAGE_OWNER) += page_owner.o
>>  obj-$(CONFIG_CLEANCACHE) += cleancache.o
>>  obj-$(CONFIG_MEMORY_ISOLATION) += page_isolation.o
>> diff --git a/mm/arch_pgtable_test.c b/mm/arch_pgtable_test.c
>> new file mode 100644
>> index ..f15be8a73723
>> --- /dev/null
>> +++ b/mm/arch_pgtable_test.c
>> @@ -0,0 +1,425 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * This kernel module validates architecture page table helpers &
>> + * accessors and helps in verifying their continued compliance with
>> + * generic MM semantics.
>> + *
>> + * Copyright (C) 2019 ARM Ltd.
>> + *
>> + * Author: Anshuman Khandual 
>> + */
>> +#define pr_fmt(fmt) "arch_pgtable_test: %s " fmt, __func__
>> +
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +#include 
>> +
>> +/*
>> + * Basic operations
>> + *
>> + * mkold(entry) = An old and not a young entry
>> + * mkyoung(entry)   = A young and not an old entry
>> + * mkdirty(entry)   = A dirty and not a clean entry
>> + * mkclean(entry)   = A clean and not a dirty entry
>> + * mkwrite(entry)   = A write and not a write protected entry
>> + * wrprotect(entry) = A write protected and not a write entry
>> + * pxx_bad(entry)   = A mapped and non-table entry
>> + * pxx_same(entry1, entry2) = Both entries hold the exact same value
>> + */
>> +#define VADDR_TEST  (PGDIR_SIZE + PUD_SIZE + PMD_SIZE + PAGE_SIZE)
> 

Re: [PATCH] powerpc: Perform a bounds check in arch_add_memory

2019-09-05 Thread David Hildenbrand
On 04.09.19 07:25, Alastair D'Silva wrote:
> On Mon, 2019-09-02 at 09:28 +0200, David Hildenbrand wrote:
>> On 02.09.19 01:54, Alastair D'Silva wrote:
>>> On Tue, 2019-08-27 at 09:13 +0200, David Hildenbrand wrote:
 On 27.08.19 08:39, Alastair D'Silva wrote:
> On Tue, 2019-08-27 at 08:28 +0200, Michal Hocko wrote:
>> On Tue 27-08-19 15:20:46, Alastair D'Silva wrote:
>>> From: Alastair D'Silva 
>>>
>>> It is possible for firmware to allocate memory ranges
>>> outside
>>> the range of physical memory that we support
>>> (MAX_PHYSMEM_BITS).
>>
>> Doesn't that count as a FW bug? Do you have any evidence of
>> that
>> in
>> the
>> field? Just wondering...
>>
>
> Not outside our lab, but OpenCAPI attached LPC memory is
> assigned
> addresses based on the slot/NPU it is connected to. These
> addresses
> prior to:
> 4ffe713b7587 ("powerpc/mm: Increase the max addressable memory
> to
> 2PB")
> were inaccessible and resulted in bogus sections - see our
> discussion
> on 'mm: Trigger bug on if a section is not found in
> __section_nr'.
> Doing this check here was your suggestion :)
>
> It's entirely possible that a similar problem will occur in the
> future,
> and it's cheap to guard against, which is why I've added this.
>

 If you keep it here, I guess this should be wrapped by a
 WARN_ON_ONCE().

 If we move it to common code (e.g., __add_pages() or
 add_memory()),
 then
 probably not. I can see that s390x allows to configure
 MAX_PHYSMEM_BITS,
 so the check could actually make sense.

>>>
>>> I couldn't see a nice platform indepedent way to determine the
>>> allowable address range, but if there is, then I'll move this to
>>> the
>>> generic code instead.
>>>
>>
>> At least on the !ZONE_DEVICE path we have
>>
>> __add_memory() -> register_memory_resource() ...
>>
>> return ERR_PTR(-E2BIG);
>>
>>
>> I was thinking about something like
>>
>> int add_pages()
>> {
>>  if ((start + size - 1) >> MAX_PHYSMEM_BITS)
>>  return -E2BIG;  
>>
>>  return arch_add_memory(...)
>> }
>>
>> And switching users of arch_add_memory() to add_pages(). However, x86
>> already has an add_pages() function, so that would need some more
>> thought.
>>
>> Maybe simply renaming the existing add_pages() to arch_add_pages().
>>
>> add_pages(): Create virtual mapping
>> __add_pages(): Don't create virtual mapping
>>
>> arch_add_memory(): Arch backend for add_pages()
>> arch_add_pages(): Arch backend for __add_pages()
>>
>> It would be even more consistent if we would have arch_add_pages()
>> vs.
>> __arch_add_pages().
> 
> Looking a bit further, I think a good course of action would be to add
> the check to memory_hotplug.c:check_hotplug_memory_range().
> 
> This would be the least invasive, and could check both
> MAX_POSSIBLE_PHYSMEM_BITS and MAX_PHYSMEM_BITS.

You won't be able to catch the memremap path that way, just saying. But
at least it would be an easy change.
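
For reference, a sketch of what that could look like
(check_hotplug_memory_range() is the existing helper in
mm/memory_hotplug.c; the bounds test itself is the proposed addition):

	static int check_hotplug_memory_range(u64 start, u64 size)
	{
		/* reject ranges the architecture can never address */
		if ((start + size - 1) >> MAX_PHYSMEM_BITS)
			return -E2BIG;

		/* ... existing alignment and size checks ... */
		return 0;
	}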

> 
> With that in mind, we can drop this patch.
> 


-- 

Thanks,

David / dhildenb


Re: [PATCH v12 01/12] lib: introduce copy_struct_{to,from}_user helpers

2019-09-05 Thread Peter Zijlstra
On Thu, Sep 05, 2019 at 06:19:22AM +1000, Aleksa Sarai wrote:
> +/**
> + * copy_struct_to_user: copy a struct to user space
> + * @dst:   Destination address, in user space.
> + * @usize: Size of @dst struct.
> + * @src:   Source address, in kernel space.
> + * @ksize: Size of @src struct.
> + *
> + * Copies a struct from kernel space to user space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> space.
> + * The recommended usage is something like the following:
> + *
> + *   SYSCALL_DEFINE2(foobar, struct foo __user *, uarg, size_t, usize)
> + *   {
> + *  int err;
> + *  struct foo karg = {};
> + *
> + *  // do something with karg
> + *
> + *  err = copy_struct_to_user(uarg, usize, &karg, sizeof(karg));
> + *  if (err)
> + *return err;
> + *
> + *  // ...
> + *   }
> + *
> + * There are three cases to consider:
> + *  * If @usize == @ksize, then it's copied verbatim.
> + *  * If @usize < @ksize, then kernel space is "returning" a newer struct to 
> an
> + *older user space. In order to avoid user space getting incomplete
> + *information (new fields might be important), all trailing bytes in @src
> + *(@ksize - @usize) must be zerored

s/zerored/zero/, right?

>  , otherwise -EFBIG is returned.

'Funny' that, copy_struct_from_user() below seems to use E2BIG.

> + *  * If @usize > @ksize, then the kernel is "returning" an older struct to a
> + *newer user space. The trailing bytes in @dst (@usize - @ksize) will be
> + *zero-filled.
> + *
> + * Returns (in all cases, some data may have been copied):
> + *  * -EFBIG:  (@usize < @ksize) and there are non-zero trailing bytes in 
> @src.
> + *  * -EFAULT: access to user space failed.
> + */
> +int copy_struct_to_user(void __user *dst, size_t usize,
> + const void *src, size_t ksize)
> +{
> + size_t size = min(ksize, usize);
> + size_t rest = abs(ksize - usize);
> +
> + if (unlikely(usize > PAGE_SIZE))
> + return -EFAULT;

Not documented above. Implementation consistent with *from*, but see
below.

> + if (unlikely(!access_ok(dst, usize)))
> + return -EFAULT;
> +
> + /* Deal with trailing bytes. */
> + if (usize < ksize) {
> + if (memchr_inv(src + size, 0, rest))
> + return -EFBIG;
> + } else if (usize > ksize) {
> + if (__memzero_user(dst + size, rest))
> + return -EFAULT;
> + }
> + /* Copy the interoperable parts of the struct. */
> + if (__copy_to_user(dst, src, size))
> + return -EFAULT;
> + return 0;
> +}
> +EXPORT_SYMBOL(copy_struct_to_user);
> +
> +/**
> + * copy_struct_from_user: copy a struct from user space
> + * @dst:   Destination address, in kernel space. This buffer must be @ksize
> + * bytes long.
> + * @ksize: Size of @dst struct.
> + * @src:   Source address, in user space.
> + * @usize: (Alleged) size of @src struct.
> + *
> + * Copies a struct from user space to kernel space, in a way that guarantees
> + * backwards-compatibility for struct syscall arguments (as long as future
> + * struct extensions are made such that all new fields are *appended* to the
> + * old struct, and zeroed-out new fields have the same meaning as the old
> + * struct).
> + *
> + * @ksize is just sizeof(*dst), and @usize should've been passed by user 
> space.
> + * The recommended usage is something like the following:
> + *
> + *   SYSCALL_DEFINE2(foobar, const struct foo __user *, uarg, size_t, usize)
> + *   {
> + *  int err;
> + *  struct foo karg = {};
> + *
> + *  err = copy_struct_from_user(&karg, sizeof(karg), uarg, size);
> + *  if (err)
> + *return err;
> + *
> + *  // ...
> + *   }
> + *
> + * There are three cases to consider:
> + *  * If @usize == @ksize, then it's copied verbatim.
> + *  * If @usize < @ksize, then the user space has passed an old struct to a
> + *newer kernel. The rest of the trailing bytes in @dst (@ksize - @usize)
> + *are to be zero-filled.
> + *  * If @usize > @ksize, then the user space has passed a new struct to an
> + *older kernel. The trailing bytes unknown to the kernel (@usize - 
> @ksize)
> + *are checked to ensure they are zeroed, otherwise -E2BIG is returned.
> + *
> + * Returns (in all cases, some data may have been copied):
> + *  * -E2BIG:  (@usize > @ksize) and there are non-zero trailing bytes in 
> @src.
> + *  * -E2BIG:  @usize is "too big" (at time of writing, >PAGE_SIZE).
> + *  * -EFAULT: access to user space failed.
> + */
> +int copy_struct_from_user(void *dst, size_t ksize,
> +   const 

Re: [PATCH v5 14/31] powernv/fadump: define register/un-register callback functions

2019-09-05 Thread Michael Ellerman
Hari Bathini  writes:
> diff --git a/arch/powerpc/platforms/powernv/opal-fadump.c 
> b/arch/powerpc/platforms/powernv/opal-fadump.c
> index e466f7e..91fb909 100644
> --- a/arch/powerpc/platforms/powernv/opal-fadump.c
> +++ b/arch/powerpc/platforms/powernv/opal-fadump.c
> @@ -52,6 +68,8 @@ static ulong opal_fadump_init_mem_struct(struct fw_dump 
> *fadump_conf)
>   opal_fdm->fadumphdr_addr = (opal_fdm->rgn[0].dest +
>   fadump_conf->boot_memory_size);
>  
> + opal_fadump_update_config(fadump_conf, opal_fdm);
> +
>   return addr;
>  }
>  
> @@ -97,12 +115,69 @@ static int opal_fadump_setup_metadata(struct fw_dump 
> *fadump_conf)
>  
>  static int opal_fadump_register(struct fw_dump *fadump_conf)
>  {
> - return -EIO;
> + int i, err = -EIO;
> + s64 rc;

Some compiler versions are warning about this being used uninitialised:

arch/powerpc/platforms/powernv/opal-fadump.c:316:3: error: 'rc' may be used 
uninitialized in this function [-Werror=uninitialized]

http://kisskb.ellerman.id.au/kisskb/buildresult/13943984/

Which does seem like a legitimate warning.
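
The usual shape of the bug, for illustration (hypothetical control flow;
the helper name below is a placeholder, not from the patch):

	s64 rc = OPAL_PARAMETER;	/* initialise: the loop may not run */

	for (i = 0; i < nr_regions; i++) {
		rc = register_one_region(i);	/* stands in for the OPAL call */
		if (rc != OPAL_SUCCESS)
			break;
	}
	if (rc != OPAL_SUCCESS)		/* read uninitialised if nr_regions == 0 */
		return -EIO;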

cheers