Re: [PATCH] crypto: vmx: fix incorrect kernel-doc comment syntax in files

2021-03-25 Thread Daniel Axtens
Hi Aditya,

Thanks for your patch!

> The opening comment mark '/**' is used for highlighting the beginning of
> kernel-doc comments.
> There are certain files in drivers/crypto/vmx, which follow this syntax,
> but the content inside does not comply with kernel-doc.
> Such lines were probably not meant for kernel-doc parsing, but are parsed
> due to the presence of kernel-doc-like comment syntax (i.e., '/**'), which
> causes unexpected warnings from kernel-doc.
>
> E.g., presence of kernel-doc like comment in the header line for
> drivers/crypto/vmx/vmx.c causes this warning by kernel-doc:
>
> "warning: expecting prototype for Routines supporting VMX instructions on the 
> Power 8(). Prototype was for p8_init() instead"

checkpatch (scripts/checkpatch.pl --strict -g HEAD) complains about this line:
WARNING: Possible unwrapped commit description (prefer a maximum 75 chars per line)
but checkpatch should be ignored here, as you did the right thing by not
breaking an error message across multiple lines.

> Similarly for other files too.
>
> Provide a simple fix by replacing such occurrences with general comment
> format, i.e. '/*', to prevent kernel-doc from parsing it.

This makes sense.
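
For anyone following along who hasn't bumped into this before, a minimal
sketch of the distinction (hypothetical function, not from the patch):

/**
 * example_init - what kernel-doc expects after a double-star opener
 *
 * Return: 0 on success, negative errno on failure.
 */
static int example_init(void)
{
        return 0;
}

/*
 * A plain file-level banner like the ones fixed below; because it does not
 * open with the kernel-doc marker, scripts/kernel-doc ignores it instead of
 * trying to match it to the next prototype.
 */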

Reviewed-by: Daniel Axtens 

Kind regards,
Daniel

>
> Signed-off-by: Aditya Srivastava 
> ---
> * Applies perfectly on next-20210319
>
>  drivers/crypto/vmx/aes.c | 2 +-
>  drivers/crypto/vmx/aes_cbc.c | 2 +-
>  drivers/crypto/vmx/aes_ctr.c | 2 +-
>  drivers/crypto/vmx/aes_xts.c | 2 +-
>  drivers/crypto/vmx/ghash.c   | 2 +-
>  drivers/crypto/vmx/vmx.c | 2 +-
>  6 files changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/crypto/vmx/aes.c b/drivers/crypto/vmx/aes.c
> index d05c02baebcf..ec06189fbf99 100644
> --- a/drivers/crypto/vmx/aes.c
> +++ b/drivers/crypto/vmx/aes.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/**
> +/*
>   * AES routines supporting VMX instructions on the Power 8
>   *
>   * Copyright (C) 2015 International Business Machines Inc.
> diff --git a/drivers/crypto/vmx/aes_cbc.c b/drivers/crypto/vmx/aes_cbc.c
> index d88084447f1c..ed0debc7acb5 100644
> --- a/drivers/crypto/vmx/aes_cbc.c
> +++ b/drivers/crypto/vmx/aes_cbc.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/**
> +/*
>   * AES CBC routines supporting VMX instructions on the Power 8
>   *
>   * Copyright (C) 2015 International Business Machines Inc.
> diff --git a/drivers/crypto/vmx/aes_ctr.c b/drivers/crypto/vmx/aes_ctr.c
> index 79ba062ee1c1..9a3da8cd62f3 100644
> --- a/drivers/crypto/vmx/aes_ctr.c
> +++ b/drivers/crypto/vmx/aes_ctr.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/**
> +/*
>   * AES CTR routines supporting VMX instructions on the Power 8
>   *
>   * Copyright (C) 2015 International Business Machines Inc.
> diff --git a/drivers/crypto/vmx/aes_xts.c b/drivers/crypto/vmx/aes_xts.c
> index 9fee1b1532a4..dabbccb41550 100644
> --- a/drivers/crypto/vmx/aes_xts.c
> +++ b/drivers/crypto/vmx/aes_xts.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/**
> +/*
>   * AES XTS routines supporting VMX In-core instructions on Power 8
>   *
>   * Copyright (C) 2015 International Business Machines Inc.
> diff --git a/drivers/crypto/vmx/ghash.c b/drivers/crypto/vmx/ghash.c
> index 14807ac2e3b9..5bc5710a6de0 100644
> --- a/drivers/crypto/vmx/ghash.c
> +++ b/drivers/crypto/vmx/ghash.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0
> -/**
> +/*
>   * GHASH routines supporting VMX instructions on the Power 8
>   *
>   * Copyright (C) 2015, 2019 International Business Machines Inc.
> diff --git a/drivers/crypto/vmx/vmx.c b/drivers/crypto/vmx/vmx.c
> index a40d08e75fc0..7eb713cc87c8 100644
> --- a/drivers/crypto/vmx/vmx.c
> +++ b/drivers/crypto/vmx/vmx.c
> @@ -1,5 +1,5 @@
>  // SPDX-License-Identifier: GPL-2.0-only
> -/**
> +/*
>   * Routines supporting VMX instructions on the Power 8
>   *
>   * Copyright (C) 2015 International Business Machines Inc.
> -- 
> 2.17.1


Re: [PATCH 1/3] dt-bindings: Fix undocumented compatible strings in examples

2021-02-03 Thread Daniel Palmer
Hi Rob,

On Wed, 3 Feb 2021 at 05:55, Rob Herring  wrote:
> diff --git a/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml 
> b/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
> index 1f2ef408bb43..fe1e1c63ffe3 100644
> --- a/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
> +++ b/Documentation/devicetree/bindings/gpio/mstar,msc313-gpio.yaml
> @@ -46,7 +46,7 @@ examples:
>  #include 
>
>  gpio: gpio@207800 {
> -  compatible = "mstar,msc313e-gpio";
> +  compatible = "mstar,msc313-gpio";
>#gpio-cells = <2>;
>reg = <0x207800 0x200>;
>gpio-controller;

This is correct. The compatible string dropped the e at some point and
I must have missed the example.
Thanks for the fix.

Reviewed-by: Daniel Palmer 


Re: [PATCH v2] Remove __init from padata_do_multithreaded and padata_mt_helper.

2020-10-27 Thread Daniel Jordan
On 10/27/20 12:46 AM, Nico Pache wrote:
> On Wed, Jul 08, 2020 at 03:51:40PM -0400, Daniel Jordan wrote:
> > (I was away for a while)
> > 
> > On Thu, Jul 02, 2020 at 11:55:48AM -0400, Nico Pache wrote:
> > > Allow padata_do_multithreaded function to be called after bootstrap.
> > 
> > The functions are __init because they're currently only needed during boot,
> > and using __init allows the text to be freed once it's over, saving some
> > memory.
> > 
> > So this change, in isolation, doesn't make sense.  If there were an
> > enhancement you were thinking of making, this patch could then be bundled
> > with it so the change is made only when it's used.
> > 
> > However, there's still work that needs to be merged before
> > padata_do_multithreaded can be called after boot.  See the parts about
> > priority adjustments (MAX_NICE/renicing) and concurrency limits in this
> > branch
> > 
> >   https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5
> > 
> > and the ktask discussions from linux-mm/lkml where concerns about these
> > issues were raised.  I plan to post these parts fairly soon and can include
> > you if you want.
>
> I really like the speed benefits I've been able to achieve by using your
> padata multithreaded interface in the branch you linked me to. Do you
> still have plans on moving forward with this upstream?

Yes, I'm still planning to push these patches upstream, but it's going to take
some time with all the prerequisites.  I'm working on remote charging in the
CPU controller now, which is the biggest unfinished task.  A little background
on that here:

https://lore.kernel.org/linux-mm/20200219220859.gf54...@cmpxchg.org/


[PATCH] module: statically initialize init section freeing data

2020-10-08 Thread Daniel Jordan
Corentin hit the following workqueue warning when running with
CRYPTO_MANAGER_EXTRA_TESTS:

  WARNING: CPU: 2 PID: 147 at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0
  Modules linked in: ghash_generic
  CPU: 2 PID: 147 Comm: modprobe Not tainted
  5.6.0-rc1-next-20200214-00068-g166c9264f0b1-dirty #545
  Hardware name: Pine H64 model A (DT)
  pc : __queue_work+0x3b8/0x3d0
  Call trace:
   __queue_work+0x3b8/0x3d0
   queue_work_on+0x6c/0x90
   do_init_module+0x188/0x1f0
   load_module+0x1d00/0x22b0

I wasn't able to reproduce on x86 or rpi 3b+.

This is

  WARN_ON(!list_empty(&work->entry))

from __queue_work(), and it happens because the init_free_wq work item
isn't initialized in time for a crypto test that requests the gcm
module.  Some crypto tests were recently moved earlier in boot as
explained in commit c4741b230597 ("crypto: run initcalls for generic
implementations earlier"), which went into mainline less than two weeks
before the Fixes commit.

Avoid the warning by statically initializing init_free_wq and the
corresponding llist.
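
For readers who haven't hit this class of bug before, a minimal sketch of
the difference (illustrative names, not the ones in kernel/module.c): a work
item initialized with INIT_WORK() in an initcall cannot be queued before
that initcall runs, while DECLARE_WORK() produces a fully initialized item
at compile time.

#include <linux/module.h>
#include <linux/workqueue.h>

static void example_handler(struct work_struct *w)
{
        /* free whatever was queued for cleanup */
}

/* Run-time initialization: queueing runtime_wq before example_wq_init()
 * has run trips the WARN_ON(!list_empty(&work->entry)) seen above,
 * because the zero-initialized work_struct looks "already queued". */
static struct work_struct runtime_wq;

static int __init example_wq_init(void)
{
        INIT_WORK(&runtime_wq, example_handler);
        return 0;
}
module_init(example_wq_init);

/* Compile-time initialization: safe to queue from arbitrarily early code. */
static DECLARE_WORK(static_wq, example_handler);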

Link: https://lore.kernel.org/lkml/20200217204803.GA13479@Red/
Fixes: 1a7b7d922081 ("modules: Use vmalloc special flag")
Reported-by: Corentin Labbe 
Tested-by: Corentin Labbe 
Tested-on: sun50i-h6-pine-h64
Tested-on: imx8mn-ddr4-evk
Tested-on: sun50i-a64-bananapi-m64
Signed-off-by: Daniel Jordan 
---
 kernel/module.c | 13 +++--
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 1c5cff34d9f2..8486123ffd7a 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -91,8 +91,9 @@ EXPORT_SYMBOL_GPL(module_mutex);
 static LIST_HEAD(modules);
 
 /* Work queue for freeing init sections in success case */
-static struct work_struct init_free_wq;
-static struct llist_head init_free_list;
+static void do_free_init(struct work_struct *w);
+static DECLARE_WORK(init_free_wq, do_free_init);
+static LLIST_HEAD(init_free_list);
 
 #ifdef CONFIG_MODULES_TREE_LOOKUP
 
@@ -3579,14 +3580,6 @@ static void do_free_init(struct work_struct *w)
}
 }
 
-static int __init modules_wq_init(void)
-{
-   INIT_WORK(&init_free_wq, do_free_init);
-   init_llist_head(&init_free_list);
-   return 0;
-}
-module_init(modules_wq_init);
-
 /*
  * This is where the real work happens.
  *

base-commit: c85fb28b6f999db9928b841f63f1beeb3074eeca
-- 
2.28.0



Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-10-08 Thread Daniel Jordan
On Wed, Oct 07, 2020 at 09:41:17PM +0200, Corentin Labbe wrote:
> I have added CONFIG_FTRACE=y and your second patch.
> The boot log can be seen at http://kernel.montjoie.ovh/108789.log
> 
> But it seems the latest dump_stack addition flood a bit.

Heh, sorry for making it spew, there wasn't such a flood when I tried.  Your
output is sufficiently incriminating, so I'll go post the fix now.

> I have started to read the ftrace documentation, but if you have quick
> guidance on what to do in /sys/kernel/debug/tracing, it would be helpful.

Sure, you can view the trace in /sys/kernel/debug/tracing/trace and
kernel-parameters.txt has the boot options documented.


Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-10-05 Thread Daniel Jordan
On Thu, Oct 01, 2020 at 07:50:22PM +0200, Corentin Labbe wrote:
> On Tue, Mar 03, 2020 at 04:30:17PM -0500, Daniel Jordan wrote:
> > Barring other ideas, Corentin, would you be willing to boot with
> > 
> > trace_event=initcall:*,module:* trace_options=stacktrace
> > 
> > and
> > 
> > diff --git a/kernel/module.c b/kernel/module.c
> > index 33569a01d6e1..393be6979a27 100644
> > --- a/kernel/module.c
> > +++ b/kernel/module.c
> > @@ -3604,8 +3604,11 @@ static noinline int do_init_module(struct module 
> > *mod)
> >  * be cleaned up needs to sync with the queued work - ie
> >  * rcu_barrier()
> >  */
> > -   if (llist_add(&freeinit->node, &init_free_list))
> > +   if (llist_add(&freeinit->node, &init_free_list)) {
> > +   pr_warn("%s: schedule_work for mod=%s\n", __func__, mod->name);
> > +   dump_stack();
> > schedule_work(&init_free_wq);
> > +   }
> >  
> > mutex_unlock(&module_mutex);
> > wake_up_all(&module_wq);
> > 
> > but not my earlier fix and share the dmesg and ftrace output to see if the
> > theory holds?
> > 
> > Also, could you attach your config?  Curious now what your crypto options 
> > look
> > like after fiddling with some of them today while trying and failing to see
> > this on x86.
> > 
> > thanks,
> > Daniel
> 
> Hello
> 
> Sorry for the very delayed answer.
> 
> I failed to reproduce it on x86 (qemu and real hw) and arm.
> It seems to only happen on arm64.

Thanks for the config and dmesg, but there's no ftrace.  I see it's not
configured in your kernel, so could you boot with my earlier debug patch plus
this one and the kernel argument initcall_debug instead?

I'm trying to see whether it really is a request module call from the crypto
tests that's triggering this warning.  Preeetty likely that's what's happening,
but want to be sure since I can't reproduce this.  Then I can post the fix.

diff --git a/crypto/algapi.c b/crypto/algapi.c
index fdabf2675b63..0667c6b4588e 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -393,6 +393,10 @@ static void crypto_wait_for_test(struct crypto_larval 
*larval)
 {
int err;
 
+   pr_warn("%s: cra_name %s cra_driver_name %s\n", __func__,
+   larval->adult->cra_name, larval->adult->cra_driver_name);
+   dump_stack();
+
err = crypto_probing_notify(CRYPTO_MSG_ALG_REGISTER, larval->adult);
if (err != NOTIFY_STOP) {
if (WARN_ON(err != NOTIFY_DONE))
diff --git a/kernel/kmod.c b/kernel/kmod.c
index 3cd075ce2a1e..46c4645be763 100644
--- a/kernel/kmod.c
+++ b/kernel/kmod.c
@@ -166,6 +166,8 @@ int __request_module(bool wait, const char *fmt, ...)
}
 
trace_module_request(module_name, wait, _RET_IP_);
+   pr_warn("%s: %s\n", __func__, module_name);
+   dump_stack();
 
ret = call_modprobe(module_name, wait ? UMH_WAIT_PROC : UMH_WAIT_EXEC);
 


Re: WARNING: at kernel/workqueue.c:1473 __queue_work+0x3b8/0x3d0

2020-09-30 Thread Daniel Jordan
On Fri, Sep 25, 2020 at 08:12:03PM +0200, Corentin Labbe wrote:
> On Tue, Mar 03, 2020 at 04:31:11PM -0500, Daniel Jordan wrote:
> > On Tue, Mar 03, 2020 at 08:48:19AM +0100, Corentin Labbe wrote:
> > > The patch fix the issue. Thanks!
> > 
> > Thanks for trying it!
> > 
> > > So you could add:
> > > Reported-by: Corentin Labbe 
> > > Tested-by: Corentin Labbe 
> > > Tested-on: sun50i-h6-pine-h64
> > > Tested-on: imx8mn-ddr4-evk
> > > Tested-on: sun50i-a64-bananapi-m64
> > 
> > I definitely will if the patch turns out to be the right fix.
> > 
> > thanks,
> > Daniel
> 
> Hello
> 
> I forgot about this problem since the patch has been in my branch since then.
> But a co-worker hit this problem recently, and without this patch my CI still
> hits it.

Hi,

Sure, I'm happy to help get a fix merged, but let's nail down what the problem
is first.  It'd be useful to have the things requested here:

https://lore.kernel.org/linux-crypto/20200303213017.tanczhqd3nhpe...@ca-dmjordan1.us.oracle.com/

thanks,
Daniel


[PATCH] padata: fix possible padata_works_lock deadlock

2020-09-02 Thread Daniel Jordan
syzbot reports,

  WARNING: inconsistent lock state
  5.9.0-rc2-syzkaller #0 Not tainted
  --------------------------------
  inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
  syz-executor.0/26715 takes:
  (padata_works_lock){+.?.}-{2:2}, at: padata_do_parallel kernel/padata.c:220
  {IN-SOFTIRQ-W} state was registered at:
spin_lock include/linux/spinlock.h:354 [inline]
padata_do_parallel kernel/padata.c:220
...
__do_softirq kernel/softirq.c:298
...
sysvec_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1091
asm_sysvec_apic_timer_interrupt arch/x86/include/asm/idtentry.h:581

   Possible unsafe locking scenario:

         CPU0
         ----
    lock(padata_works_lock);
    <Interrupt>
      lock(padata_works_lock);

padata_do_parallel() takes padata_works_lock with softirqs enabled, so a
deadlock is possible if, on the same CPU, the lock is acquired in
process context and then softirq handling done in an interrupt leads to
the same path.

Fix by leaving softirqs disabled while do_parallel holds
padata_works_lock.
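
Spelled out as a minimal sketch (this is the general pattern, not the padata
code; the actual fix below gets the same effect by taking the lock inside the
rcu_read_lock_bh() section, which keeps bottom halves disabled):

#include <linux/bottom_half.h>
#include <linux/spinlock.h>

static DEFINE_SPINLOCK(example_lock);   /* also taken from softirq context */

/* Process-context user: bottom halves must stay disabled while the lock
 * is held, otherwise a softirq on the same CPU can interrupt the critical
 * section and spin on example_lock forever. */
static void process_context_user(void)
{
        local_bh_disable();
        spin_lock(&example_lock);
        /* ... */
        spin_unlock(&example_lock);
        local_bh_enable();
}

/* Softirq-context user, e.g. called from a NET_RX or timer handler. */
static void softirq_context_user(void)
{
        spin_lock(&example_lock);
        /* ... */
        spin_unlock(&example_lock);
}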

Reported-by: syzbot+f4b9f49e38e25eb4e...@syzkaller.appspotmail.com
Fixes: 4611ce2246889 ("padata: allocate work structures for parallel jobs from a pool")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 kernel/padata.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 16cb894dc272..d4d3ba6e1728 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -215,12 +215,13 @@ int padata_do_parallel(struct padata_shell *ps,
padata->pd = pd;
padata->cb_cpu = *cb_cpu;
 
-   rcu_read_unlock_bh();
-
spin_lock(&padata_works_lock);
padata->seq_nr = ++pd->seq_nr;
pw = padata_work_alloc();
spin_unlock(&padata_works_lock);
+
+   rcu_read_unlock_bh();
+
if (pw) {
padata_work_init(pw, padata_parallel_worker, padata, 0);
queue_work(pinst->parallel_wq, &pw->pw_work);

base-commit: 9c7d619be5a002ea29c172df5e3c1227c22cbb41
-- 
2.28.0



[PATCH v2] padata: add another maintainer and another list

2020-08-27 Thread Daniel Jordan
At Steffen's request, I'll help maintain padata for the foreseeable
future.

While at it, let's have patches go to lkml too since the code is now
used outside of crypto.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 MAINTAINERS | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3b186ade3597..06a1b8a6d953 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13024,7 +13024,9 @@ F:  lib/packing.c
 
 PADATA PARALLEL EXECUTION MECHANISM
 M: Steffen Klassert 
+M: Daniel Jordan 
 L: linux-crypto@vger.kernel.org
+L: linux-ker...@vger.kernel.org
 S: Maintained
 F: Documentation/core-api/padata.rst
 F: include/linux/padata.h
-- 
2.28.0



Re: [PATCH] padata: add a reviewer

2020-08-27 Thread Daniel Jordan
On Thu, Aug 27, 2020 at 08:44:09AM +0200, Steffen Klassert wrote:
> Please also consider to add yourself as one of the maintainers.

Ok, sure!  I'll take you up on that.


[PATCH] padata: add a reviewer

2020-08-26 Thread Daniel Jordan
I volunteer to review padata changes for the foreseeable future.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 MAINTAINERS | 1 +
 1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index 3b186ade3597..1481d47cfd75 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13024,6 +13024,7 @@ F:  lib/packing.c
 
 PADATA PARALLEL EXECUTION MECHANISM
 M: Steffen Klassert 
+R: Daniel Jordan 
 L: linux-crypto@vger.kernel.org
 S: Maintained
 F: Documentation/core-api/padata.rst
-- 
2.27.0



[PATCH 5/6] padata: fold padata_alloc_possible() into padata_alloc()

2020-07-14 Thread Daniel Jordan
There's no reason to have two interfaces when there's only one caller.
Removing _possible saves text and simplifies future changes.
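
For context, setup and teardown against the surviving single-argument
interface would look roughly like this (hypothetical caller, error handling
trimmed; the one real caller, pcrypt_init_padata(), is in the diff below):

#include <linux/errno.h>
#include <linux/padata.h>

static struct padata_instance *example_pinst;
static struct padata_shell *example_ps;

static int example_setup(void)
{
        example_pinst = padata_alloc("example"); /* uses cpu_possible_mask */
        if (!example_pinst)
                return -ENOMEM;

        example_ps = padata_alloc_shell(example_pinst);
        if (!example_ps) {
                padata_free(example_pinst);
                return -ENOMEM;
        }
        return 0;
}

static void example_teardown(void)
{
        padata_free_shell(example_ps);
        padata_free(example_pinst);
}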

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/core-api/padata.rst |  2 +-
 crypto/pcrypt.c   |  2 +-
 include/linux/padata.h|  2 +-
 kernel/padata.c   | 33 +--
 4 files changed, 8 insertions(+), 31 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 771d50330e5b5..35175710b43cc 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -27,7 +27,7 @@ padata_instance structure for overall control of how jobs are 
to be run::
 
 #include 
 
-struct padata_instance *padata_alloc_possible(const char *name);
+struct padata_instance *padata_alloc(const char *name);
 
 'name' simply identifies the instance.
 
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 7374dfecaf70f..812892732a5e5 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -320,7 +320,7 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
 {
int ret = -ENOMEM;
 
-   *pinst = padata_alloc_possible(name);
+   *pinst = padata_alloc(name);
if (!*pinst)
return ret;
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index a941b96b7119e..070a7d43e8af8 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -192,7 +192,7 @@ extern void __init padata_init(void);
 static inline void __init padata_init(void) {}
 #endif
 
-extern struct padata_instance *padata_alloc_possible(const char *name);
+extern struct padata_instance *padata_alloc(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
 extern void padata_free_shell(struct padata_shell *ps);
diff --git a/kernel/padata.c b/kernel/padata.c
index 4f0a57e5738c9..1c0b97891edb8 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -979,18 +979,12 @@ static struct kobj_type padata_attr_type = {
 };
 
 /**
- * padata_alloc - allocate and initialize a padata instance and specify
- *cpumasks for serial and parallel workers.
- *
+ * padata_alloc - allocate and initialize a padata instance
  * @name: used to identify the instance
- * @pcpumask: cpumask that will be used for padata parallelization
- * @cbcpumask: cpumask that will be used for padata serialization
  *
  * Return: new instance on success, NULL on error
  */
-static struct padata_instance *padata_alloc(const char *name,
-   const struct cpumask *pcpumask,
-   const struct cpumask *cbcpumask)
+struct padata_instance *padata_alloc(const char *name)
 {
struct padata_instance *pinst;
 
@@ -1016,14 +1010,11 @@ static struct padata_instance *padata_alloc(const char 
*name,
free_cpumask_var(pinst->cpumask.pcpu);
goto err_free_serial_wq;
}
-   if (!padata_validate_cpumask(pinst, pcpumask) ||
-   !padata_validate_cpumask(pinst, cbcpumask))
-   goto err_free_masks;
 
INIT_LIST_HEAD(&pinst->pslist);
 
-   cpumask_copy(pinst->cpumask.pcpu, pcpumask);
-   cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
+   cpumask_copy(pinst->cpumask.pcpu, cpu_possible_mask);
+   cpumask_copy(pinst->cpumask.cbcpu, cpu_possible_mask);
 
if (padata_setup_cpumasks(pinst))
goto err_free_masks;
@@ -1057,21 +1048,7 @@ static struct padata_instance *padata_alloc(const char 
*name,
 err:
return NULL;
 }
-
-/**
- * padata_alloc_possible - Allocate and initialize padata instance.
- * Use the cpu_possible_mask for serial and
- * parallel workers.
- *
- * @name: used to identify the instance
- *
- * Return: new instance on success, NULL on error
- */
-struct padata_instance *padata_alloc_possible(const char *name)
-{
-   return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
-}
-EXPORT_SYMBOL(padata_alloc_possible);
+EXPORT_SYMBOL(padata_alloc);
 
 /**
  * padata_free - free a padata instance
-- 
2.27.0



[PATCH 4/6] padata: remove effective cpumasks from the instance

2020-07-14 Thread Daniel Jordan
A padata instance has effective cpumasks that store the user-supplied
masks ANDed with the online mask, but this middleman is unnecessary.
parallel_data keeps the same information around.  Removing this saves
text and code churn in future changes.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/padata.h |  2 --
 kernel/padata.c| 30 +++---
 2 files changed, 3 insertions(+), 29 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 7d53208b43daa..a941b96b7119e 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -167,7 +167,6 @@ struct padata_mt_job {
  * @serial_wq: The workqueue used for serial work.
  * @pslist: List of padata_shell objects attached to this instance.
  * @cpumask: User supplied cpumasks for parallel and serial works.
- * @rcpumask: Actual cpumasks based on user cpumask and cpu_online_mask.
  * @kobj: padata instance kernel object.
  * @lock: padata instance lock.
  * @flags: padata flags.
@@ -179,7 +178,6 @@ struct padata_instance {
struct workqueue_struct *serial_wq;
struct list_headpslist;
struct padata_cpumask   cpumask;
-   struct padata_cpumask   rcpumask;
struct kobject   kobj;
struct mutex lock;
u8   flags;
diff --git a/kernel/padata.c b/kernel/padata.c
index 27f90a3c4dc6b..4f0a57e5738c9 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -571,13 +571,8 @@ static void padata_init_pqueues(struct parallel_data *pd)
 static struct parallel_data *padata_alloc_pd(struct padata_shell *ps)
 {
struct padata_instance *pinst = ps->pinst;
-   const struct cpumask *cbcpumask;
-   const struct cpumask *pcpumask;
struct parallel_data *pd;
 
-   cbcpumask = pinst->rcpumask.cbcpu;
-   pcpumask = pinst->rcpumask.pcpu;
-
pd = kzalloc(sizeof(struct parallel_data), GFP_KERNEL);
if (!pd)
goto err;
@@ -597,8 +592,8 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
goto err_free_pcpu;
 
-   cpumask_copy(pd->cpumask.pcpu, pcpumask);
-   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
+   cpumask_and(pd->cpumask.pcpu, pinst->cpumask.pcpu, cpu_online_mask);
+   cpumask_and(pd->cpumask.cbcpu, pinst->cpumask.cbcpu, cpu_online_mask);
 
padata_init_pqueues(pd);
padata_init_squeues(pd);
@@ -668,12 +663,6 @@ static int padata_replace(struct padata_instance *pinst)
 
pinst->flags |= PADATA_RESET;
 
-   cpumask_and(pinst->rcpumask.pcpu, pinst->cpumask.pcpu,
-   cpu_online_mask);
-
-   cpumask_and(pinst->rcpumask.cbcpu, pinst->cpumask.cbcpu,
-   cpu_online_mask);
-
list_for_each_entry(ps, &pinst->pslist, list) {
err = padata_replace_one(ps);
if (err)
@@ -856,8 +845,6 @@ static void __padata_free(struct padata_instance *pinst)
 
WARN_ON(!list_empty(&pinst->pslist));
 
-   free_cpumask_var(pinst->rcpumask.cbcpu);
-   free_cpumask_var(pinst->rcpumask.pcpu);
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
destroy_workqueue(pinst->serial_wq);
@@ -1033,20 +1020,13 @@ static struct padata_instance *padata_alloc(const char 
*name,
!padata_validate_cpumask(pinst, cbcpumask))
goto err_free_masks;
 
-   if (!alloc_cpumask_var(&pinst->rcpumask.pcpu, GFP_KERNEL))
-   goto err_free_masks;
-   if (!alloc_cpumask_var(&pinst->rcpumask.cbcpu, GFP_KERNEL))
-   goto err_free_rcpumask_pcpu;
-
INIT_LIST_HEAD(&pinst->pslist);
 
cpumask_copy(pinst->cpumask.pcpu, pcpumask);
cpumask_copy(pinst->cpumask.cbcpu, cbcpumask);
-   cpumask_and(pinst->rcpumask.pcpu, pcpumask, cpu_online_mask);
-   cpumask_and(pinst->rcpumask.cbcpu, cbcpumask, cpu_online_mask);
 
if (padata_setup_cpumasks(pinst))
-   goto err_free_rcpumask_cbcpu;
+   goto err_free_masks;
 
__padata_start(pinst);
 
@@ -1064,10 +1044,6 @@ static struct padata_instance *padata_alloc(const char 
*name,
 
return pinst;
 
-err_free_rcpumask_cbcpu:
-   free_cpumask_var(pinst->rcpumask.cbcpu);
-err_free_rcpumask_pcpu:
-   free_cpumask_var(pinst->rcpumask.pcpu);
 err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
-- 
2.27.0



[PATCH 6/6] padata: remove padata_parallel_queue

2020-07-14 Thread Daniel Jordan
Only its reorder field is actually used now, so remove the struct and
embed @reorder directly in parallel_data.

No functional change, just a cleanup.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/padata.h | 15 ++
 kernel/padata.c| 46 ++
 2 files changed, 22 insertions(+), 39 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 070a7d43e8af8..a433f13fc4bf7 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -66,17 +66,6 @@ struct padata_serial_queue {
struct parallel_data *pd;
 };
 
-/**
- * struct padata_parallel_queue - The percpu padata parallel queue
- *
- * @reorder: List to wait for reordering after parallel processing.
- * @num_obj: Number of objects that are processed by this cpu.
- */
-struct padata_parallel_queue {
-   struct padata_listreorder;
-   atomic_t  num_obj;
-};
-
 /**
  * struct padata_cpumask - The cpumasks for the parallel/serial workers
  *
@@ -93,7 +82,7 @@ struct padata_cpumask {
  * that depends on the cpumask in use.
  *
  * @ps: padata_shell object.
- * @pqueue: percpu padata queues used for parallelization.
+ * @reorder_list: percpu reorder lists
  * @squeue: percpu padata queues used for serialuzation.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @seq_nr: Sequence number of the parallelized data object.
@@ -105,7 +94,7 @@ struct padata_cpumask {
  */
 struct parallel_data {
struct padata_shell *ps;
-   struct padata_parallel_queue__percpu *pqueue;
+   struct padata_list  __percpu *reorder_list;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
unsigned intseq_nr;
diff --git a/kernel/padata.c b/kernel/padata.c
index 1c0b97891edb8..16cb894dc272b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -250,13 +250,11 @@ EXPORT_SYMBOL(padata_do_parallel);
 static struct padata_priv *padata_find_next(struct parallel_data *pd,
bool remove_object)
 {
-   struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
int cpu = pd->cpu;
 
-   next_queue = per_cpu_ptr(pd->pqueue, cpu);
-   reorder = &next_queue->reorder;
+   reorder = per_cpu_ptr(pd->reorder_list, cpu);
 
spin_lock(&reorder->lock);
if (list_empty(&reorder->list)) {
@@ -291,7 +289,7 @@ static void padata_reorder(struct parallel_data *pd)
int cb_cpu;
struct padata_priv *padata;
struct padata_serial_queue *squeue;
-   struct padata_parallel_queue *next_queue;
+   struct padata_list *reorder;
 
/*
 * We need to ensure that only one cpu can work on dequeueing of
@@ -339,9 +337,8 @@ static void padata_reorder(struct parallel_data *pd)
 */
smp_mb();
 
-   next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
-   if (!list_empty(&next_queue->reorder.list) &&
-   padata_find_next(pd, false))
+   reorder = per_cpu_ptr(pd->reorder_list, pd->cpu);
+   if (!list_empty(&reorder->list) && padata_find_next(pd, false))
queue_work(pinst->serial_wq, &pd->reorder_work);
 }
 
@@ -401,17 +398,16 @@ void padata_do_serial(struct padata_priv *padata)
 {
struct parallel_data *pd = padata->pd;
int hashed_cpu = padata_cpu_hash(pd, padata->seq_nr);
-   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
-  hashed_cpu);
+   struct padata_list *reorder = per_cpu_ptr(pd->reorder_list, hashed_cpu);
struct padata_priv *cur;
 
-   spin_lock(&pqueue->reorder.lock);
+   spin_lock(&reorder->lock);
/* Sort in ascending order of sequence number. */
-   list_for_each_entry_reverse(cur, &pqueue->reorder.list, list)
+   list_for_each_entry_reverse(cur, &reorder->list, list)
if (cur->seq_nr < padata->seq_nr)
break;
list_add(&padata->list, &cur->list);
-   spin_unlock(&pqueue->reorder.lock);
+   spin_unlock(&reorder->lock);
 
/*
 * Ensure the addition to the reorder list is ordered correctly
@@ -553,17 +549,15 @@ static void padata_init_squeues(struct parallel_data *pd)
}
 }
 
-/* Initialize all percpu queues used by parallel workers */
-static void padata_init_pqueues(struct parallel_data *pd)
+/* Initialize per-CPU reorder lists */
+static void padata_init_reorder_list(struct parallel_data *pd)
 {
int cpu;
-   struct padata_parallel_queue *pqueue;
+   struct padata_

[PATCH 0/6] padata cleanups

2020-07-14 Thread Daniel Jordan
These cleanups save ~5% of the padata text/data and make it a little
easier to use and develop going forward.

In particular, they pave the way to extend padata's multithreading support to
VFIO, a work-in-progress version of which can be found here:


https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

Based on v5.8-rc5.  As always, feedback is welcome.

Daniel

Daniel Jordan (6):
  padata: remove start function
  padata: remove stop function
  padata: inline single call of pd_setup_cpumasks()
  padata: remove effective cpumasks from the instance
  padata: fold padata_alloc_possible() into padata_alloc()
  padata: remove padata_parallel_queue

 Documentation/core-api/padata.rst |  18 +--
 crypto/pcrypt.c   |  17 +--
 include/linux/padata.h|  21 +---
 kernel/padata.c   | 177 ++
 4 files changed, 46 insertions(+), 187 deletions(-)


base-commit: 11ba468877bb23f28956a35e896356252d63c983
-- 
2.27.0



[PATCH 1/6] padata: remove start function

2020-07-14 Thread Daniel Jordan
padata_start() is only used right after pcrypt allocates an instance
with all possible CPUs, when PADATA_INVALID can't happen, so there's no
need for a separate "start" step.  It can be done during allocation to
save text, make using padata easier, and avoid unneeded calls in the
future.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 crypto/pcrypt.c|  3 ---
 include/linux/padata.h |  1 -
 kernel/padata.c| 26 +-
 3 files changed, 1 insertion(+), 29 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 8bddc65cd5092..4f5707a3dd1e9 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -359,9 +359,6 @@ static int __init pcrypt_init(void)
if (err)
goto err_deinit_pencrypt;
 
-   padata_start(pencrypt);
-   padata_start(pdecrypt);
-
return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 7302efff5e656..20294cddc7396 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -204,6 +204,5 @@ extern void padata_do_serial(struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
-extern int padata_start(struct padata_instance *pinst);
 extern void padata_stop(struct padata_instance *pinst);
 #endif
diff --git a/kernel/padata.c b/kernel/padata.c
index 4373f7adaa40a..9317623166124 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -789,30 +789,6 @@ int padata_set_cpumask(struct padata_instance *pinst, int 
cpumask_type,
 }
 EXPORT_SYMBOL(padata_set_cpumask);
 
-/**
- * padata_start - start the parallel processing
- *
- * @pinst: padata instance to start
- *
- * Return: 0 on success or negative error code
- */
-int padata_start(struct padata_instance *pinst)
-{
-   int err = 0;
-
-   mutex_lock(&pinst->lock);
-
-   if (pinst->flags & PADATA_INVALID)
-   err = -EINVAL;
-
-   __padata_start(pinst);
-
-   mutex_unlock(&pinst->lock);
-
-   return err;
-}
-EXPORT_SYMBOL(padata_start);
-
 /**
  * padata_stop - stop the parallel processing
  *
@@ -1100,7 +1076,7 @@ static struct padata_instance *padata_alloc(const char 
*name,
if (padata_setup_cpumasks(pinst))
goto err_free_rcpumask_cbcpu;
 
-   pinst->flags = 0;
+   __padata_start(pinst);
 
kobject_init(&pinst->kobj, &padata_attr_type);
mutex_init(&pinst->lock);
-- 
2.27.0



[PATCH 2/6] padata: remove stop function

2020-07-14 Thread Daniel Jordan
padata_stop() has two callers and is unnecessary in both cases.  When
pcrypt calls it before padata_free(), it's being unloaded so there are
no outstanding padata jobs[0].  When __padata_free() calls it, it's
either along the same path or else pcrypt initialization failed, which
of course means there are also no outstanding jobs.

Removing it simplifies padata and saves text.

[0] 
https://lore.kernel.org/linux-crypto/20191119225017.mjrak2fwa5vcc...@gondor.apana.org.au/

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/core-api/padata.rst | 16 ++--
 crypto/pcrypt.c   | 12 +++-
 include/linux/padata.h|  1 -
 kernel/padata.c   | 14 --
 4 files changed, 5 insertions(+), 38 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 0830e5b0e8211..771d50330e5b5 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -31,18 +31,7 @@ padata_instance structure for overall control of how jobs 
are to be run::
 
 'name' simply identifies the instance.
 
-There are functions for enabling and disabling the instance::
-
-int padata_start(struct padata_instance *pinst);
-void padata_stop(struct padata_instance *pinst);
-
-These functions are setting or clearing the "PADATA_INIT" flag; if that flag is
-not set, other functions will refuse to work.  padata_start() returns zero on
-success (flag set) or -EINVAL if the padata cpumask contains no active CPU
-(flag not set).  padata_stop() clears the flag and blocks until the padata
-instance is unused.
-
-Finally, complete padata initialization by allocating a padata_shell::
+Then, complete padata initialization by allocating a padata_shell::
 
struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
 
@@ -155,11 +144,10 @@ submitted.
 Destroying
 --
 
-Cleaning up a padata instance predictably involves calling the three free
+Cleaning up a padata instance predictably involves calling the two free
 functions that correspond to the allocation in reverse::
 
 void padata_free_shell(struct padata_shell *ps);
-void padata_stop(struct padata_instance *pinst);
 void padata_free(struct padata_instance *pinst);
 
 It is the user's responsibility to ensure all outstanding jobs are complete
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 4f5707a3dd1e9..7374dfecaf70f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -331,12 +331,6 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
return ret;
 }
 
-static void pcrypt_fini_padata(struct padata_instance *pinst)
-{
-   padata_stop(pinst);
-   padata_free(pinst);
-}
-
 static struct crypto_template pcrypt_tmpl = {
.name = "pcrypt",
.create = pcrypt_create,
@@ -362,7 +356,7 @@ static int __init pcrypt_init(void)
return crypto_register_template(&pcrypt_tmpl);
 
 err_deinit_pencrypt:
-   pcrypt_fini_padata(pencrypt);
+   padata_free(pencrypt);
 err_unreg_kset:
kset_unregister(pcrypt_kset);
 err:
@@ -373,8 +367,8 @@ static void __exit pcrypt_exit(void)
 {
crypto_unregister_template(&pcrypt_tmpl);
 
-   pcrypt_fini_padata(pencrypt);
-   pcrypt_fini_padata(pdecrypt);
+   padata_free(pencrypt);
+   padata_free(pdecrypt);
 
kset_unregister(pcrypt_kset);
 }
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 20294cddc7396..7d53208b43daa 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -204,5 +204,4 @@ extern void padata_do_serial(struct padata_priv *padata);
 extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
-extern void padata_stop(struct padata_instance *pinst);
 #endif
diff --git a/kernel/padata.c b/kernel/padata.c
index 9317623166124..8f55e717ba50b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -789,19 +789,6 @@ int padata_set_cpumask(struct padata_instance *pinst, int 
cpumask_type,
 }
 EXPORT_SYMBOL(padata_set_cpumask);
 
-/**
- * padata_stop - stop the parallel processing
- *
- * @pinst: padata instance to stop
- */
-void padata_stop(struct padata_instance *pinst)
-{
-   mutex_lock(&pinst->lock);
-   __padata_stop(pinst);
-   mutex_unlock(&pinst->lock);
-}
-EXPORT_SYMBOL(padata_stop);
-
 #ifdef CONFIG_HOTPLUG_CPU
 
 static int __padata_add_cpu(struct padata_instance *pinst, int cpu)
@@ -883,7 +870,6 @@ static void __padata_free(struct padata_instance *pinst)
 
WARN_ON(!list_empty(&pinst->pslist));
 
-   padata_stop(pinst);
free_cpumask_var(pinst->rcpumask.cbcpu);
free_cpumask_var(pinst->rcpumask.pcpu);
free_cpumask_var(pinst->cpumask.pcpu);
-- 
2.27.0



[PATCH 3/6] padata: inline single call of pd_setup_cpumasks()

2020-07-14 Thread Daniel Jordan
pd_setup_cpumasks() has only one caller.  Move its contents inline to
prepare for the next cleanup.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 kernel/padata.c | 32 +---
 1 file changed, 9 insertions(+), 23 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 8f55e717ba50b..27f90a3c4dc6b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -441,28 +441,6 @@ static int padata_setup_cpumasks(struct padata_instance 
*pinst)
return err;
 }
 
-static int pd_setup_cpumasks(struct parallel_data *pd,
-const struct cpumask *pcpumask,
-const struct cpumask *cbcpumask)
-{
-   int err = -ENOMEM;
-
-   if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
-   goto out;
-   if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
-   goto free_pcpu_mask;
-
-   cpumask_copy(pd->cpumask.pcpu, pcpumask);
-   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
-
-   return 0;
-
-free_pcpu_mask:
-   free_cpumask_var(pd->cpumask.pcpu);
-out:
-   return err;
-}
-
 static void __init padata_mt_helper(struct work_struct *w)
 {
struct padata_work *pw = container_of(w, struct padata_work, pw_work);
@@ -613,8 +591,14 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
goto err_free_pqueue;
 
pd->ps = ps;
-   if (pd_setup_cpumasks(pd, pcpumask, cbcpumask))
+
+   if (!alloc_cpumask_var(&pd->cpumask.pcpu, GFP_KERNEL))
goto err_free_squeue;
+   if (!alloc_cpumask_var(&pd->cpumask.cbcpu, GFP_KERNEL))
+   goto err_free_pcpu;
+
+   cpumask_copy(pd->cpumask.pcpu, pcpumask);
+   cpumask_copy(pd->cpumask.cbcpu, cbcpumask);
 
padata_init_pqueues(pd);
padata_init_squeues(pd);
@@ -626,6 +610,8 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_shell *ps)
 
return pd;
 
+err_free_pcpu:
+   free_cpumask_var(pd->cpumask.pcpu);
 err_free_squeue:
free_percpu(pd->squeue);
 err_free_pqueue:
-- 
2.27.0



Re: [PATCH v2] Remove __init from padata_do_multithreaded and padata_mt_helper.

2020-07-08 Thread Daniel Jordan
(I was away for a while)

On Thu, Jul 02, 2020 at 11:55:48AM -0400, Nico Pache wrote:
> Allow padata_do_multithreaded function to be called after bootstrap.

The functions are __init because they're currently only needed during boot, and
using __init allows the text to be freed once it's over, saving some memory.
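
A concrete (made-up) illustration of that: anything marked __init lands in
the .init.text/.init.data sections, which free_initmem() discards at the end
of boot, so nothing may reference it afterwards.

#include <linux/init.h>

/* Discarded after boot; a call from a later code path would jump into
 * freed (and possibly reused) memory. */
static int __init boot_only_setup(void)
{
        return 0;
}

/* Data can be marked the same way. */
static int boot_only_threshold __initdata = 42;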

So this change, in isolation, doesn't make sense.  If there were an enhancement
you were thinking of making, this patch could then be bundled with it so the
change is made only when it's used.

However, there's still work that needs to be merged before
padata_do_multithreaded can be called after boot.  See the parts about priority
adjustments (MAX_NICE/renicing) and concurrency limits in this branch

  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

and the ktask discussions from linux-mm/lkml where concerns about these issues
were raised.  I plan to post these parts fairly soon and can include you if you
want.


[PATCH] padata: upgrade smp_mb__after_atomic to smp_mb in padata_do_serial

2020-06-08 Thread Daniel Jordan
A 5.7 kernel hangs during a tcrypt test of padata that waits for an AEAD
request to finish.  This is only seen on large machines running many
concurrent requests.

The issue is that padata never serializes the request.  The removal of
the reorder_objects atomic missed that the memory barrier in
padata_do_serial() depends on it.

Upgrade the barrier from smp_mb__after_atomic to smp_mb to get correct
ordering again.
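
Some background on the primitive, as a generic sketch rather than the padata
code: smp_mb__after_atomic() only guarantees full ordering when it directly
follows an atomic read-modify-write (on x86 it compiles to a plain barrier()
because locked instructions already order), so once the reorder_objects
atomic was removed, the barrier in padata_do_serial() no longer ordered the
reorder-list insertion against padata_reorder()'s check.

#include <linux/atomic.h>
#include <linux/bug.h>

static atomic_t published_cnt = ATOMIC_INIT(0);
static int payload;
static int flag;

static void publisher(void)
{
        WRITE_ONCE(payload, 42);
        atomic_inc(&published_cnt);
        smp_mb__after_atomic();  /* full ordering only because of the
                                  * atomic_inc() directly above; without
                                  * it this is just barrier() on x86 */
        WRITE_ONCE(flag, 1);
}

static void subscriber(void)
{
        if (READ_ONCE(flag)) {
                smp_rmb();
                BUG_ON(READ_ONCE(payload) != 42);  /* guaranteed */
        }
}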

Fixes: 3facced7aeed1 ("padata: remove reorder_objects")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 kernel/padata.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..7b701bc3e7922 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -260,7 +260,7 @@ static void padata_reorder(struct parallel_data *pd)
 *
 * Ensure reorder queue is read after pd->lock is dropped so we see
 * new objects from another task in padata_do_serial.  Pairs with
-* smp_mb__after_atomic in padata_do_serial.
+* smp_mb in padata_do_serial.
 */
smp_mb();
 
@@ -342,7 +342,7 @@ void padata_do_serial(struct padata_priv *padata)
 * with the trylock of pd->lock in padata_reorder.  Pairs with smp_mb
 * in padata_reorder.
 */
-   smp_mb__after_atomic();
+   smp_mb();
 
padata_reorder(pd);
 }

base-commit: 3d77e6a8804abcc0504c904bd6e5cdf3a5cf8162
-- 
2.26.2



[PATCH v3 7/8] mm: make deferred init's max threads arch-specific

2020-05-27 Thread Daniel Jordan
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
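
A hypothetical override for some other architecture could be as small as
this (the cap of four threads is purely illustrative):

/* arch/<arch>/mm/init.c, hypothetical */
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
{
        /* Say benchmarking on this arch showed no gain past 4 threads. */
        return min(4, max_t(int, cpumask_weight(node_cpumask), 1));
}
#endif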

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 arch/x86/mm/init_64.c| 12 
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c  | 13 -
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   /*
+* More CPUs always led to greater speedups on tested systems, up to
+* all the nodes' CPUs.  Use all since the system is otherwise idle
+* now.
+*/
+   return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone 
*zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
for (; i != U64_MAX;  \
 __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d47016849531..329fd1a809c59 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1835,6 +1835,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, 
unsigned long end_pfn,
}
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1883,11 +1890,7 @@ static int __init deferred_init_memmap(void *data)
 first_init_pfn))
goto zone_empty;
 
-   /*
-* More CPUs always led to greater speedups on tested systems, up to
-* all the nodes' CPUs.  Use all since the system is otherwise idle now.
-*/
-   max_threads = max(cpumask_weight(cpumask), 1u);
+   max_threads = deferred_page_init_max_threads(cpumask);
 
while (spfn < epfn) {
unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



[PATCH v3 1/8] padata: remove exit routine

2020-05-27 Thread Daniel Jordan
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 kernel/padata.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-   cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-   cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



[PATCH v3 0/8] padata: parallelize deferred page init

2020-05-27 Thread Daniel Jordan
Thanks to Alex for his continued review and Josh for running v2!  Please
continue to review and test, and acks for the padata parts would be
appreciated.

Daniel

--

Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 8.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 4% to 49%, saving up to 1.6 out of 4 seconds.  Patch 6 has more
numbers.

The powerpc and s390 lists are included in case they want to give this a
try, they had enabled this feature when it was configured per arch.

Series based on v5.7-rc7 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v3
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v3

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.5
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.5

v3:
 - Remove nr_pages accounting as suggested by Alex, adding a new patch
 - Align deferred init ranges up not down, simplify surrounding code (Alex)
 - Add Josh's T-b's from v2 (Josh's T-b's for v1 lost in rebase, apologies!)
 - Move padata.h include up in init/main.c to reduce patch collisions (Andrew)
 - Slightly reword Documentation patch
 - Rebase on v5.7-rc7 and retest

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
https://www.youtube.com/watch?v=pBsHnf93tcQ

https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/

[2] 
https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (8):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: don't track number of pages during deferred initialization
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c |  12 ++
 include/linux/memblock.h  |   3 +
 include/linux/padata.h|  43 -
 init/main.c   |   2 +
 kernel/padata.c   | 277 --
 mm/Kconfig|   6 +-
 mm/page_alloc.c   |  59 +-

[PATCH v3 3/8] padata: allocate work structures for parallel jobs from a pool

2020-05-27 Thread Daniel Jordan
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.
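
The exhaustion fallback described above, reduced to a sketch as it would sit
in kernel/padata.c (simplified from what padata_do_parallel() does after this
patch, not the verbatim code):

static void submit_parallel(struct padata_instance *pinst,
                            struct padata_priv *padata)
{
        struct padata_work *pw;

        spin_lock(&padata_works_lock);
        pw = padata_work_alloc();  /* NULL once the shared pool is empty */
        spin_unlock(&padata_works_lock);

        if (pw) {
                padata_work_init(pw, padata_parallel_worker, padata);
                queue_work(pinst->parallel_wq, &pw->pw_work);
        } else {
                /* Pool exhausted: run in the submitting task to throttle. */
                padata->parallel(padata);
        }
}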

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |   8 +--
 kernel/padata.c| 118 +++--
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
struct list_headlist;
struct parallel_data*pd;
int cb_cpu;
-   int cpu;
unsigned intseq_nr;
int info;
void(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-   struct padata_listparallel;
struct padata_listreorder;
-   struct work_structwork;
atomic_t  num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
struct padata_parallel_queue__percpu *pqueue;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
-   atomic_tseq_nr;
+   unsigned intseq_nr;
unsigned intprocessed;
int cpu;
struct padata_cpumask   cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include 
 #include 
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+   struct work_struct  pw_work;
+   struct list_headpw_list;  /* padata_free_works linkage */
+   void*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, 
unsigned int seq_nr)
return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-   struct padata_parallel_queue *pqueue;
-   LIST_HEAD(local_list);
+   struct padata_work *pw;
 
-   local_bh_disable();
-   pqueue = container_of(parallel_work,
- struct padata_parallel_queue, work);
+   lockdep_assert_held(&padata_works_lock);
 
-   spin_lock(&pqueue->parallel.lock);
-   list_replace_init(&pqueue->parallel.list, &local_list);
-   spin_unlock(&pqueue->parallel.lock);
+   if (list_empty(&padata_free_works))
+   return NULL;/* No more work items allowed to be queued. */
 
-   while (!list_empty(&local_list)) {
-   struct padata_priv *padata;
+   pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+   list_del(&pw->pw_list);
+   return pw;
+}
 
-   padata = list_entry(local_list.next,
-   struct padata_priv, list);

[PATCH v3 8/8] padata: document multithreaded jobs

2020-05-27 Thread Daniel Jordan
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 Documentation/core-api/padata.rst | 41 +++
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst 
b/Documentation/core-api/padata.rst
index 9a24c111781d9..0830e5b0e8211 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 ===
 
-:Date: December 2019
+:Date: May 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=====
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 ------------
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run serialized jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
 #include 
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically allocated on the main thread's stack.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =========
 
-- 
2.26.2
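
A minimal usage sketch of the interface described in the documentation above.
Everything except padata_mt_job, its fields and padata_do_multithreaded() is
made up for illustration (frob_state, frob_range(), frob_all() and the numeric
values are not part of the patch):

	#include <linux/atomic.h>
	#include <linux/padata.h>

	struct frob_state {
		atomic_long_t nr_frobbed;	/* shared between helper threads */
	};

	/* Thread function: handle one chunk covering [start, end). */
	static void frob_range(unsigned long start, unsigned long end, void *arg)
	{
		struct frob_state *state = arg;

		/* ... do the actual per-item work here ... */
		atomic_long_add(end - start, &state->nr_frobbed);
	}

	/* padata_do_multithreaded() is __init in this series, so run at boot. */
	static void __init frob_all(unsigned long nr_items)
	{
		struct frob_state state = { .nr_frobbed = ATOMIC_LONG_INIT(0) };
		struct padata_mt_job job = {
			.thread_fn   = frob_range,
			.fn_arg      = &state,
			.start       = 0,
			.size        = nr_items,
			.align       = 1,	/* no alignment requirement */
			.min_chunk   = 1024,	/* arbitrary example value */
			.max_threads = 8,	/* arbitrary example value */
		};

		padata_do_multithreaded(&job);	/* returns once the job is done */
	}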



[PATCH v3 5/8] mm: don't track number of pages during deferred initialization

2020-05-27 Thread Daniel Jordan
Deferred page init used to report the number of pages initialized:

  node 0 initialised, 32439114 pages in 97ms

Tracking this makes the code more complicated when using multiple
threads.  Given that the statistic probably has limited value,
especially since a zone grows on demand so that the page count can vary,
just remove it.

The boot message now looks like

  node 0 deferred pages initialised in 97ms

Signed-off-by: Daniel Jordan 
Suggested-by: Alexander Duyck 
---
 mm/page_alloc.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..d64f3027fdfa6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1819,7 +1819,7 @@ static int __init deferred_init_memmap(void *data)
 {
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
-   unsigned long spfn = 0, epfn = 0, nr_pages = 0;
+   unsigned long spfn = 0, epfn = 0;
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
@@ -1868,15 +1868,15 @@ static int __init deferred_init_memmap(void *data)
 * allocator.
 */
while (spfn < epfn) {
-   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   deferred_init_maxorder(&i, zone, &spfn, &epfn);
cond_resched();
}
 zone_empty:
/* Sanity check that the next zone really is unpopulated */
WARN_ON(++zid < MAX_NR_ZONES && populated_zone(++zone));
 
-   pr_info("node %d initialised, %lu pages in %ums\n",
-   pgdat->node_id, nr_pages, jiffies_to_msecs(jiffies - start));
+   pr_info("node %d deferred pages initialised in %ums\n",
+   pgdat->node_id, jiffies_to_msecs(jiffies - start));
 
pgdat_init_report_one_done();
return 0;
-- 
2.26.2



[PATCH v3 2/8] padata: initialize earlier

2020-05-27 Thread Daniel Jordan
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  6 ++
 init/main.c|  2 ++
 kernel/padata.c| 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #definePADATA_INVALID  4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..df32f67214d23 100644
--- a/init/main.c
+++ b/init/main.c
@@ -63,6 +63,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();
 
+   padata_init();
page_alloc_init_late();
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
int ret;
 
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
  padata_cpu_online, NULL);
if (ret < 0)
-   return ret;
+   goto err;
hp_online = ret;
 
ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
  NULL, padata_cpu_dead);
if (ret < 0) {
cpuhp_remove_multi_state(hp_online);
-   return ret;
+   goto err;
}
-   return 0;
-}
-module_init(padata_driver_init);
 
+   return;
+err:
+   pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



[PATCH v3 6/8] mm: parallelize deferred_init_memmap()

2020-05-27 Thread Daniel Jordan
 29.2)  79.8% 48.7 (  7.4)
 100% ( 16)  21.0%813.7 ( 21.0)  80.5% 47.0 (  5.2)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  16G/node = 16G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    716.0 ( 14.0)         --     49.7 (  0.6)
  25% (  1)       1.8%    703.0 (  5.3)      -4.0%     51.7 (  0.6)
  50% (  2)       1.6%    704.7 (  1.2)      43.0%     28.3 (  0.6)
  75% (  3)       2.7%    696.7 ( 13.1)      49.7%     25.0 (  0.0)
 100% (  4)       4.1%    687.0 ( 10.4)      55.7%     22.0 (  0.0)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  14G/node = 14G memory

               kernel boot                 deferred init
               ------------------------    ------------------------
node% (thr)    speedup  time_ms (stdev)    speedup  time_ms (stdev)
      (  0)         --    787.7 (  6.4)         --    122.3 (  0.6)
  25% (  1)       0.2%    786.3 ( 10.8)      -2.5%    125.3 (  2.1)
  50% (  2)       5.9%    741.0 ( 13.9)      37.6%     76.3 ( 19.7)
  75% (  3)       8.3%    722.0 ( 19.0)      49.9%     61.3 (  3.2)
 100% (  4)       9.3%    714.7 (  9.5)      56.4%     53.3 (  1.5)

On Josh's 96-CPU and 192G memory system:

Without this patch series:
[0.487132] node 0 initialised, 23398907 pages in 292ms
[0.499132] node 1 initialised, 24189223 pages in 304ms
...
[0.629376] Run /sbin/init as init process

With this patch series:
[0.231435] node 1 initialised, 24189223 pages in 32ms
[0.236718] node 0 initialised, 23398907 pages in 36ms

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 mm/Kconfig  |  6 +++---
 mm/page_alloc.c | 46 --
 2 files changed, 43 insertions(+), 9 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
  amount of time. If this option is set, large machines will bring up
- a subset of memmap at boot and then initialise the rest in parallel
- by starting one-off "pgdatinitX" kernel thread for each node X. This
- has a potential performance impact on processes running early in the
+ a subset of memmap at boot and then initialise the rest in parallel.
+ This has a potential performance impact on tasks running early in the
  lifetime of the system until these kthreads finish the
  initialisation.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d64f3027fdfa6..1d47016849531 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1814,6 +1815,26 @@ deferred_init_maxorder(u64 *i, struct zone *zone, 
unsigned long *start_pfn,
return nr_pages;
 }
 
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+  void *arg)
+{
+   unsigned long spfn, epfn;
+   struct zone *zone = arg;
+   u64 i;
+
+   deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+   /*
+* Initialize and free pages in MAX_ORDER sized increments so that we
+* can avoid introducing any issues with the buddy allocator.
+*/
+   while (spfn < end_pfn) {
+   deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   cond_resched();
+   }
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1823,7 +1844,7 @@ static int __init deferred_init_memmap(void *data)
unsigned long first_init_pfn, flags;
unsigned long start = jiffies;
struct zone *zone;
-   int zid;
+   int zid, max_threads;
u64 i;
 
/* Bind memory initialisation thread to a local node if possible */
@@ -1863,13 +1884,26 @@ static int __init deferred_init_memmap(void *data)
goto zone_empty;
 
/*
-* Initialize and free pages in MAX_ORDER s
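
The rest of this hunk is truncated in the archive.  In rough terms, the loop
that follows hands each section-aligned stretch of the zone to padata as a
multithreaded job, along these lines (a sketch, not the verbatim hunk):

	max_threads = max(cpumask_weight(cpumask), 1u);

	while (spfn < epfn) {
		unsigned long epfn_align = ALIGN(epfn, PAGES_PER_SECTION);
		struct padata_mt_job job = {
			.thread_fn   = deferred_init_memmap_chunk,
			.fn_arg      = zone,
			.start       = spfn,
			.size        = epfn_align - spfn,
			.align       = PAGES_PER_SECTION,
			.min_chunk   = PAGES_PER_SECTION,
			.max_threads = max_threads,
		};

		/* Helpers initialise this range; the call returns when done. */
		padata_do_multithreaded(&job);

		/* Advance to the first deferred range past what was handed out. */
		deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
						    epfn_align);
	}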

[PATCH v3 4/8] padata: add basic support for multithreaded jobs

2020-05-27 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan 
Tested-by: Josh Triplett 
---
 include/linux/padata.h |  29 
 kernel/padata.c| 152 -
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
struct list_headlist;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ * possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ * the client to communicate the minimum amount of work that's
+ * appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *   depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+   void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+   void*fn_arg;
+   unsigned long   start;
+   unsigned long   size;
+   unsigned long   align;
+   unsigned long   min_chunk;
+   int max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
  struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +36,8 @@
 #include 
 #include 
 
+#definePADATA_WORK_ONSTACK 1   /* Work's memory is on stack */
+
 struct padata_work {
struct work_struct  pw_work;
struct list_headpw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+   spinlock_t  lock;
+   struct completion   completion;
+   struct padata_mt_job*job;
+   int nworks;
+   int nworks_fini;
+   unsigned long   chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-void *data)
+void *data, int flags)
 {
-   INIT_WORK(&pw->pw_work, work_fn);
+  
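
For context, the work-splitting that this patch introduces boils down to
something like the following (an approximation of the sizing decisions made by
padata_do_multithreaded(), not a verbatim excerpt):

	unsigned long nworks, chunk_size;

	/* One helper per min_chunk of work, capped by the caller's limit. */
	nworks = max(job->size / job->min_chunk, 1ul);
	nworks = min(nworks, (unsigned long)job->max_threads);

	/* Split the job evenly, respecting the minimum and the alignment. */
	chunk_size = job->size / nworks;
	chunk_size = max(chunk_size, job->min_chunk);
	chunk_size = roundup(chunk_size, job->align);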

Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Thu, May 21, 2020 at 09:46:35AM -0700, Alexander Duyck wrote:
> It is more about not bothering with the extra tracking. We don't
> really need it and having it doesn't really add much in the way of
> value.

Yeah, it can probably go.

> > > > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void 
> > > > *data)
> > > > goto zone_empty;
> > > >
> > > > /*
> > > > -* Initialize and free pages in MAX_ORDER sized increments so
> > > > -* that we can avoid introducing any issues with the buddy
> > > > -* allocator.
> > > > +* More CPUs always led to greater speedups on tested systems, 
> > > > up to
> > > > +* all the nodes' CPUs.  Use all since the system is otherwise 
> > > > idle now.
> > > >  */
> > > > +   max_threads = max(cpumask_weight(cpumask), 1u);
> > > > +
> > > > while (spfn < epfn) {
> > > > +   epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > > > +
> > > > +   if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > > > +   epfn_align - spfn >= PAGES_PER_SECTION) {
> > > > +   struct definit_args arg = { zone, 
> > > > ATOMIC_LONG_INIT(0) };
> > > > +   struct padata_mt_job job = {
> > > > +   .thread_fn   = 
> > > > deferred_init_memmap_chunk,
> > > > +   .fn_arg  = &arg,
> > > > +   .start   = spfn,
> > > > +   .size= epfn_align - spfn,
> > > > +   .align   = PAGES_PER_SECTION,
> > > > +   .min_chunk   = PAGES_PER_SECTION,
> > > > +   .max_threads = max_threads,
> > > > +   };
> > > > +
> > > > +   padata_do_multithreaded(&job);
> > > > +   nr_pages += atomic_long_read(&arg.nr_pages);
> > > > +   spfn = epfn_align;
> > > > +   }
> > > > +
> > > > nr_pages += deferred_init_maxorder(&i, zone, &spfn, 
> > > > &epfn);
> > > > cond_resched();
> > > > }
> > >
> > > This doesn't look right. You are basically adding threads in addition
> > > to calls to deferred_init_maxorder.
> >
> > The deferred_init_maxorder call is there to do the remaining, non-section
> > aligned part of a range.  It doesn't have to be done this way.
> 
> It is also doing the advancing though isn't it?

Yes.  Not sure what you're getting at.  There's the 'spfn = epfn_align' before
so nothing is skipped.  It's true that the nonaligned part is done outside of
padata when it could be done by a thread that'd otherwise be waiting or idle,
which should be addressed in the next version.

> I think I resolved this with the fix for it I described in the other
> email. We just need to swap out spfn for epfn and make sure we align
> spfn with epfn_align. Then I think that takes care of possible skips.

Right, though your fix looks a lot like deferred_init_mem_pfn_range_in_zone().
Seems better to just use that and not repeat ourselves.  Lame that it's
starting at the beginning of the ranges every time, maybe it could be
generalized somehow, but I think it should be fast enough.

> > We could use deferred_init_mem_pfn_range_in_zone() instead of the for_each
> > loop.
> >
> > What I was trying to avoid by aligning down is creating a discontiguous pfn
> > range that get passed to padata.  We already discussed how those are handled
> > by the zone iterator in the thread function, but job->size can be 
> > exaggerated
> > to include parts of the range that are never touched.  Thinking more about 
> > it
> > though, it's a small fraction of the total work and shouldn't matter.
> 
> So the problem with aligning down is that you are going to be slowed
> up as you have to go single threaded to initialize whatever remains.
> So worst case scenario is that you have a section aligned block and
> you will process all but 1 section in parallel, and then have to
> process the remaining section one max order block at a time.

Yes, aligning up is better.

> > > This should accomplish the same thing, but much more efficiently.
> >
> > Well, more cleanly.  I'll give it a try.
> 
> I agree I am not sure if it will make a big difference on x86, however
> the more ranges you have to process the faster this approach should be
> as it stays parallel the entire time rather than having to drop out
> and process the last section one max order block at a time.

Right.


[stable-4.4 5/5] padata: purge get_cpu and reorder_via_wq from padata_do_serial

2020-05-21 Thread Daniel Jordan
[ Upstream commit 065cf577135a4977931c7a1e1edf442bfd9773dd ]

With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 43b72f5dfe07..c50975f43b34 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -322,24 +322,9 @@ static void padata_serial_worker(struct work_struct 
*serial_work)
  */
 void padata_do_serial(struct padata_priv *padata)
 {
-   int cpu;
-   struct padata_parallel_queue *pqueue;
-   struct parallel_data *pd;
-   int reorder_via_wq = 0;
-
-   pd = padata->pd;
-
-   cpu = get_cpu();
-
-   /* We need to enqueue the padata object into the correct
-* per-cpu queue.
-*/
-   if (cpu != padata->cpu) {
-   reorder_via_wq = 1;
-   cpu = padata->cpu;
-   }
-
-   pqueue = per_cpu_ptr(pd->pqueue, cpu);
+   struct parallel_data *pd = padata->pd;
+   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
+  padata->cpu);
 
spin_lock(&pqueue->reorder.lock);
list_add_tail(&padata->list, &pqueue->reorder.list);
@@ -353,8 +338,6 @@ void padata_do_serial(struct padata_priv *padata)
 */
smp_mb__after_atomic();
 
-   put_cpu();
-
padata_reorder(pd);
 }
 EXPORT_SYMBOL(padata_do_serial);
-- 
2.26.2



[stable-4.4 4/5] padata: initialize pd->cpu with effective cpumask

2020-05-21 Thread Daniel Jordan
[ Upstream commit ec9c7d19336ee98ecba8de80128aa405c45feebb ]

Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
# modprobe tcrypt mode=215

...caused the following crash:

BUG: kernel NULL pointer dereference, address: 
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP PTI
CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-
Workqueue: pencrypt padata_parallel_worker
RIP: 0010:padata_reorder+0xcb/0x180
...
Call Trace:
 padata_do_serial+0x57/0x60
 pcrypt_aead_enc+0x3a/0x50 [pcrypt]
 padata_parallel_worker+0x9b/0xe0
 process_one_work+0x1b5/0x3f0
 worker_thread+0x4a/0x3c0
 ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in 
padata_reorder")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index e5966eedfa36..43b72f5dfe07 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -449,7 +449,7 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_instance *pinst,
atomic_set(&pd->refcnt, 1);
pd->pinst = pinst;
spin_lock_init(&pd->lock);
-   pd->cpu = cpumask_first(pcpumask);
+   pd->cpu = cpumask_first(pd->cpumask.pcpu);
INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
 
return pd;
-- 
2.26.2



[stable-4.4 3/5] padata: Replace delayed timer with immediate workqueue in padata_reorder

2020-05-21 Thread Daniel Jordan
From: Herbert Xu 

[ Upstream commit 6fc4dbcf0276279d488c5fbbfabe94734134f4fa ]

The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu 
[dj: - adjust context
 - corrected setup_timer -> timer_setup to delete hunk
 - skip padata_flush_queues() hunk, function already removed
   in 4.4]
Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h | 13 ++
 kernel/padata.c| 95 --
 2 files changed, 22 insertions(+), 86 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index e74d61fa50fe..547a8d1e4a3b 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -85,18 +84,14 @@ struct padata_serial_queue {
  * @serial: List to wait for serialization after reordering.
  * @pwork: work struct for parallelization.
  * @swork: work struct for serialization.
- * @pd: Backpointer to the internal control structure.
  * @work: work struct for parallelization.
- * @reorder_work: work struct for reordering.
  * @num_obj: Number of objects that are processed by this cpu.
  * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
struct padata_listparallel;
struct padata_listreorder;
-   struct parallel_data *pd;
struct work_structwork;
-   struct work_structreorder_work;
atomic_t  num_obj;
int   cpu_index;
 };
@@ -122,10 +117,10 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
+ * @reorder_work: work struct for reordering.
  * @lock: Reorder lock.
- * @processed: Number of already processed objects.
- * @timer: Reorder timer.
  */
 struct parallel_data {
struct padata_instance  *pinst;
@@ -134,10 +129,10 @@ struct parallel_data {
atomic_treorder_objects;
atomic_trefcnt;
atomic_tseq_nr;
+   int cpu;
struct padata_cpumask   cpumask;
+   struct work_struct  reorder_work;
spinlock_t  lock cacheline_aligned;
-   unsigned intprocessed;
-   struct timer_list   timer;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 4f860043a8e5..e5966eedfa36 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -165,23 +165,12 @@ EXPORT_SYMBOL(padata_do_parallel);
  */
 static struct padata_priv *padata_get_next(struct parallel_data *pd)
 {
-   int cpu, num_cpus;
-   unsigned int next_nr, next_index;
struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
+   int cpu = pd->cpu;
 
-   num_cpus = cpumask_weight(pd->cpumask.pcpu);
-
-   /*
-* Calculate the percpu reorder queue and the sequence
-* number of the next object.
-*/
-   next_nr = pd->processed;
-   next_index = next_nr % num_cpus;
-   cpu = padata_index_to_cpu(pd, next_index);
next_queue = per_cpu_ptr(pd->pqueue, cpu);
-
reorder = &next_queue->reorder;
 
spin_lock(&reorder->lock);
@@ -192,7 +181,8 @@ static struct padata_priv *padata_get_next(struct 
parallel_data *pd)
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
 
-   pd->processed++;
+   pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
+   false);
 
spin_unlock(&reorder->lock);
goto out;
@@ -215,6 +205,7 @@ static void padata_reorder(struct parallel_data *pd)
struct padata_priv *padata;
struct padata_serial_queu

[stable-4.4 2/5] sched/fair, cpumask: Export for_each_cpu_wrap()

2020-05-21 Thread Daniel Jordan
From: Peter Zijlstra 

[ Upstream commit c743f0a5c50f2fcbc628526279cfa24f3dabe182 ]

More users for for_each_cpu_wrap() have appeared. Promote the construct
to generic cpumask interface.

The implementation is slightly modified to reduce arguments.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Lauro Ramos Venancio 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: lw...@redhat.com
Link: 
http://lkml.kernel.org/r/20170414122005.o35me2h5nowqk...@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
[dj: include only what's added to the cpumask interface, 4.4 doesn't
 have them in the scheduler]
Signed-off-by: Daniel Jordan 
---
 include/linux/cpumask.h | 17 +
 lib/cpumask.c   | 32 
 2 files changed, 49 insertions(+)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index bb3a4bb35183..1322883e7b46 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -232,6 +232,23 @@ unsigned int cpumask_local_spread(unsigned int i, int 
node);
(cpu) = cpumask_next_zero((cpu), (mask)),   \
(cpu) < nr_cpu_ids;)
 
+extern int cpumask_next_wrap(int n, const struct cpumask *mask, int start, 
bool wrap);
+
+/**
+ * for_each_cpu_wrap - iterate over every cpu in a mask, starting at a 
specified location
+ * @cpu: the (optionally unsigned) integer iterator
+ * @mask: the cpumask pointer
+ * @start: the start location
+ *
+ * The implementation does not assume any bit in @mask is set (including 
@start).
+ *
+ * After the loop, cpu is >= nr_cpu_ids.
+ */
+#define for_each_cpu_wrap(cpu, mask, start)
\
+   for ((cpu) = cpumask_next_wrap((start)-1, (mask), (start), false);  
\
+(cpu) < nr_cpumask_bits;   
\
+(cpu) = cpumask_next_wrap((cpu), (mask), (start), true))
+
 /**
  * for_each_cpu_and - iterate over every cpu in both masks
  * @cpu: the (optionally unsigned) integer iterator
diff --git a/lib/cpumask.c b/lib/cpumask.c
index 5a70f6196f57..24f06e7abf92 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -42,6 +42,38 @@ int cpumask_any_but(const struct cpumask *mask, unsigned int 
cpu)
return i;
 }
 
+/**
+ * cpumask_next_wrap - helper to implement for_each_cpu_wrap
+ * @n: the cpu prior to the place to search
+ * @mask: the cpumask pointer
+ * @start: the start point of the iteration
+ * @wrap: assume @n crossing @start terminates the iteration
+ *
+ * Returns >= nr_cpu_ids on completion
+ *
+ * Note: the @wrap argument is required for the start condition when
+ * we cannot assume @start is set in @mask.
+ */
+int cpumask_next_wrap(int n, const struct cpumask *mask, int start, bool wrap)
+{
+   int next;
+
+again:
+   next = cpumask_next(n, mask);
+
+   if (wrap && n < start && next >= start) {
+   return nr_cpumask_bits;
+
+   } else if (next >= nr_cpumask_bits) {
+   wrap = true;
+   n = -1;
+   goto again;
+   }
+
+   return next;
+}
+EXPORT_SYMBOL(cpumask_next_wrap);
+
 /* These are not inline because of header tangles. */
 #ifdef CONFIG_CPUMASK_OFFSTACK
 /**
-- 
2.26.2
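
A small usage sketch of the newly exported iterator (the mask, the starting
CPU and the pr_info() call are only for illustration):

	int cpu, start = 3;

	/*
	 * Visit every CPU set in cpu_online_mask exactly once, beginning at
	 * 'start' and wrapping past the highest set bit back to the lowest,
	 * e.g. 3, 4, ..., N-1, 0, 1, 2 on a fully online system.
	 */
	for_each_cpu_wrap(cpu, cpu_online_mask, start)
		pr_info("visiting cpu %d\n", cpu);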



[stable-4.4 1/5] padata: set cpu_index of unused CPUs to -1

2020-05-21 Thread Daniel Jordan
From: Mathias Krause 

[ Upstream commit 1bd845bcb41d5b7f83745e0cb99273eb376f2ec5 ]

The parallel queue per-cpu data structure gets initialized only for CPUs
in the 'pcpu' CPU mask set. This is not sufficient as the reorder timer
may run on a different CPU and might wrongly decide it's the target CPU
for the next reorder item as per-cpu memory gets memset(0) and we might
be waiting for the first CPU in cpumask.pcpu, i.e. cpu_index 0.

Make the '__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index'
compare in padata_get_next() fail in this case by initializing the
cpu_index member of all per-cpu parallel queues. Use -1 for unused ones.

Signed-off-by: Mathias Krause 
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 8aef48c3267b..4f860043a8e5 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -461,8 +461,14 @@ static void padata_init_pqueues(struct parallel_data *pd)
struct padata_parallel_queue *pqueue;
 
cpu_index = 0;
-   for_each_cpu(cpu, pd->cpumask.pcpu) {
+   for_each_possible_cpu(cpu) {
pqueue = per_cpu_ptr(pd->pqueue, cpu);
+
+   if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
+   pqueue->cpu_index = -1;
+   continue;
+   }
+
pqueue->pd = pd;
pqueue->cpu_index = cpu_index;
cpu_index++;
-- 
2.26.2



[stable-4.9 4/4] padata: purge get_cpu and reorder_via_wq from padata_do_serial

2020-05-21 Thread Daniel Jordan
[ Upstream commit 065cf577135a4977931c7a1e1edf442bfd9773dd ]

With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 1030e6cfc08c..e82f066d63ac 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -323,24 +323,9 @@ static void padata_serial_worker(struct work_struct 
*serial_work)
  */
 void padata_do_serial(struct padata_priv *padata)
 {
-   int cpu;
-   struct padata_parallel_queue *pqueue;
-   struct parallel_data *pd;
-   int reorder_via_wq = 0;
-
-   pd = padata->pd;
-
-   cpu = get_cpu();
-
-   /* We need to enqueue the padata object into the correct
-* per-cpu queue.
-*/
-   if (cpu != padata->cpu) {
-   reorder_via_wq = 1;
-   cpu = padata->cpu;
-   }
-
-   pqueue = per_cpu_ptr(pd->pqueue, cpu);
+   struct parallel_data *pd = padata->pd;
+   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
+  padata->cpu);
 
spin_lock(&pqueue->reorder.lock);
list_add_tail(&padata->list, &pqueue->reorder.list);
@@ -354,8 +339,6 @@ void padata_do_serial(struct padata_priv *padata)
 */
smp_mb__after_atomic();
 
-   put_cpu();
-
padata_reorder(pd);
 }
 EXPORT_SYMBOL(padata_do_serial);
-- 
2.26.2



[stable-4.9 1/4] padata: set cpu_index of unused CPUs to -1

2020-05-21 Thread Daniel Jordan
From: Mathias Krause 

[ Upstream commit 1bd845bcb41d5b7f83745e0cb99273eb376f2ec5 ]

The parallel queue per-cpu data structure gets initialized only for CPUs
in the 'pcpu' CPU mask set. This is not sufficient as the reorder timer
may run on a different CPU and might wrongly decide it's the target CPU
for the next reorder item as per-cpu memory gets memset(0) and we might
be waiting for the first CPU in cpumask.pcpu, i.e. cpu_index 0.

Make the '__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index'
compare in padata_get_next() fail in this case by initializing the
cpu_index member of all per-cpu parallel queues. Use -1 for unused ones.

Signed-off-by: Mathias Krause 
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 693536efccf9..52a1d3fd13b5 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -462,8 +462,14 @@ static void padata_init_pqueues(struct parallel_data *pd)
struct padata_parallel_queue *pqueue;
 
cpu_index = 0;
-   for_each_cpu(cpu, pd->cpumask.pcpu) {
+   for_each_possible_cpu(cpu) {
pqueue = per_cpu_ptr(pd->pqueue, cpu);
+
+   if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
+   pqueue->cpu_index = -1;
+   continue;
+   }
+
pqueue->pd = pd;
pqueue->cpu_index = cpu_index;
cpu_index++;
-- 
2.26.2



[stable-4.9 3/4] padata: initialize pd->cpu with effective cpumask

2020-05-21 Thread Daniel Jordan
[ Upstream commit ec9c7d19336ee98ecba8de80128aa405c45feebb ]

Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
# modprobe tcrypt mode=215

...caused the following crash:

BUG: kernel NULL pointer dereference, address: 
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP PTI
CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-
Workqueue: pencrypt padata_parallel_worker
RIP: 0010:padata_reorder+0xcb/0x180
...
Call Trace:
 padata_do_serial+0x57/0x60
 pcrypt_aead_enc+0x3a/0x50 [pcrypt]
 padata_parallel_worker+0x9b/0xe0
 process_one_work+0x1b5/0x3f0
 worker_thread+0x4a/0x3c0
 ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in 
padata_reorder")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 0b9c39730d6d..1030e6cfc08c 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -450,7 +450,7 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_instance *pinst,
atomic_set(&pd->refcnt, 1);
pd->pinst = pinst;
spin_lock_init(&pd->lock);
-   pd->cpu = cpumask_first(pcpumask);
+   pd->cpu = cpumask_first(pd->cpumask.pcpu);
INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
 
return pd;
-- 
2.26.2



[stable-4.14 2/4] padata: Replace delayed timer with immediate workqueue in padata_reorder

2020-05-21 Thread Daniel Jordan
From: Herbert Xu 

[ Upstream commit 6fc4dbcf0276279d488c5fbbfabe94734134f4fa ]

The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu 
[dj: - adjust context
 - corrected setup_timer -> timer_setup to delete hunk
 - skip padata_flush_queues() hunk, function already removed
   in 4.14]
Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h | 13 ++
 kernel/padata.c| 95 --
 2 files changed, 22 insertions(+), 86 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 5d13d25da2c8..d803397a28f7 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -85,18 +84,14 @@ struct padata_serial_queue {
  * @serial: List to wait for serialization after reordering.
  * @pwork: work struct for parallelization.
  * @swork: work struct for serialization.
- * @pd: Backpointer to the internal control structure.
  * @work: work struct for parallelization.
- * @reorder_work: work struct for reordering.
  * @num_obj: Number of objects that are processed by this cpu.
  * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
struct padata_listparallel;
struct padata_listreorder;
-   struct parallel_data *pd;
struct work_structwork;
-   struct work_structreorder_work;
atomic_t  num_obj;
int   cpu_index;
 };
@@ -122,10 +117,10 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
+ * @reorder_work: work struct for reordering.
  * @lock: Reorder lock.
- * @processed: Number of already processed objects.
- * @timer: Reorder timer.
  */
 struct parallel_data {
struct padata_instance  *pinst;
@@ -134,10 +129,10 @@ struct parallel_data {
atomic_treorder_objects;
atomic_trefcnt;
atomic_tseq_nr;
+   int cpu;
struct padata_cpumask   cpumask;
+   struct work_struct  reorder_work;
spinlock_t  lock cacheline_aligned;
-   unsigned intprocessed;
-   struct timer_list   timer;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 858e82179744..66d96ed62286 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -166,23 +166,12 @@ EXPORT_SYMBOL(padata_do_parallel);
  */
 static struct padata_priv *padata_get_next(struct parallel_data *pd)
 {
-   int cpu, num_cpus;
-   unsigned int next_nr, next_index;
struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
+   int cpu = pd->cpu;
 
-   num_cpus = cpumask_weight(pd->cpumask.pcpu);
-
-   /*
-* Calculate the percpu reorder queue and the sequence
-* number of the next object.
-*/
-   next_nr = pd->processed;
-   next_index = next_nr % num_cpus;
-   cpu = padata_index_to_cpu(pd, next_index);
next_queue = per_cpu_ptr(pd->pqueue, cpu);
-
reorder = &next_queue->reorder;
 
spin_lock(&reorder->lock);
@@ -193,7 +182,8 @@ static struct padata_priv *padata_get_next(struct 
parallel_data *pd)
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
 
-   pd->processed++;
+   pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
+   false);
 
spin_unlock(&reorder->lock);
goto out;
@@ -216,6 +206,7 @@ static void padata_reorder(struct parallel_data *pd)
struct padata_priv *padata;
struct padata_serial_queu

[stable-4.9 2/4] padata: Replace delayed timer with immediate workqueue in padata_reorder

2020-05-21 Thread Daniel Jordan
From: Herbert Xu 

[ Upstream commit 6fc4dbcf0276279d488c5fbbfabe94734134f4fa ]

The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu 
[dj: - adjust context
 - corrected setup_timer -> timer_setup to delete hunk
 - skip padata_flush_queues() hunk, function already removed
   in 4.9]
Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h | 13 ++
 kernel/padata.c| 95 --
 2 files changed, 22 insertions(+), 86 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 86c885f90878..3afa17ed59da 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -85,18 +84,14 @@ struct padata_serial_queue {
  * @serial: List to wait for serialization after reordering.
  * @pwork: work struct for parallelization.
  * @swork: work struct for serialization.
- * @pd: Backpointer to the internal control structure.
  * @work: work struct for parallelization.
- * @reorder_work: work struct for reordering.
  * @num_obj: Number of objects that are processed by this cpu.
  * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
struct padata_listparallel;
struct padata_listreorder;
-   struct parallel_data *pd;
struct work_structwork;
-   struct work_structreorder_work;
atomic_t  num_obj;
int   cpu_index;
 };
@@ -122,10 +117,10 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
+ * @reorder_work: work struct for reordering.
  * @lock: Reorder lock.
- * @processed: Number of already processed objects.
- * @timer: Reorder timer.
  */
 struct parallel_data {
struct padata_instance  *pinst;
@@ -134,10 +129,10 @@ struct parallel_data {
atomic_treorder_objects;
atomic_trefcnt;
atomic_tseq_nr;
+   int cpu;
struct padata_cpumask   cpumask;
+   struct work_struct  reorder_work;
spinlock_t  lock cacheline_aligned;
-   unsigned intprocessed;
-   struct timer_list   timer;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 52a1d3fd13b5..0b9c39730d6d 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -166,23 +166,12 @@ EXPORT_SYMBOL(padata_do_parallel);
  */
 static struct padata_priv *padata_get_next(struct parallel_data *pd)
 {
-   int cpu, num_cpus;
-   unsigned int next_nr, next_index;
struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
+   int cpu = pd->cpu;
 
-   num_cpus = cpumask_weight(pd->cpumask.pcpu);
-
-   /*
-* Calculate the percpu reorder queue and the sequence
-* number of the next object.
-*/
-   next_nr = pd->processed;
-   next_index = next_nr % num_cpus;
-   cpu = padata_index_to_cpu(pd, next_index);
next_queue = per_cpu_ptr(pd->pqueue, cpu);
-
reorder = &next_queue->reorder;
 
spin_lock(&reorder->lock);
@@ -193,7 +182,8 @@ static struct padata_priv *padata_get_next(struct 
parallel_data *pd)
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
 
-   pd->processed++;
+   pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
+   false);
 
spin_unlock(&reorder->lock);
goto out;
@@ -216,6 +206,7 @@ static void padata_reorder(struct parallel_data *pd)
struct padata_priv *padata;
struct padata_serial_queu

[stable-4.14 4/4] padata: purge get_cpu and reorder_via_wq from padata_do_serial

2020-05-21 Thread Daniel Jordan
[ Upstream commit 065cf577135a4977931c7a1e1edf442bfd9773dd ]

With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 6d0cdee9d321..f56ec63f60ba 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -323,24 +323,9 @@ static void padata_serial_worker(struct work_struct 
*serial_work)
  */
 void padata_do_serial(struct padata_priv *padata)
 {
-   int cpu;
-   struct padata_parallel_queue *pqueue;
-   struct parallel_data *pd;
-   int reorder_via_wq = 0;
-
-   pd = padata->pd;
-
-   cpu = get_cpu();
-
-   /* We need to enqueue the padata object into the correct
-* per-cpu queue.
-*/
-   if (cpu != padata->cpu) {
-   reorder_via_wq = 1;
-   cpu = padata->cpu;
-   }
-
-   pqueue = per_cpu_ptr(pd->pqueue, cpu);
+   struct parallel_data *pd = padata->pd;
+   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
+  padata->cpu);
 
spin_lock(&pqueue->reorder.lock);
list_add_tail(&padata->list, &pqueue->reorder.list);
@@ -354,8 +339,6 @@ void padata_do_serial(struct padata_priv *padata)
 */
smp_mb__after_atomic();
 
-   put_cpu();
-
padata_reorder(pd);
 }
 EXPORT_SYMBOL(padata_do_serial);
-- 
2.26.2



[stable-4.14 1/4] padata: set cpu_index of unused CPUs to -1

2020-05-21 Thread Daniel Jordan
From: Mathias Krause 

[ Upstream commit 1bd845bcb41d5b7f83745e0cb99273eb376f2ec5 ]

The parallel queue per-cpu data structure gets initialized only for CPUs
in the 'pcpu' CPU mask set. This is not sufficient as the reorder timer
may run on a different CPU and might wrongly decide it's the target CPU
for the next reorder item as per-cpu memory gets memset(0) and we might
be waiting for the first CPU in cpumask.pcpu, i.e. cpu_index 0.

Make the '__this_cpu_read(pd->pqueue->cpu_index) == next_queue->cpu_index'
compare in padata_get_next() fail in this case by initializing the
cpu_index member of all per-cpu parallel queues. Use -1 for unused ones.

Signed-off-by: Mathias Krause 
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 40a0ebb8ea51..858e82179744 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -462,8 +462,14 @@ static void padata_init_pqueues(struct parallel_data *pd)
struct padata_parallel_queue *pqueue;
 
cpu_index = 0;
-   for_each_cpu(cpu, pd->cpumask.pcpu) {
+   for_each_possible_cpu(cpu) {
pqueue = per_cpu_ptr(pd->pqueue, cpu);
+
+   if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
+   pqueue->cpu_index = -1;
+   continue;
+   }
+
pqueue->pd = pd;
pqueue->cpu_index = cpu_index;
cpu_index++;
-- 
2.26.2



[stable-4.14 3/4] padata: initialize pd->cpu with effective cpumask

2020-05-21 Thread Daniel Jordan
[ Upstream commit ec9c7d19336ee98ecba8de80128aa405c45feebb ]

Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
# modprobe tcrypt mode=215

...caused the following crash:

BUG: kernel NULL pointer dereference, address: 
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP PTI
CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-
Workqueue: pencrypt padata_parallel_worker
RIP: 0010:padata_reorder+0xcb/0x180
...
Call Trace:
 padata_do_serial+0x57/0x60
 pcrypt_aead_enc+0x3a/0x50 [pcrypt]
 padata_parallel_worker+0x9b/0xe0
 process_one_work+0x1b5/0x3f0
 worker_thread+0x4a/0x3c0
 ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in 
padata_reorder")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 66d96ed62286..6d0cdee9d321 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -450,7 +450,7 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_instance *pinst,
atomic_set(&pd->refcnt, 1);
pd->pinst = pinst;
spin_lock_init(&pd->lock);
-   pd->cpu = cpumask_first(pcpumask);
+   pd->cpu = cpumask_first(pd->cpumask.pcpu);
INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
 
return pd;
-- 
2.26.2



[stable-4.19 2/3] padata: initialize pd->cpu with effective cpumask

2020-05-21 Thread Daniel Jordan
[ Upstream commit ec9c7d19336ee98ecba8de80128aa405c45feebb ]

Exercising CPU hotplug on a 5.2 kernel with recent padata fixes from
cryptodev-2.6.git in an 8-CPU kvm guest...

# modprobe tcrypt alg="pcrypt(rfc4106(gcm(aes)))" type=3
# echo 0 > /sys/devices/system/cpu/cpu1/online
# echo c > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
# modprobe tcrypt mode=215

...caused the following crash:

BUG: kernel NULL pointer dereference, address: 
#PF: supervisor read access in kernel mode
#PF: error_code(0x) - not-present page
PGD 0 P4D 0
Oops:  [#1] SMP PTI
CPU: 2 PID: 134 Comm: kworker/2:2 Not tainted 5.2.0-padata-base+ #7
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-
Workqueue: pencrypt padata_parallel_worker
RIP: 0010:padata_reorder+0xcb/0x180
...
Call Trace:
 padata_do_serial+0x57/0x60
 pcrypt_aead_enc+0x3a/0x50 [pcrypt]
 padata_parallel_worker+0x9b/0xe0
 process_one_work+0x1b5/0x3f0
 worker_thread+0x4a/0x3c0
 ...

In padata_alloc_pd, pd->cpu is set using the user-supplied cpumask
instead of the effective cpumask, and in this case cpumask_first picked
an offline CPU.

The offline CPU's reorder->list.next is NULL in padata_reorder because
the list wasn't initialized in padata_init_pqueues, which only operates
on CPUs in the effective mask.

Fix by using the effective mask in padata_alloc_pd.

Fixes: 6fc4dbcf0276 ("padata: Replace delayed timer with immediate workqueue in 
padata_reorder")
Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 47dc31ce15ac..e9b8d517fd4b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -451,7 +451,7 @@ static struct parallel_data *padata_alloc_pd(struct 
padata_instance *pinst,
atomic_set(&pd->refcnt, 1);
pd->pinst = pinst;
spin_lock_init(&pd->lock);
-   pd->cpu = cpumask_first(pcpumask);
+   pd->cpu = cpumask_first(pd->cpumask.pcpu);
INIT_WORK(&pd->reorder_work, invoke_padata_reorder);
 
return pd;
-- 
2.26.2



[stable-4.19 1/3] padata: Replace delayed timer with immediate workqueue in padata_reorder

2020-05-21 Thread Daniel Jordan
From: Herbert Xu 

[ Upstream commit 6fc4dbcf0276279d488c5fbbfabe94734134f4fa ]

The function padata_reorder will use a timer when it cannot progress
while completed jobs are outstanding (pd->reorder_objects > 0).  This
is suboptimal as if we do end up using the timer then it would have
introduced a gratuitous delay of one second.

In fact we can easily distinguish between whether completed jobs
are outstanding and whether we can make progress.  All we have to
do is look at the next pqueue list.

This patch does that by replacing pd->processed with pd->cpu so
that the next pqueue is more accessible.

A work queue is used instead of the original try_again to avoid
hogging the CPU.

Note that we don't bother removing the work queue in
padata_flush_queues because the whole premise is broken.  You
cannot flush async crypto requests so it makes no sense to even
try.  A subsequent patch will fix it by replacing it with a ref
counting scheme.

Signed-off-by: Herbert Xu 
[dj: - adjust context
 - corrected setup_timer -> timer_setup to delete hunk
 - skip padata_flush_queues() hunk, function already removed
   in 4.19]
Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h | 13 ++
 kernel/padata.c| 95 --
 2 files changed, 22 insertions(+), 86 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 5d13d25da2c8..d803397a28f7 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
 #include 
 #include 
 #include 
-#include 
 #include 
 #include 
 
@@ -85,18 +84,14 @@ struct padata_serial_queue {
  * @serial: List to wait for serialization after reordering.
  * @pwork: work struct for parallelization.
  * @swork: work struct for serialization.
- * @pd: Backpointer to the internal control structure.
  * @work: work struct for parallelization.
- * @reorder_work: work struct for reordering.
  * @num_obj: Number of objects that are processed by this cpu.
  * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
struct padata_listparallel;
struct padata_listreorder;
-   struct parallel_data *pd;
struct work_structwork;
-   struct work_structreorder_work;
atomic_t  num_obj;
int   cpu_index;
 };
@@ -122,10 +117,10 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
+ * @reorder_work: work struct for reordering.
  * @lock: Reorder lock.
- * @processed: Number of already processed objects.
- * @timer: Reorder timer.
  */
 struct parallel_data {
struct padata_instance  *pinst;
@@ -134,10 +129,10 @@ struct parallel_data {
atomic_treorder_objects;
atomic_trefcnt;
atomic_tseq_nr;
+   int cpu;
struct padata_cpumask   cpumask;
+   struct work_struct  reorder_work;
spinlock_t  lock cacheline_aligned;
-   unsigned intprocessed;
-   struct timer_list   timer;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index c280cb153915..47dc31ce15ac 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -167,23 +167,12 @@ EXPORT_SYMBOL(padata_do_parallel);
  */
 static struct padata_priv *padata_get_next(struct parallel_data *pd)
 {
-   int cpu, num_cpus;
-   unsigned int next_nr, next_index;
struct padata_parallel_queue *next_queue;
struct padata_priv *padata;
struct padata_list *reorder;
+   int cpu = pd->cpu;
 
-   num_cpus = cpumask_weight(pd->cpumask.pcpu);
-
-   /*
-* Calculate the percpu reorder queue and the sequence
-* number of the next object.
-*/
-   next_nr = pd->processed;
-   next_index = next_nr % num_cpus;
-   cpu = padata_index_to_cpu(pd, next_index);
next_queue = per_cpu_ptr(pd->pqueue, cpu);
-
reorder = &next_queue->reorder;
 
spin_lock(&reorder->lock);
@@ -194,7 +183,8 @@ static struct padata_priv *padata_get_next(struct parallel_data *pd)
list_del_init(&padata->list);
atomic_dec(&pd->reorder_objects);
 
-   pd->processed++;
+   pd->cpu = cpumask_next_wrap(cpu, pd->cpumask.pcpu, -1,
+   false);
 
spin_unlock(&reorder->lock);
goto out;
@@ -217,6 +207,7 @@ static void padata_reorder(struct parallel_data *pd)
struct padata_priv *padata;
struct padata_serial_queu

[stable-4.19 3/3] padata: purge get_cpu and reorder_via_wq from padata_do_serial

2020-05-21 Thread Daniel Jordan
[ Upstream commit 065cf577135a4977931c7a1e1edf442bfd9773dd ]

With the removal of the padata timer, padata_do_serial no longer
needs special CPU handling, so remove it.

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Steffen Klassert 
Cc: linux-crypto@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Herbert Xu 
Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 23 +++
 1 file changed, 3 insertions(+), 20 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index e9b8d517fd4b..93e4fb2d9f2e 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -324,24 +324,9 @@ static void padata_serial_worker(struct work_struct *serial_work)
  */
 void padata_do_serial(struct padata_priv *padata)
 {
-   int cpu;
-   struct padata_parallel_queue *pqueue;
-   struct parallel_data *pd;
-   int reorder_via_wq = 0;
-
-   pd = padata->pd;
-
-   cpu = get_cpu();
-
-   /* We need to enqueue the padata object into the correct
-* per-cpu queue.
-*/
-   if (cpu != padata->cpu) {
-   reorder_via_wq = 1;
-   cpu = padata->cpu;
-   }
-
-   pqueue = per_cpu_ptr(pd->pqueue, cpu);
+   struct parallel_data *pd = padata->pd;
+   struct padata_parallel_queue *pqueue = per_cpu_ptr(pd->pqueue,
+  padata->cpu);
 
spin_lock(&pqueue->reorder.lock);
list_add_tail(&padata->list, &pqueue->reorder.list);
@@ -355,8 +340,6 @@ void padata_do_serial(struct padata_priv *padata)
 */
smp_mb__after_atomic();
 
-   put_cpu();
-
padata_reorder(pd);
 }
 EXPORT_SYMBOL(padata_do_serial);
-- 
2.26.2



Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Thu, May 21, 2020 at 08:00:31AM -0700, Alexander Duyck wrote:
> So I was thinking about my suggestion further and the loop at the end
> isn't quite correct as I believe it could lead to gaps. The loop on
> the end should probably be:
> for_each_free_mem_pfn_range_in_zone_from(i, zone, spfn, epfn) 
> {
> if (epfn <= epfn_align)
> continue;
> if (spfn < epfn_align)
> spfn = epfn_align;
> break;
> }
> 
> That would generate a new range where epfn_align has actually ended
> and there is a range of new PFNs to process.

Whoops, my email crossed with yours.  Agreed, but see the other message.


Re: [PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-21 Thread Daniel Jordan
On Wed, May 20, 2020 at 06:29:32PM -0700, Alexander Duyck wrote:
> On Wed, May 20, 2020 at 11:27 AM Daniel Jordan
> > @@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, 
> > unsigned long *start_pfn,
> > return nr_pages;
> >  }
> >
> > +struct definit_args {
> > +   struct zone *zone;
> > +   atomic_long_t nr_pages;
> > +};
> > +
> > +static void __init
> > +deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
> > +  void *arg)
> > +{
> > +   unsigned long spfn, epfn, nr_pages = 0;
> > +   struct definit_args *args = arg;
> > +   struct zone *zone = args->zone;
> > +   u64 i;
> > +
> > +   deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, 
> > start_pfn);
> > +
> > +   /*
> > +* Initialize and free pages in MAX_ORDER sized increments so that 
> > we
> > +* can avoid introducing any issues with the buddy allocator.
> > +*/
> > +   while (spfn < end_pfn) {
> > +   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > +   cond_resched();
> > +   }
> > +
> > +   atomic_long_add(nr_pages, &args->nr_pages);
> > +}
> > +
> 
> Personally I would get rid of nr_pages entirely. It isn't worth the
> cache thrash to have this atomic variable bouncing around.

One of the things I tried to optimize was the managed_pages atomic adds in
__free_pages_core, but performance stayed the same on the biggest machine I
tested when it was done once at the end of page init instead of in every thread
for every pageblock.

I'm not sure this atomic would matter either, given it's less frequent.

> You could
> probably just have this function return void since all nr_pages is
> used for is a pr_info  statement at the end of the initialization
> which will be completely useless now anyway since we really have the
> threads running in parallel anyway.

The timestamp is still useful for observability, page init is a significant
part of kernel boot on big machines, over 10% sometimes with these patches.

It's mostly the time that matters though, I agree the number of pages is less
important and is probably worth removing just to simplify the code.  I'll do it
if no one sees a reason to keep it.

> We only really need the nr_pages logic in deferred_grow_zone in order
> to track if we have freed enough pages to allow us to go back to what
> we were doing.
>
> > @@ -1863,11 +1892,32 @@ static int __init deferred_init_memmap(void *data)
> > goto zone_empty;
> >
> > /*
> > -* Initialize and free pages in MAX_ORDER sized increments so
> > -* that we can avoid introducing any issues with the buddy
> > -* allocator.
> > +* More CPUs always led to greater speedups on tested systems, up to
> > +* all the nodes' CPUs.  Use all since the system is otherwise idle 
> > now.
> >  */
> > +   max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> > while (spfn < epfn) {
> > +   epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
> > +
> > +   if (IS_ALIGNED(spfn, PAGES_PER_SECTION) &&
> > +   epfn_align - spfn >= PAGES_PER_SECTION) {
> > +   struct definit_args arg = { zone, 
> > ATOMIC_LONG_INIT(0) };
> > +   struct padata_mt_job job = {
> > +   .thread_fn   = deferred_init_memmap_chunk,
> > +   .fn_arg  = &arg,
> > +   .start   = spfn,
> > +   .size= epfn_align - spfn,
> > +   .align   = PAGES_PER_SECTION,
> > +   .min_chunk   = PAGES_PER_SECTION,
> > +   .max_threads = max_threads,
> > +   };
> > +
> > +   padata_do_multithreaded(&job);
> > +   nr_pages += atomic_long_read(&arg.nr_pages);
> > +   spfn = epfn_align;
> > +   }
> > +
> > nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
> > cond_resched();
> > }
> 
> This doesn't look right. You are basically adding threads in addition
> to calls to deferred_init_maxorder.

The deferred_init_maxorder call is the

Re: Backporting "padata: Remove broken queue flushing"

2020-05-21 Thread Daniel Jordan
On Thu, May 21, 2020 at 10:00:46AM +0200, Greg Kroah-Hartman wrote:
> but these:
> 
> > [3.16-4.19] 6fc4dbcf0276 padata: Replace delayed timer with immediate 
> > workqueue in padata_reorder
> > [3.16-4.19] ec9c7d19336e padata: initialize pd->cpu with effective cpumask
> > [3.16-4.19] 065cf577135a padata: purge get_cpu and reorder_via_wq from 
> > padata_do_serial
> 
> Need some non-trivial backporting.  Can you, or someone else do it so I
> can queue them up?  I don't have the free time at the moment, sorry.

Sure, I'll do these three.

Daniel


[PATCH v2 1/7] padata: remove exit routine

2020-05-20 Thread Daniel Jordan
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index a6afa12fb75ee..835919c745266 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1072,10 +1072,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-   cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-   cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



[PATCH v2 0/7] padata: parallelize deferred page init

2020-05-20 Thread Daniel Jordan
Deferred struct page init is a bottleneck in kernel boot--the biggest
for us and probably others.  Optimizing it maximizes availability for
large-memory systems and allows spinning up short-lived VMs as needed
without having to leave them running.  It also benefits bare metal
machines hosting VMs that are sensitive to downtime.  In projects such
as VMM Fast Restart[1], where guest state is preserved across kexec
reboot, it helps prevent application and network timeouts in the guests.

So, multithread deferred init to take full advantage of system memory
bandwidth.

Extend padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 7.

This series is the first step in a project to address other memory
proportional bottlenecks in the kernel such as pmem struct page init,
vfio page pinning, hugetlb fallocate, and munmap.  Deferred page init
doesn't require concurrency limits, resource control, or priority
adjustments like these other users will because it happens during boot
when the system is otherwise idle and waiting for page init to finish.

This has been run on a variety of x86 systems and speeds up kernel boot
by 3% to 49%, saving up to 1.6 out of 4 seconds.  Patch 5 has more
numbers.

Please review and test, and thanks to Alex, Andrew, Josh, and Pavel for
their feedback in the last version.

The powerpc and s390 lists are included in case they want to give this a
try; they had enabled this feature back when it was configured per arch.

Series based on 5.7-rc6 plus these three from mmotm

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

and it's available here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v2
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v2

and the future users and related features are available as
work-in-progress:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.4
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.4

v2:
 - Improve the problem statement (Andrew, Josh, Pavel)
 - Add T-b's to unchanged patches (Josh)
 - Fully initialize max-order blocks to avoid buddy issues (Alex)
 - Parallelize on section-aligned boundaries to avoid potential
   false sharing (Alex)
 - Return the maximum thread count from a function that architectures
   can override, with the generic version returning 1 (current
   behavior).  Override for x86 since that's the only arch this series
   has been tested on so far.  Other archs can test with more threads
   by dropping patch 6.
 - Rebase to v5.7-rc6, rerun tests

RFC v4 [2] -> v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf
https://www.youtube.com/watch?v=pBsHnf93tcQ

https://lore.kernel.org/linux-mm/1588812129-8596-1-git-send-email-anthony.yzn...@oracle.com/

[2] 
https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (7):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: parallelize deferred_init_memmap()
  mm: make deferred init's max threads arch-specific
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 arch/x86/mm/init_64.c |  12 ++
 include/linux/memblock.h  |   3 +
 include/linux/padata.h|  43 -
 init/main.c   |   2 +
 kernel/padata.c   | 277 --
 mm/Kconfig|   6 +-
 mm/page_alloc.c   |  67 +++-
 8 files changed, 373 insertions(+), 78 deletions(-)


base-commit: b9bbe6ed63b2b9f2c9ee5cbd0f2c946a2723f4ce
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620
-- 
2.26.2



[PATCH v2 5/7] mm: parallelize deferred_init_memmap()

2020-05-20 Thread Daniel Jordan
  5.0)  71.8% 68.7 (  2.1)
 100% ( 16)  19.8%805.3 ( 10.8)  76.4% 57.3 ( 15.9)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  16G/node = 16G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --722.3 (  9.5) -- 50.7 (  0.6)
  25% (  1)  -3.3%746.3 (  4.7)  -2.0% 51.7 (  1.2)
  50% (  2)   0.2%721.0 ( 11.3)  29.6% 35.7 (  4.9)
  75% (  3)  -0.3%724.3 ( 11.2)  48.7% 26.0 (  0.0)
 100% (  4)   3.0%700.3 ( 13.6)  55.9% 22.3 (  0.6)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  14G/node = 14G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --673.0 (  6.9) -- 57.0 (  1.0)
  25% (  1)  -0.6%677.3 ( 19.8)   1.8% 56.0 (  1.0)
  50% (  2)   3.4%650.0 (  3.6)  36.8% 36.0 (  5.2)
  75% (  3)   4.2%644.7 (  7.6)  56.1% 25.0 (  1.0)
 100% (  4)   5.3%637.0 (  5.6)  63.2% 21.0 (  0.0)

On Josh's 96-CPU and 192G memory system:

Without this patch series:
[0.487132] node 0 initialised, 23398907 pages in 292ms
[0.499132] node 1 initialised, 24189223 pages in 304ms
...
[0.629376] Run /sbin/init as init process

With this patch series:
[0.227868] node 0 initialised, 23398907 pages in 28ms
[0.230019] node 1 initialised, 24189223 pages in 28ms
...
[0.361069] Run /sbin/init as init process

[1] 
https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

Signed-off-by: Daniel Jordan 
---
 mm/Kconfig  |  6 ++---
 mm/page_alloc.c | 60 -
 2 files changed, 58 insertions(+), 8 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index c1acc34c1c358..04c1da3f9f44c 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -750,13 +750,13 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
  amount of time. If this option is set, large machines will bring up
- a subset of memmap at boot and then initialise the rest in parallel
- by starting one-off "pgdatinitX" kernel thread for each node X. This
- has a potential performance impact on processes running early in the
+ a subset of memmap at boot and then initialise the rest in parallel.
+ This has a potential performance impact on tasks running early in the
  lifetime of the system until these kthreads finish the
  initialisation.
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0c0d9364aa6d..9cb780e8dec78 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -68,6 +68,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1814,16 +1815,44 @@ deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
return nr_pages;
 }
 
+struct definit_args {
+   struct zone *zone;
+   atomic_long_t nr_pages;
+};
+
+static void __init
+deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
+  void *arg)
+{
+   unsigned long spfn, epfn, nr_pages = 0;
+   struct definit_args *args = arg;
+   struct zone *zone = args->zone;
+   u64 i;
+
+   deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn, start_pfn);
+
+   /*
+* Initialize and free pages in MAX_ORDER sized increments so that we
+* can avoid introducing any issues with the buddy allocator.
+*/
+   while (spfn < end_pfn) {
+   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
+   cond_resched();
+   }
+
+   atomic_long_add(nr_pages, &args->nr_pages);
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
pg_data_t *pgdat = data;
const struct cpumask *cpumask = cpumask_of_node(pgdat->node_id);
unsigned long spfn = 0, epfn = 0, nr_pages = 0;
-   unsigned long first_init_pfn, flags;
+   unsigned long fi

[PATCH v2 6/7] mm: make deferred init's max threads arch-specific

2020-05-20 Thread Daniel Jordan
Using padata during deferred init has only been tested on x86, so for
now limit it to this architecture.

If another arch wants this, it can find the max thread limit that's best
for it and override deferred_page_init_max_threads().
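
As a purely hypothetical illustration (not part of this series), an override
capping deferred init at, say, 8 threads per node would look like this:

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
/* Hypothetical arch override; the cap of 8 is made up for illustration. */
int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
{
        return clamp_t(int, cpumask_weight(node_cpumask), 1, 8);
}
#endif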

Signed-off-by: Daniel Jordan 
---
 arch/x86/mm/init_64.c| 12 
 include/linux/memblock.h |  3 +++
 mm/page_alloc.c  | 13 -
 3 files changed, 23 insertions(+), 5 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 8b5f73f5e207c..2d749ec12ea8a 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1260,6 +1260,18 @@ void __init mem_init(void)
mem_init_print_info(NULL);
 }
 
+#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   /*
+* More CPUs always led to greater speedups on tested systems, up to
+* all the nodes' CPUs.  Use all since the system is otherwise idle
+* now.
+*/
+   return max_t(int, cpumask_weight(node_cpumask), 1);
+}
+#endif
+
 int kernel_set_to_readonly;
 
 void mark_rodata_ro(void)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 6bc37a731d27b..2b289df44194f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -275,6 +275,9 @@ void __next_mem_pfn_range_in_zone(u64 *idx, struct zone *zone,
 #define for_each_free_mem_pfn_range_in_zone_from(i, zone, p_start, p_end) \
for (; i != U64_MAX;  \
 __next_mem_pfn_range_in_zone(&i, zone, p_start, p_end))
+
+int __init deferred_page_init_max_threads(const struct cpumask *node_cpumask);
+
 #endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */
 
 /**
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9cb780e8dec78..0d7d805f98b2d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1843,6 +1843,13 @@ deferred_init_memmap_chunk(unsigned long start_pfn, unsigned long end_pfn,
atomic_long_add(nr_pages, &args->nr_pages);
 }
 
+/* An arch may override for more concurrency. */
+__weak int __init
+deferred_page_init_max_threads(const struct cpumask *node_cpumask)
+{
+   return 1;
+}
+
 /* Initialise remaining memory on a node */
 static int __init deferred_init_memmap(void *data)
 {
@@ -1891,11 +1898,7 @@ static int __init deferred_init_memmap(void *data)
 first_init_pfn))
goto zone_empty;
 
-   /*
-* More CPUs always led to greater speedups on tested systems, up to
-* all the nodes' CPUs.  Use all since the system is otherwise idle now.
-*/
-   max_threads = max(cpumask_weight(cpumask), 1u);
+   max_threads = deferred_page_init_max_threads(cpumask);
 
while (spfn < epfn) {
epfn_align = ALIGN_DOWN(epfn, PAGES_PER_SECTION);
-- 
2.26.2



[PATCH v2 2/7] padata: initialize earlier

2020-05-20 Thread Daniel Jordan
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |  6 ++
 init/main.c|  2 ++
 kernel/padata.c| 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #definePADATA_INVALID  4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index 03371976d3872..8ab521f7af5d2 100644
--- a/init/main.c
+++ b/init/main.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1482,6 +1483,7 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();
 
+   padata_init();
page_alloc_init_late();
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 835919c745266..6f709bc0fc413 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MAX_OBJ_NUM 1000
 
@@ -1050,26 +1049,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
int ret;
 
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
  padata_cpu_online, NULL);
if (ret < 0)
-   return ret;
+   goto err;
hp_online = ret;
 
ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
  NULL, padata_cpu_dead);
if (ret < 0) {
cpuhp_remove_multi_state(hp_online);
-   return ret;
+   goto err;
}
-   return 0;
-}
-module_init(padata_driver_init);
 
+   return;
+err:
+   pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



[PATCH v2 7/7] padata: document multithreaded jobs

2020-05-20 Thread Daniel Jordan
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan 
---
 Documentation/core-api/padata.rst | 41 +++
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d9..b7e047af993e8 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 ===
 
-:Date: December 2019
+:Date: April 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+===
 
 Initializing
 
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run parallel jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
 #include 
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically a stack-allocated structure that wraps the required data.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =
 
-- 
2.26.2
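
To make the steps in the documentation above concrete, here is a hedged
sketch of a caller.  Only struct padata_mt_job and padata_do_multithreaded()
come from this series (and in this series they are __init-only); the widget
data, widget_setup(), and the sizes are made up for illustration.

#include <linux/padata.h>

struct widget { unsigned long val; };           /* made-up payload */

struct widget_state {                           /* shared job state */
        struct widget *widgets;
};

static void __init widget_setup(struct widget *w)
{
        w->val = 0;
}

/* Thread function: called once per chunk with a [start, end) range. */
static void __init widget_setup_chunk(unsigned long start, unsigned long end,
                                      void *arg)
{
        struct widget_state *ws = arg;
        unsigned long i;

        for (i = start; i < end; i++)
                widget_setup(&ws->widgets[i]);
}

static void __init widget_setup_all(struct widget *widgets, unsigned long nr)
{
        struct widget_state ws = { .widgets = widgets };
        /* Describe the job. */
        struct padata_mt_job job = {
                .thread_fn   = widget_setup_chunk,
                .fn_arg      = &ws,
                .start       = 0,
                .size        = nr,
                .align       = 1,               /* no alignment requirement */
                .min_chunk   = 1024,            /* made-up minimum worth one call */
                .max_threads = num_online_cpus(),
        };

        /* Run it; returns when the main thread and all helpers have finished. */
        padata_do_multithreaded(&job);
}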



[PATCH v2 4/7] padata: add basic support for multithreaded jobs

2020-05-20 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |  29 
 kernel/padata.c| 152 -
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
struct list_headlist;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ * possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ * the client to communicate the minimum amount of work that's
+ * appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *   depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+   void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+   void*fn_arg;
+   unsigned long   start;
+   unsigned long   size;
+   unsigned long   align;
+   unsigned long   min_chunk;
+   int max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
  struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index 78ff9aa529204..e78f57d9aef90 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +36,8 @@
 #include 
 #include 
 
+#definePADATA_WORK_ONSTACK 1   /* Work's memory is on stack */
+
 struct padata_work {
struct work_struct  pw_work;
struct list_headpw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+   spinlock_t  lock;
+   struct completion   completion;
+   struct padata_mt_job*job;
+   int nworks;
+   int nworks_fini;
+   unsigned long   chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-void *data)
+void *data, int flags)
 {
-   INIT_WORK(&pw->pw_work, work_fn);
+   if (flags & PADATA_WORK_ONSTACK)
+

[PATCH v2 3/7] padata: allocate work structures for parallel jobs from a pool

2020-05-20 Thread Daniel Jordan
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but it makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and it isn't needed at all for multithreaded jobs
because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.
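
A rough sketch of that fallback as it would sit in padata_do_parallel() (the
relevant hunk is truncated below, so this is illustrative rather than the
patch's exact code):

        struct padata_work *pw;

        spin_lock(&padata_works_lock);
        pw = padata_work_alloc();       /* NULL once the global pool is empty */
        spin_unlock(&padata_works_lock);

        if (pw) {
                padata_work_init(pw, padata_parallel_worker, padata);
                queue_work(pinst->parallel_wq, &pw->pw_work);
        } else {
                /* Pool exhausted: throttle by running in the current task. */
                padata->parallel(padata);
        }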

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |   8 +--
 kernel/padata.c| 118 +++--
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
struct list_headlist;
struct parallel_data*pd;
int cb_cpu;
-   int cpu;
unsigned intseq_nr;
int info;
void(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-   struct padata_listparallel;
struct padata_listreorder;
-   struct work_structwork;
atomic_t  num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
struct padata_parallel_queue__percpu *pqueue;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
-   atomic_tseq_nr;
+   unsigned intseq_nr;
unsigned intprocessed;
int cpu;
struct padata_cpumask   cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index 6f709bc0fc413..78ff9aa529204 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include 
 #include 
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+   struct work_struct  pw_work;
+   struct list_headpw_list;  /* padata_free_works linkage */
+   void*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-   struct padata_parallel_queue *pqueue;
-   LIST_HEAD(local_list);
+   struct padata_work *pw;
 
-   local_bh_disable();
-   pqueue = container_of(parallel_work,
- struct padata_parallel_queue, work);
+   lockdep_assert_held(&padata_works_lock);
 
-   spin_lock(&pqueue->parallel.lock);
-   list_replace_init(&pqueue->parallel.list, &local_list);
-   spin_unlock(&pqueue->parallel.lock);
+   if (list_empty(&padata_free_works))
+   return NULL;/* No more work items allowed to be queued. */
 
-   while (!list_empty(&local_list)) {
-   struct padata_priv *padata;
+   pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+   list_del(&pw->pw_list);
+   return pw;
+}
 
-   padata = list_entry(local_list.next,
-   struct padata_priv, list);
+static void padata_work_init(struc

Re: Backporting "padata: Remove broken queue flushing"

2020-05-19 Thread Daniel Jordan
Hello Ben,

On Tue, May 19, 2020 at 02:53:05PM +0100, Ben Hutchings wrote:
> I noticed that commit 07928d9bfc81 "padata: Remove broken queue
> flushing" has been backported to most stable branches, but commit
> 6fc4dbcf0276 "padata: Replace delayed timer with immediate workqueue in
> padata_reorder" has not.
>
> Is this correct?  What prevents the parallel_data ref-count from
> dropping to 0 while the timer is scheduled?

Doesn't seem like anything does, looking at 4.19.

I can see a race where the timer function uses a parallel_data after free
whether or not the refcount goes to 0.  Don't think it's likely to happen in
practice because of how small the window is between the serial callback
finishing and the timer being deactivated.


   task1:
   padata_reorder
  task2:
  padata_do_serial
// object arrives in reorder queue
 // sees reorder_objects > 0,
 //   set timer for 1 second
 mod_timer
 return
padata_reorder
  // queue serial work, which finishes
  //   (now possibly no more objects
  //left)
  |
   task1: |
   // pd is freed one of two ways:|
   //   1) pcrypt is unloaded |
   //   2) padata_replace triggered   |
   //  from userspace | (small window)
  |
   task3: |
   padata_reorder_timer   |
 // uses pd after free|
  |
  del_timer  // too late


If I got this right we might want to backport the commit you mentioned to be on
the safe side.


Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-05-07 Thread Daniel Jordan
On Thu, May 07, 2020 at 02:18:42PM -0700, Alexander Duyck wrote:
> The idea behind merging ranges it to address possible cases where a
> range is broken up such that there is a hole in a max order block as a
> result.

Gah, yes, you're right, there could be multiple ranges in a max order block, so
the threads have to use the zone iterators to skip the holes.

> By combining the ranges if they both span the same section we
> can guarantee that the entire section will be initialized as a block
> and not potentially have partially initialized sections floating
> around. Without that mo_pfn logic I had in there I was getting panics
> every so often when booting up one of my systems as I recall.
> 
> Also the iterator itself is cheap. It is basically just walking a
> read-only list so it scales efficiently as well. One of the reasons

Agreed, it's not expensive, it's just gnarliness I was hoping to avoid, but
obviously it's not gonna work.

> why I arranged the code the way I did is that it also allowed me to
> get rid of an extra check in the code as the previous code was having
> to verify if the pfn belonged to the node. That is all handled
> directly through the for_each_free_mem_pfn_range_in_zone[_from] call
> now.
> 
> > With the series as it stands plus leaving in the section alignment check in
> > deferred_grow_zone (which I think could be relaxed to a maxorder alignment
> > check) so it doesn't stop mid-max-order-block, threads simply deal with a
> > start/end range and deferred_init_maxorder becomes shorter and simpler too.
> 
> I still think we are better off initializing complete sections since
> the pageblock_flags are fully initialized that way as well.

Fair enough.

> What
> guarantee do you have that all of the memory ranges will be max order
> aligned?

Sure, it's a problem with multiple ranges in a maxorder block, the rest
could've been handled.

> The problem is we have to guarantee all pages are initialized
> before we start freeing the pages in a max order page. If we just
> process each block as-is I believe we can end up with some
> architectures trying to access uninitialized memory in the buddy
> allocator as a result. That is why the deferred_init_maxorder function
> will walk through the iterator, using the _from version to avoid
> unnecessary iteration, the first time initializing the pages it needs
> to cross that max order boundary, and then again to free the max order
> block of pages that have been initialized. The iterator itself is
> farily cheap and only has to get you through the smaller ranges before
> you end up at the one big range that it just kind of sits at while it
> is working on getting it processed.

Right.


Ok, I think we're on the same page for the next version.  Thanks for the
thorough review!


Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-05-07 Thread Daniel Jordan
On Thu, May 07, 2020 at 08:26:26AM -0700, Alexander Duyck wrote:
> On Wed, May 6, 2020 at 3:39 PM Daniel Jordan  
> wrote:
> > On Tue, May 05, 2020 at 08:27:52AM -0700, Alexander Duyck wrote:
> > > > Maybe it's better to leave deferred_init_maxorder alone and adapt the
> > > > multithreading to the existing implementation.  That'd mean dealing 
> > > > with the
> > > > pesky opaque index somehow, so deferred_init_mem_pfn_range_in_zone() 
> > > > could be
> >
> > I should have been explicit, was thinking of @i from
> > for_each_free_mem_pfn_range_in_zone_from() when mentioning the opaque index.
> 
> Okay, that makes sense. However in reality you don't need to split
> that piece out. All you really are doing is splitting up the
> first_init_pfn value over multiple threads so you just need to make
> use of deferred_init_mem_pfn_range_in_zone() to initialize it.

Ok, I assume you mean that each thread should use
deferred_init_mem_pfn_range_in_zone.  Yes, that's what I meant when saying that
function could be generalized, though not sure we should opt for this.

> > > > generalized to find it in the thread function based on the start/end 
> > > > range, or
> > > > it could be maintained as part of the range that padata passes to the 
> > > > thread
> > > > function.
> > >
> > > You may be better off just implementing your threads to operate like
> > > deferred_grow_zone does. All your worker thread really needs then is
> > > to know where to start performing the page initialization and then it
> > > could go through and process an entire section worth of pages. The
> > > other bit that would have to be changed is patch 6 so that you combine
> > > any ranges that might span a single section instead of just splitting
> > > the work up based on the ranges.
> >
> > How are you thinking of combining them?  I don't see a way to do it without
> > storing an arbitrary number of ranges somewhere for each thread.
> 
> So when you are putting together your data you are storing a starting
> value and a length. All you end up having to do is make certain that
> the size + start pfn is section aligned. Then if you jump to a new
> section you have the option of either adding to the size of your
> current section or submitting the range and starting with a new start
> pfn in a new section. All you are really doing is breaking up the
> first_deferred_pfn over multiple sections. What I would do is section
> align end_pfn, and then check the next range from the zone. If the
> start_pfn of the next range is less than end_pfn you merge the two
> ranges by just increasing the size, otherwise you could start a new
> range.
> 
> The idea is that you just want to define what the valid range of PFNs
> are, and if there are sizable holes you skip over them. You would
> leave most of the lifting for identifying exactly what PFNs to
> initialize to the pfn_range_in_zone iterators since they would all be
> read-only accesses anyway.

Ok, I follow you.  My assumption is that there are generally few free pfn
ranges relative to the total number of pfns being initialized so that it's
efficient to parallelize over a single pfn range from the zone iterator.  On
the systems I tested, there were about 20 tiny ranges and one enormous range
per node so that firing off a job per range kept things simple without
affecting performance.  If that assumption holds, I'm not sure it's worth it to
merge ranges.

With the series as it stands plus leaving in the section alignment check in
deferred_grow_zone (which I think could be relaxed to a maxorder alignment
check) so it doesn't stop mid-max-order-block, threads simply deal with a
start/end range and deferred_init_maxorder becomes shorter and simpler too.
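
A sketch of the merging Alexander describes above, for illustration only: it
is not code from the series, and submit_range() is a made-up stand-in for
handing a merged, section-aligned range to a thread.

        unsigned long job_start = 0, job_end = 0;
        unsigned long spfn, epfn;
        u64 i;

        for_each_free_mem_pfn_range_in_zone(i, zone, &spfn, &epfn) {
                if (job_end && spfn >= job_end) {
                        /* Next range starts past the aligned end: flush. */
                        submit_range(job_start, job_end);
                        job_end = 0;
                }
                if (!job_end)
                        job_start = spfn;
                /* Grow the job, keeping its end section aligned. */
                job_end = ALIGN(epfn, PAGES_PER_SECTION);
        }
        if (job_end)
                submit_range(job_start, job_end);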


Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-06 Thread Daniel Jordan
On Wed, May 06, 2020 at 06:43:35PM -0400, Daniel Jordan wrote:
> On Wed, May 06, 2020 at 03:36:54PM -0700, Alexander Duyck wrote:
> > On Wed, May 6, 2020 at 3:21 PM Daniel Jordan  
> > wrote:
> > >
> > > On Tue, May 05, 2020 at 07:55:43AM -0700, Alexander Duyck wrote:
> > > > One question about this data. What is the power management
> > > > configuration on the systems when you are running these tests? I'm
> > > > just curious if CPU frequency scaling, C states, and turbo are
> > > > enabled?
> > >
> > > Yes, intel_pstate is loaded in active mode without hwp and with turbo 
> > > enabled
> > > (those power management docs are great by the way!) and intel_idle is in 
> > > use
> > > too.
> > >
> > > > I ask because that is what I have seen usually make the
> > > > difference in these kind of workloads as the throughput starts
> > > > dropping off as you start seeing the core frequency lower and more
> > > > cores become active.
> > >
> > > If I follow, you're saying there's a chance performance would improve 
> > > with the
> > > above disabled, but how often would a system be configured that way?  
> > > Even if
> > > it were faster, the machine is configured how it's configured, or am I 
> > > missing
> > > your point?
> > 
> > I think you might be missing my point. What I was getting at is that I
> > know for performance testing sometimes C states and P states get
> > disabled in order to get consistent results between runs, it sounds
> > like you have them enabled though. I was just wondering if you had
> > disabled them or not. If they were disabled then you wouldn't get the
> > benefits of turbo and as such adding more cores wouldn't come at a
> > penalty, while with it enabled the first few cores should start to
> > slow down as they fell out of turbo mode. So it may be part of the
> > reason why you are only hitting about 10x at full core count.

I checked the memory bandwidth of the biggest system, the Skylake.  Couldn't
find official specs for it, all I could quickly find were stream results from a
blog post of ours that quoted a range of about 123-145 GB/s over both nodes
when compiling with gcc.  That's with all CPUs.

Again using all CPUs, multithreaded page init is doing 41 GiB/s per node
assuming it's just touching the 64 bytes of each page struct.  Since there's
surely more memory traffic than struct page alone, it seems another part of
the reason for only 10x is that we're bottlenecked on memory bandwidth.
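
As a back-of-envelope check of that figure, using the Skylake numbers posted
earlier in the thread (384G/node, 144.3 ms deferred init at full thread count)
and assuming 4 KiB pages and a 64-byte struct page:

        384 GiB / 4 KiB per page  ~= 100.7 million struct pages per node
        100.7e6 pages * 64 B      ~=   6.0 GiB of struct page data per node
        6.0 GiB / 0.1443 s        ~=  41.6 GiB/s per node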


Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-06 Thread Daniel Jordan
On Wed, May 06, 2020 at 03:36:54PM -0700, Alexander Duyck wrote:
> On Wed, May 6, 2020 at 3:21 PM Daniel Jordan  
> wrote:
> >
> > On Tue, May 05, 2020 at 07:55:43AM -0700, Alexander Duyck wrote:
> > > One question about this data. What is the power management
> > > configuration on the systems when you are running these tests? I'm
> > > just curious if CPU frequency scaling, C states, and turbo are
> > > enabled?
> >
> > Yes, intel_pstate is loaded in active mode without hwp and with turbo 
> > enabled
> > (those power management docs are great by the way!) and intel_idle is in use
> > too.
> >
> > > I ask because that is what I have seen usually make the
> > > difference in these kind of workloads as the throughput starts
> > > dropping off as you start seeing the core frequency lower and more
> > > cores become active.
> >
> > If I follow, you're saying there's a chance performance would improve with 
> > the
> > above disabled, but how often would a system be configured that way?  Even 
> > if
> > it were faster, the machine is configured how it's configured, or am I 
> > missing
> > your point?
> 
> I think you might be missing my point. What I was getting at is that I
> know for performance testing sometimes C states and P states get
> disabled in order to get consistent results between runs, it sounds
> like you have them enabled though. I was just wondering if you had
> disabled them or not. If they were disabled then you wouldn't get the
> benefits of turbo and as such adding more cores wouldn't come at a
> penalty, while with it enabled the first few cores should start to
> slow down as they fell out of turbo mode. So it may be part of the
> reason why you are only hitting about 10x at full core count.

All right, that makes way more sense.

> As it stands I think your code may speed up a bit if you split the
> work up based on section instead of max order. That would get rid of
> any cache bouncing you may be doing on the pageblock flags and reduce
> the overhead for splitting the work up into individual pieces since
> each piece will be bigger.

See my other mail.


Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-05-06 Thread Daniel Jordan
On Tue, May 05, 2020 at 08:27:52AM -0700, Alexander Duyck wrote:
> As it turns out that deferred_free_range will be setting the
> migratetype for the page. In a sparse config the migratetype bits are
> stored in the section bitmap. So to avoid cacheline bouncing it would
> make sense to section align the tasks so that they only have one
> thread touching one section rather than having the pageblock_flags
> getting bounced between threads.

That's a good point, I'll change the alignment.

I kicked off some runs on the Skylake bare metal system to check how this did
and the performance stayed the same, but see below.

> It should also reduce the overhead
> for having to parallelize the work in the first place since a section
> is several times larger than a MAX_ORDER page and allows for more
> batching of the work.

I think you may be assuming that threads work in MAX_ORDER batches, maybe
because that's the job's min_chunk, but padata works differently.  The
min_chunk is a lower bound that establishes the smallest amount of work that
makes sense for a thread to do in one go, so in this case it's useful to
prevent starting large numbers of threads to initialize a tiny amount of pages.

Internally padata uses total job size and min chunk to arrive at the chunk
size, which on big machines will be much larger than min_chunk.  The idea is
the chunk size should be large enough to minimize multithreading overhead but
small enough to permit load balancing between threads.

This is probably why the results didn't change much when aligning by section,
but that doesn't mean other systems won't benefit.
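
A sketch of the kind of computation being described; the factor of two chunks
per worker is an assumption for illustration, not something taken from the
series:

        unsigned long nworks, chunk_size;

        nworks = max(job->size / job->min_chunk, 1ul);  /* don't over-thread tiny jobs */
        nworks = min(nworks, (unsigned long)job->max_threads);

        chunk_size = job->size / (nworks * 2);          /* several chunks per worker */
        chunk_size = max(chunk_size, job->min_chunk);   /* honor the caller's minimum */
        chunk_size = roundup(chunk_size, job->align);   /* and its alignment */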

> > Maybe it's better to leave deferred_init_maxorder alone and adapt the
> > multithreading to the existing implementation.  That'd mean dealing with the
> > pesky opaque index somehow, so deferred_init_mem_pfn_range_in_zone() could 
> > be

I should have been explicit, was thinking of @i from
for_each_free_mem_pfn_range_in_zone_from() when mentioning the opaque index.

> > generalized to find it in the thread function based on the start/end range, 
> > or
> > it could be maintained as part of the range that padata passes to the thread
> > function.
> 
> You may be better off just implementing your threads to operate like
> deferred_grow_zone does. All your worker thread really needs then is
> to know where to start performing the page initialization and then it
> could go through and process an entire section worth of pages. The
> other bit that would have to be changed is patch 6 so that you combine
> any ranges that might span a single section instead of just splitting
> the work up based on the ranges.

How are you thinking of combining them?  I don't see a way to do it without
storing an arbitrary number of ranges somewhere for each thread.

> If you are referring to the mo_pfn you shouldn't even need to think
> about it.

(clarified "opaque index" above)

> All it is doing is guaranteeing you are processing at least
> a full max order worth of pages. Without that the logic before was
> either process a whole section, or just process all of memory
> initializing it before it started freeing it. I found it made things
> much more efficient to process only up to MAX_ORDER at a time as you
> could squeeze that into the L2 cache for most x86 processors at least
> and it reduced the memory bandwidth by quite a bit.

Yes, that was clever, we should keep doing it that way.


Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-06 Thread Daniel Jordan
On Tue, May 05, 2020 at 07:55:43AM -0700, Alexander Duyck wrote:
> One question about this data. What is the power management
> configuration on the systems when you are running these tests? I'm
> just curious if CPU frequency scaling, C states, and turbo are
> enabled?

Yes, intel_pstate is loaded in active mode without hwp and with turbo enabled
(those power management docs are great by the way!) and intel_idle is in use
too.

> I ask because that is what I have seen usually make the
> difference in these kind of workloads as the throughput starts
> dropping off as you start seeing the core frequency lower and more
> cores become active.

If I follow, you're saying there's a chance performance would improve with the
above disabled, but how often would a system be configured that way?  Even if
it were faster, the machine is configured how it's configured, or am I missing
your point?


Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-04 Thread Daniel Jordan
On Mon, May 04, 2020 at 09:48:44PM -0400, Daniel Jordan wrote:
> On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote:
> > On Mon, May 4, 2020 at 4:44 PM Josh Triplett  wrote:
> > >
> > > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck 
> > >  wrote:
> > > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > > > wrote:
> > > >> /*
> > > >> -* Initialize and free pages in MAX_ORDER sized increments so
> > > >> -* that we can avoid introducing any issues with the buddy
> > > >> -* allocator.
> > > >> +* More CPUs always led to greater speedups on tested
> > > >systems, up to
> > > >> +* all the nodes' CPUs.  Use all since the system is
> > > >otherwise idle now.
> > > >>  */
> > > >
> > > >I would be curious about your data. That isn't what I have seen in the
> > > >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> > > >beyond that I was usually cache/memory bandwidth bound.
> 
> On Skylake it took more than 8 or 10 CPUs, though on other machines the 
> benefit
> of using all versus half or 3/4 of the CPUs is less significant.
> 
> Given that the rest of the system is idle at this point, my main concern is
> whether other archs regress past a certain thread count.

Reposting the data to be consistent with the way the percentages are reported
in the changelog.


Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
  2 nodes * 26 cores * 2 threads = 104 CPUs
  384G/node = 768G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --   4056.7 (  5.5) --   1763.3 (  4.2)
   2% (  1)  -2.4%   4153.3 (  2.5)  -5.6%   1861.7 (  5.5)
  12% (  6)  35.0%   2637.7 ( 38.7)  80.3%346.7 ( 37.5)
  25% ( 13)  38.4%   2497.3 ( 38.5)  88.1%210.0 ( 41.8)
  37% ( 19)  38.9%   2477.0 ( 19.0)  89.5%185.3 ( 21.5)
  50% ( 26)  39.1%   2471.7 ( 21.4)  89.8%179.7 ( 25.8)
  75% ( 39)  39.5%   2455.7 ( 33.2)  90.8%161.7 ( 29.3)
 100% ( 52)  39.9%   2436.7 (  2.1)  91.8%144.3 (  5.9)


Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
  1 node * 16 cores * 2 threads = 32 CPUs
  192G/node = 192G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --   1957.3 ( 14.0) --   1093.7 ( 12.9)
   3% (  1)   1.4%   1930.7 ( 10.0)   3.7%   1053.3 (  7.6)
  12% (  4)  41.2%   1151.7 (  9.0)  74.5%278.7 (  0.6)
  25% (  8)  46.3%   1051.0 (  7.8)  83.7%178.0 (  2.6)
  38% ( 12)  48.7%   1003.3 (  7.6)  87.0%141.7 (  3.8)
  50% ( 16)  48.2%   1014.3 ( 20.0)  87.8%133.3 (  3.2)
  75% ( 24)  49.5%989.3 (  6.7)  88.4%126.3 (  1.5)
 100% ( 32)  49.1%996.0 (  7.2)  88.4%127.3 (  5.1)


Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
  2 nodes * 18 cores * 2 threads = 72 CPUs
  128G/node = 256G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --   1666.0 (  3.5) --618.0 (  3.5)
   3% (  1)   1.0%   1649.7 (  1.5)   2.9%600.0 (  1.0)
  11% (  4)  25.9%   1234.7 ( 21.4)  70.4%183.0 ( 22.5)
  25% (  9)  29.6%   1173.0 ( 10.0)  80.7%119.3 (  9.6)
  36% ( 13)  30.8%   1153.7 ( 17.0)  84.0% 99.0 ( 15.6)
  50% ( 18)  31.0%   1150.3 ( 15.5)  84.3% 97.3 ( 16.2)
  75% ( 27)  31.0%   1150.3 (  2.5)  84.6% 95.0 (  5.6)
 100% ( 36)  31.3%   1145.3 (  1.5)  85.6% 89.0 (  1.7)


AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 8 cores * 2 threads = 16 CPUs
  64G/node = 64G memory

   kernel boot deferred init
   
node% (thr)      speedup  time_ms (stdev)      speedup  time_ms (stdev)
  (  0) --   1029.7 ( 42.3) --253.7 (  3.1)
   6% (  1)   3.3%995.3 ( 21.4)   4.3%242.7 (  5.5)
  12% (  2)  14.0%885.7 ( 24.4)  46.4% 

Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-04 Thread Daniel Jordan
On Mon, May 04, 2020 at 05:40:19PM -0700, Alexander Duyck wrote:
> On Mon, May 4, 2020 at 4:44 PM Josh Triplett  wrote:
> >
> > On May 4, 2020 3:33:58 PM PDT, Alexander Duyck  
> > wrote:
> > >On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > > wrote:
> > >> /*
> > >> -* Initialize and free pages in MAX_ORDER sized increments so
> > >> -* that we can avoid introducing any issues with the buddy
> > >> -* allocator.
> > >> +* More CPUs always led to greater speedups on tested
> > >systems, up to
> > >> +* all the nodes' CPUs.  Use all since the system is
> > >otherwise idle now.
> > >>  */
> > >
> > >I would be curious about your data. That isn't what I have seen in the
> > >past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> > >beyond that I was usually cache/memory bandwidth bound.

On Skylake it took more than 8 or 10 CPUs, though on other machines the benefit
of using all versus half or 3/4 of the CPUs is less significant.

Given that the rest of the system is idle at this point, my main concern is
whether other archs regress past a certain thread count.


Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
  2 nodes * 26 cores * 2 threads = 104 CPUs
  384G/node = 768G memory

                      kernel boot                  deferred init
                 ------------------------     ------------------------
node% (thr)      speedup  time_ms (stdev)     speedup  time_ms (stdev)
      (  0)           --   4056.7 (  5.5)          --   1763.3 (  4.2)
      (  1)        -2.3%   4153.3 (  2.5)       -5.3%   1861.7 (  5.5)
  12% (  6)        53.8%   2637.7 ( 38.7)      408.7%    346.7 ( 37.5)
  25% ( 13)        62.4%   2497.3 ( 38.5)      739.7%    210.0 ( 41.8)
  37% ( 19)        63.8%   2477.0 ( 19.0)      851.4%    185.3 ( 21.5)
  50% ( 26)        64.1%   2471.7 ( 21.4)      881.4%    179.7 ( 25.8)
  75% ( 39)        65.2%   2455.7 ( 33.2)      990.7%    161.7 ( 29.3)
 100% ( 52)        66.5%   2436.7 (  2.1)     1121.7%    144.3 (  5.9)


Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
  1 node * 16 cores * 2 threads = 32 CPUs
  192G/node = 192G memory

                      kernel boot                  deferred init
                 ------------------------     ------------------------
node% (thr)      speedup  time_ms (stdev)     speedup  time_ms (stdev)
      (  0)           --   1957.3 ( 14.0)          --   1093.7 ( 12.9)
      (  1)         1.4%   1930.7 ( 10.0)        3.8%   1053.3 (  7.6)
  12% (  4)        70.0%   1151.7 (  9.0)      292.5%    278.7 (  0.6)
  25% (  8)        86.2%   1051.0 (  7.8)      514.4%    178.0 (  2.6)
  37% ( 12)        95.1%   1003.3 (  7.6)      672.0%    141.7 (  3.8)
  50% ( 16)        93.0%   1014.3 ( 20.0)      720.2%    133.3 (  3.2)
  75% ( 24)        97.8%    989.3 (  6.7)      765.7%    126.3 (  1.5)
 100% ( 32)        96.5%    996.0 (  7.2)      758.9%    127.3 (  5.1)


Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
  2 nodes * 18 cores * 2 threads = 72 CPUs
  128G/node = 256G memory

                      kernel boot                  deferred init
                 ------------------------     ------------------------
node% (thr)      speedup  time_ms (stdev)     speedup  time_ms (stdev)
      (  0)           --   1666.0 (  3.5)          --    618.0 (  3.5)
      (  1)         1.0%   1649.7 (  1.5)        3.0%    600.0 (  1.0)
  12% (  4)        34.9%   1234.7 ( 21.4)      237.7%    183.0 ( 22.5)
  25% (  9)        42.0%   1173.0 ( 10.0)      417.9%    119.3 (  9.6)
  37% ( 13)        44.4%   1153.7 ( 17.0)      524.2%     99.0 ( 15.6)
  50% ( 18)        44.8%   1150.3 ( 15.5)      534.9%     97.3 ( 16.2)
  75% ( 27)        44.8%   1150.3 (  2.5)      550.5%     95.0 (  5.6)
 100% ( 36)        45.5%   1145.3 (  1.5)      594.4%     89.0 (  1.7)


AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 8 cores * 2 threads = 16 CPUs
  64G/node = 64G memory

                      kernel boot                  deferred init
                 ------------------------     ------------------------
node% (thr)      speedup  time_ms (stdev)     speedup  time_ms (stdev)
      (  0)           --   1029.7 ( 42.3)          --    253.7 (  3.1)
      (  1)         3.4%    995.3 ( 21.4)        4.5%    242.7 (  5.5)
  12% (  2)        16.3%    885.7 ( 24.4)       86.5%    136.0 (  5.2)
  25% (  4)        23.3%    835.0 ( 21.5)      195.0%     86.0 (  1.7)
  37% (  6)        28.0%    804.7 ( 15.7)      249.1%     72.7 (  2.1)
  50% (  8)        26.3%    815.3 ( 11.7)      290.3%     65.0 (  3.5)
  75% ( 12)        30.7%    787.7 (  2.1)      284.3%     66.0 (  3.6)
 1

Re: [PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-05-04 Thread Daniel Jordan
On Mon, May 04, 2020 at 03:33:58PM -0700, Alexander Duyck wrote:
> On Thu, Apr 30, 2020 at 1:12 PM Daniel Jordan
> > @@ -1778,15 +1798,25 @@ static int __init deferred_init_memmap(void *data)
> > goto zone_empty;
> >
> > /*
> > -* Initialize and free pages in MAX_ORDER sized increments so
> > -* that we can avoid introducing any issues with the buddy
> > -* allocator.
> > +* More CPUs always led to greater speedups on tested systems, up to
> > +* all the nodes' CPUs.  Use all since the system is otherwise idle 
> > now.
> >  */
> 
> I would be curious about your data. That isn't what I have seen in the
> past. Typically only up to about 8 or 10 CPUs gives you any benefit,
> beyond that I was usually cache/memory bandwidth bound.

I was surprised too!  For most of its development, this set had an interface to
get the number of cores on the theory that this was about where the bandwidth
got saturated, but the data showed otherwise.

There were diminishing returns, but they were more apparent on Haswell than
Skylake for instance.  I'll post some more data later in the thread where you
guys are talking about it.

> 
> > +   max_threads = max(cpumask_weight(cpumask), 1u);
> > +
> 
> We will need to gather data on if having a ton of threads works for
> all architectures.

Agreed.  I'll rope in some of the arch lists in the next version and include
the debugging knob to vary the thread count.
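
(To illustrate the kind of knob I mean -- the name and plumbing here are
hypothetical, just a sketch for testing, not part of this series:

        static unsigned int deferred_init_max_threads __initdata = UINT_MAX;

        static int __init deferred_init_max_threads_setup(char *str)
        {
                return kstrtouint(str, 0, &deferred_init_max_threads);
        }
        early_param("deferred_init_max_threads", deferred_init_max_threads_setup);

and then in deferred_init_memmap():

        max_threads = min(max_threads, deferred_init_max_threads);

so arch folks could boot with deferred_init_max_threads=N and report where the
benefit levels off.)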

> For x86 I think we are freeing back pages in
> pageblock_order sized chunks so we only have to touch them once in
> initialize and then free the two pageblock_order chunks into the buddy
> allocator.
> 
> > for_each_free_mem_pfn_range_in_zone_from(i, zone, &spfn, &epfn) {
> > -   while (spfn < epfn) {
> > -   nr_pages += deferred_init_maxorder(zone, &spfn, 
> > epfn);
> > -   cond_resched();
> > -   }
> > +   struct def_init_args args = { zone, ATOMIC_LONG_INIT(0) };
> > +   struct padata_mt_job job = {
> > +   .thread_fn   = deferred_init_memmap_chunk,
> > +   .fn_arg  = &args,
> > +   .start   = spfn,
> > +   .size= epfn - spfn,
> > +   .align   = MAX_ORDER_NR_PAGES,
> > +   .min_chunk   = MAX_ORDER_NR_PAGES,
> > +   .max_threads = max_threads,
> > +   };
> > +
> > +   padata_do_multithreaded(&job);
> > +   nr_pages += atomic_long_read(&args.nr_pages);
> > }
> >  zone_empty:
> > /* Sanity check that the next zone really is unpopulated */
> 
> Okay so looking at this I can see why you wanted to structure the
> other patch the way you did. However I am not sure that is the best
> way to go about doing it. It might make more sense to go through and
> accumulate sections. If you hit the end of a range and the start of
> the next range is in another section, then you split it as a new job,
> otherwise I would just accumulate it into the current job. You then
> could section align the work and be more or less guaranteed that each
> worker thread should be generating finished work products, and not
> incomplete max order pages.

This guarantee holds now with the max-order alignment passed to padata, so I
don't see what more doing it on section boundaries buys us.
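
For reference, the guarantee comes from padata rounding the chunk size up to
the job's ->align before it splits the range -- roughly, and simplified from
what the padata patch actually does:

        chunk_size = DIV_ROUND_UP(job->size, nworks);
        chunk_size = max(chunk_size, job->min_chunk);
        chunk_size = roundup(chunk_size, job->align);

so with ->align == MAX_ORDER_NR_PAGES every thread's start and end land on a
max-order boundary, except possibly the very beginning and end of the job.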


Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-05-04 Thread Daniel Jordan
On Mon, May 04, 2020 at 03:10:46PM -0700, Alexander Duyck wrote:
> So we cannot stop in the middle of a max order block. That shouldn't
> be possible as part of the issue is that the buddy allocator will
> attempt to access the buddy for the page which could cause issues if
> it tries to merge the page with one that is not initialized. So if
> your code supports that then it is definitely broken. That was one of
> the reasons for all of the variable weirdness in
> deferred_init_maxorder. I was going through and making certain that
> while we were initializing the range we were freeing the pages in
> MAX_ORDER aligned blocks and skipping over whatever reserved blocks
> were there. Basically it was handling the case where a single
> MAX_ORDER block could span multiple ranges.
> 
> On x86 this was all pretty straightforward and I don't believe we
> needed the code, but I seem to recall there were some other
> architectures that had more complex memory layouts at the time and
> that was one of the reasons why I had to be careful to wait until I
> had processed the full MAX_ORDER block before I could start freeing
> the pages, otherwise it would start triggering memory corruptions.

Yes, thanks, I missed the case where deferred_grow_zone could stop
mid-max-order-block.

Maybe it's better to leave deferred_init_maxorder alone and adapt the
multithreading to the existing implementation.  That'd mean dealing with the
pesky opaque index somehow, so deferred_init_mem_pfn_range_in_zone() could be
generalized to find it in the thread function based on the start/end range, or
it could be maintained as part of the range that padata passes to the thread
function.

Or, keep this patch but make sure deferred_grow_zone stops on a
max-order-aligned boundary.
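
A sketch of that second option, reusing the names already in
deferred_grow_zone() (the exact placement of the check is an assumption, not a
tested change):

        while (spfn < epfn) {
                nr_pages += deferred_init_maxorder(zone, &spfn, epfn);
                touch_nmi_watchdog();

                /*
                 * Only stop once the quota is met *and* spfn sits on a
                 * MAX_ORDER boundary, so no block is left half initialized.
                 */
                if (nr_pages >= nr_pages_needed &&
                    IS_ALIGNED(spfn, MAX_ORDER_NR_PAGES))
                        break;
        }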


Re: [PATCH 0/7] padata: parallelize deferred page init

2020-04-30 Thread Daniel Jordan
On Thu, Apr 30, 2020 at 06:09:35PM -0700, Josh Triplett wrote:
> On Thu, Apr 30, 2020 at 04:11:18PM -0400, Daniel Jordan wrote:
> > Sometimes the kernel doesn't take full advantage of system memory
> > bandwidth, leading to a single CPU spending excessive time in
> > initialization paths where the data scales with memory size.
> > 
> > Multithreading naturally addresses this problem, and this series is the
> > first step.
> > 
> > It extends padata, a framework that handles many parallel singlethreaded
> > jobs, to handle multithreaded jobs as well by adding support for
> > splitting up the work evenly, specifying a minimum amount of work that's
> > appropriate for one helper thread to do, load balancing between helpers,
> > and coordinating them.  More documentation in patches 4 and 7.
> > 
> > The first user is deferred struct page init, a large bottleneck in
> > kernel boot--actually the largest for us and likely others too.  This
> > path doesn't require concurrency limits, resource control, or priority
> > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > because it happens during boot when the system is otherwise idle and
> > waiting on page init to finish.
> > 
> > This has been tested on a variety of x86 systems and speeds up kernel
> > boot by 6% to 49% by making deferred init 63% to 91% faster.  Patch 6
> > has detailed numbers.  Test results from other systems appreciated.
> > 
> > This series is based on v5.6 plus these three from mmotm:
> > 
> >   mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
> >   mm-initialize-deferred-pages-with-interrupts-enabled.patch
> >   mm-call-cond_resched-from-deferred_init_memmap.patch
> > 
> > All of the above can be found in this branch:
> > 
> >   git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
> >   
> > https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1
> 
> For the series (and the three prerequisite patches):
> 
> Tested-by: Josh Triplett 

Appreciate the runs, Josh, thanks.


Re: [PATCH 0/7] padata: parallelize deferred page init

2020-04-30 Thread Daniel Jordan
On Thu, Apr 30, 2020 at 05:40:59PM -0400, Pavel Tatashin wrote:
> On Thu, Apr 30, 2020 at 5:31 PM Andrew Morton  
> wrote:
> > On Thu, 30 Apr 2020 16:11:18 -0400 Daniel Jordan 
> >  wrote:
> >
> > > Sometimes the kernel doesn't take full advantage of system memory
> > > bandwidth, leading to a single CPU spending excessive time in
> > > initialization paths where the data scales with memory size.
> > >
> > > Multithreading naturally addresses this problem, and this series is the
> > > first step.
> > >
> > > It extends padata, a framework that handles many parallel singlethreaded
> > > jobs, to handle multithreaded jobs as well by adding support for
> > > splitting up the work evenly, specifying a minimum amount of work that's
> > > appropriate for one helper thread to do, load balancing between helpers,
> > > and coordinating them.  More documentation in patches 4 and 7.
> > >
> > > The first user is deferred struct page init, a large bottleneck in
> > > kernel boot--actually the largest for us and likely others too.  This
> > > path doesn't require concurrency limits, resource control, or priority
> > > adjustments like future users will (vfio, hugetlb fallocate, munmap)
> > > because it happens during boot when the system is otherwise idle and
> > > waiting on page init to finish.
> > >
> > > This has been tested on a variety of x86 systems and speeds up kernel
> > > boot by 6% to 49% by making deferred init 63% to 91% faster.
> >
> > How long is this up-to-91% in seconds?  If it's 91% of a millisecond
> > then not impressed.  If it's 91% of two weeks then better :)

The largest system I could test had 384G per node and saved 1.5 out of 4
seconds.

> > Relatedly, how important is boot time on these large machines anyway?
> > They presumably have lengthy uptimes so boot time is relatively
> > unimportant?
> 
> Large machines indeed have a lengthy uptime, but they also can host a
> large number of VMs meaning that downtime of the host increases the
> downtime of VMs in cloud environments. Some VMs might be very sensitive
> to downtime: game servers, traders, etc.
>
> > IOW, can you please explain more fully why this patchset is valuable to
> > our users?

I'll let the users speak for themselves, but I have a similar use case to Pavel
of limiting the downtime of VMs running on these large systems, and spinning up
instances as fast as possible is also desirable for our cloud users.


Re: [PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-04-30 Thread Daniel Jordan
Hi Alex,

On Thu, Apr 30, 2020 at 02:43:28PM -0700, Alexander Duyck wrote:
> On 4/30/2020 1:11 PM, Daniel Jordan wrote:
> > padata will soon divide up pfn ranges between threads when parallelizing
> > deferred init, and deferred_init_maxorder() complicates that by using an
> > opaque index in addition to start and end pfns.  Move the index outside
> > the function to make splitting the job easier, and simplify the code
> > while at it.
> > 
> > deferred_init_maxorder() now always iterates within a single pfn range
> > instead of potentially multiple ranges, and advances start_pfn to the
> > end of that range instead of the max-order block so partial pfn ranges
> > in the block aren't skipped in a later iteration.  The section alignment
> > check in deferred_grow_zone() is removed as well since this alignment is
> > no longer guaranteed.  It's not clear what value the alignment provided
> > originally.
> > 
> > Signed-off-by: Daniel Jordan 
> 
> So part of the reason for splitting it up along section aligned boundaries
> was because we already had an existing functionality in deferred_grow_zone
> that was going in and pulling out a section aligned chunk and processing it
> to prepare enough memory for other threads to keep running. I suspect that
> the section alignment was done because normally I believe that is also the
> alignment for memory onlining.

I think Pavel added that functionality, maybe he could confirm.

My impression was that the reason deferred_grow_zone aligned the requested
order up to a section was to make enough memory available to avoid being called
on every allocation.
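
(The rounding in question is the existing

        nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);

which grows the zone by a whole section at a time rather than just 1 << order.)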

> With this already breaking things up over multiple threads how does this
> work with deferred_grow_zone? Which thread is it trying to allocate from if
> it needs to allocate some memory for itself?

I may not be following your question, but deferred_grow_zone doesn't allocate
memory during the multithreading in deferred_init_memmap because the latter
sets first_deferred_pfn so that deferred_grow_zone bails early.

> Also what is to prevent a worker from stop deferred_grow_zone from bailing
> out in the middle of a max order page block if there is a hole in the middle
> of the block?

deferred_grow_zone remains singlethreaded.  It could stop in the middle of a
max order block, but it can't run concurrently with deferred_init_memmap, as
per above, so if deferred_init_memmap were to init 'n free the remaining part
of the block, the previous portion would have already been initialized.
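
Roughly, the interlock being relied on looks like this (field names follow the
existing code; the statements are a sketch, not the actual diff):

        /* deferred_init_memmap(), before any multithreading starts */
        pgdat_resize_lock(pgdat, &flags);
        pgdat->first_deferred_pfn = ULONG_MAX;  /* node is claimed */
        pgdat_resize_unlock(pgdat, &flags);

        /* deferred_grow_zone() */
        if (first_deferred_pfn == ULONG_MAX)
                return false;   /* init already running or finished */

so once deferred_init_memmap() has started on a node, deferred_grow_zone()
doesn't touch that zone again.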


[PATCH 3/7] padata: allocate work structures for parallel jobs from a pool

2020-04-30 Thread Daniel Jordan
padata allocates per-CPU, per-instance work structs for parallel jobs.
A do_parallel call assigns a job to a sequence number and hashes the
number to a CPU, where the job will eventually run using the
corresponding work.

This approach fit with how padata used to bind a job to each CPU
round-robin, but it makes less sense after commit bfde23ce200e6 ("padata:
unbind parallel jobs from specific CPUs") because a work isn't bound to
a particular CPU anymore, and it isn't needed at all for multithreaded
jobs because they don't have sequence numbers.

Replace the per-CPU works with a preallocated pool, which allows sharing
them between existing padata users and the upcoming multithreaded user.
The pool will also facilitate setting NUMA-aware concurrency limits with
later users.

The pool is sized according to the number of possible CPUs.  With this
limit, MAX_OBJ_NUM no longer makes sense, so remove it.

If the global pool is exhausted, a parallel job is run in the current
task instead to throttle a system trying to do too much in parallel.

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |   8 +--
 kernel/padata.c| 118 +++--
 2 files changed, 78 insertions(+), 48 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 476ecfa41f363..3bfa503503ac5 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -24,7 +24,6 @@
  * @list: List entry, to attach to the padata lists.
  * @pd: Pointer to the internal control structure.
  * @cb_cpu: Callback cpu for serializatioon.
- * @cpu: Cpu for parallelization.
  * @seq_nr: Sequence number of the parallelized data object.
  * @info: Used to pass information from the parallel to the serial function.
  * @parallel: Parallel execution function.
@@ -34,7 +33,6 @@ struct padata_priv {
struct list_headlist;
struct parallel_data*pd;
int cb_cpu;
-   int cpu;
unsigned intseq_nr;
int info;
void(*parallel)(struct padata_priv *padata);
@@ -68,15 +66,11 @@ struct padata_serial_queue {
 /**
  * struct padata_parallel_queue - The percpu padata parallel queue
  *
- * @parallel: List to wait for parallelization.
  * @reorder: List to wait for reordering after parallel processing.
- * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
  */
 struct padata_parallel_queue {
-   struct padata_listparallel;
struct padata_listreorder;
-   struct work_structwork;
atomic_t  num_obj;
 };
 
@@ -111,7 +105,7 @@ struct parallel_data {
struct padata_parallel_queue__percpu *pqueue;
struct padata_serial_queue  __percpu *squeue;
atomic_trefcnt;
-   atomic_tseq_nr;
+   unsigned intseq_nr;
unsigned intprocessed;
int cpu;
struct padata_cpumask   cpumask;
diff --git a/kernel/padata.c b/kernel/padata.c
index b05cd30f8905b..edd3ff551e262 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -32,7 +32,15 @@
 #include 
 #include 
 
-#define MAX_OBJ_NUM 1000
+struct padata_work {
+   struct work_struct  pw_work;
+   struct list_headpw_list;  /* padata_free_works linkage */
+   void*pw_data;
+};
+
+static DEFINE_SPINLOCK(padata_works_lock);
+static struct padata_work *padata_works;
+static LIST_HEAD(padata_free_works);
 
 static void padata_free_pd(struct parallel_data *pd);
 
@@ -58,30 +66,44 @@ static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
return padata_index_to_cpu(pd, cpu_index);
 }
 
-static void padata_parallel_worker(struct work_struct *parallel_work)
+static struct padata_work *padata_work_alloc(void)
 {
-   struct padata_parallel_queue *pqueue;
-   LIST_HEAD(local_list);
+   struct padata_work *pw;
 
-   local_bh_disable();
-   pqueue = container_of(parallel_work,
- struct padata_parallel_queue, work);
+   lockdep_assert_held(&padata_works_lock);
 
-   spin_lock(&pqueue->parallel.lock);
-   list_replace_init(&pqueue->parallel.list, &local_list);
-   spin_unlock(&pqueue->parallel.lock);
+   if (list_empty(&padata_free_works))
+   return NULL;/* No more work items allowed to be queued. */
 
-   while (!list_empty(&local_list)) {
-   struct padata_priv *padata;
+   pw = list_first_entry(&padata_free_works, struct padata_work, pw_list);
+   list_del(&pw->pw_list);
+   return pw;
+}
 
-   padata = list_entry(local_list.next,
-   struct padata_priv, list);
+static void padata_work_init(struc

[PATCH 1/7] padata: remove exit routine

2020-04-30 Thread Daniel Jordan
padata_driver_exit() is unnecessary because padata isn't built as a
module and doesn't exit.

padata's init routine will soon allocate memory, so getting rid of the
exit function now avoids pointless code to free it.

Signed-off-by: Daniel Jordan 
---
 kernel/padata.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/kernel/padata.c b/kernel/padata.c
index 72777c10bb9cb..36a8e98741bb3 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -1071,10 +1071,4 @@ static __init int padata_driver_init(void)
 }
 module_init(padata_driver_init);
 
-static __exit void padata_driver_exit(void)
-{
-   cpuhp_remove_multi_state(CPUHP_PADATA_DEAD);
-   cpuhp_remove_multi_state(hp_online);
-}
-module_exit(padata_driver_exit);
 #endif
-- 
2.26.2



[PATCH 4/7] padata: add basic support for multithreaded jobs

2020-04-30 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem.

Extend padata, a framework that handles many parallel yet singlethreaded
jobs, to also handle multithreaded jobs by adding support for splitting
up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.

This is inspired by work from Pavel Tatashin and Steve Sistare.

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |  29 
 kernel/padata.c| 152 -
 2 files changed, 178 insertions(+), 3 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 3bfa503503ac5..b0affa466a841 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -4,6 +4,9 @@
  *
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
+ *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
  */
 
 #ifndef PADATA_H
@@ -130,6 +133,31 @@ struct padata_shell {
struct list_headlist;
 };
 
+/**
+ * struct padata_mt_job - represents one multithreaded job
+ *
+ * @thread_fn: Called for each chunk of work that a padata thread does.
+ * @fn_arg: The thread function argument.
+ * @start: The start of the job (units are job-specific).
+ * @size: size of this node's work (units are job-specific).
+ * @align: Ranges passed to the thread function fall on this boundary, with the
+ * possible exceptions of the beginning and end of the job.
+ * @min_chunk: The minimum chunk size in job-specific units.  This allows
+ * the client to communicate the minimum amount of work that's
+ * appropriate for one worker thread to do at once.
+ * @max_threads: Max threads to use for the job, actual number may be less
+ *   depending on task size and minimum chunk size.
+ */
+struct padata_mt_job {
+   void (*thread_fn)(unsigned long start, unsigned long end, void *arg);
+   void*fn_arg;
+   unsigned long   start;
+   unsigned long   size;
+   unsigned long   align;
+   unsigned long   min_chunk;
+   int max_threads;
+};
+
 /**
  * struct padata_instance - The overall control structure.
  *
@@ -171,6 +199,7 @@ extern void padata_free_shell(struct padata_shell *ps);
 extern int padata_do_parallel(struct padata_shell *ps,
  struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
+extern void __init padata_do_multithreaded(struct padata_mt_job *job);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
 extern int padata_start(struct padata_instance *pinst);
diff --git a/kernel/padata.c b/kernel/padata.c
index edd3ff551e262..ccb617d37677a 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -7,6 +7,9 @@
  * Copyright (C) 2008, 2009 secunet Security Networks AG
  * Copyright (C) 2008, 2009 Steffen Klassert 
  *
+ * Copyright (c) 2020 Oracle and/or its affiliates.
+ * Author: Daniel Jordan 
+ *
  * This program is free software; you can redistribute it and/or modify it
  * under the terms and conditions of the GNU General Public License,
  * version 2, as published by the Free Software Foundation.
@@ -21,6 +24,7 @@
  * 51 Franklin St - Fifth Floor, Boston, MA 02110-1301 USA.
  */
 
+#include 
 #include 
 #include 
 #include 
@@ -32,6 +36,8 @@
 #include 
 #include 
 
+#define PADATA_WORK_ONSTACK 1   /* Work's memory is on stack */
+
 struct padata_work {
struct work_struct  pw_work;
struct list_headpw_list;  /* padata_free_works linkage */
@@ -42,7 +48,17 @@ static DEFINE_SPINLOCK(padata_works_lock);
 static struct padata_work *padata_works;
 static LIST_HEAD(padata_free_works);
 
+struct padata_mt_job_state {
+   spinlock_t  lock;
+   struct completion   completion;
+   struct padata_mt_job*job;
+   int nworks;
+   int nworks_fini;
+   unsigned long   chunk_size;
+};
+
 static void padata_free_pd(struct parallel_data *pd);
+static void __init padata_mt_helper(struct work_struct *work);
 
 static int padata_index_to_cpu(struct parallel_data *pd, int cpu_index)
 {
@@ -81,18 +97,56 @@ static struct padata_work *padata_work_alloc(void)
 }
 
 static void padata_work_init(struct padata_work *pw, work_func_t work_fn,
-void *data)
+void *data, int flags)
 {
-   INIT_WORK(&pw->pw_work, work_fn);
+   if (flags & PADATA_WORK_ONSTACK)
+

[PATCH 5/7] mm: move zone iterator outside of deferred_init_maxorder()

2020-04-30 Thread Daniel Jordan
padata will soon divide up pfn ranges between threads when parallelizing
deferred init, and deferred_init_maxorder() complicates that by using an
opaque index in addition to start and end pfns.  Move the index outside
the function to make splitting the job easier, and simplify the code
while at it.

deferred_init_maxorder() now always iterates within a single pfn range
instead of potentially multiple ranges, and advances start_pfn to the
end of that range instead of the max-order block so partial pfn ranges
in the block aren't skipped in a later iteration.  The section alignment
check in deferred_grow_zone() is removed as well since this alignment is
no longer guaranteed.  It's not clear what value the alignment provided
originally.

Signed-off-by: Daniel Jordan 
---
 mm/page_alloc.c | 88 +++--
 1 file changed, 27 insertions(+), 61 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 68669d3a5a665..990514d8f0d94 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1708,55 +1708,23 @@ deferred_init_mem_pfn_range_in_zone(u64 *i, struct zone *zone,
 }
 
 /*
- * Initialize and free pages. We do it in two loops: first we initialize
- * struct page, then free to buddy allocator, because while we are
- * freeing pages we can access pages that are ahead (computing buddy
- * page in __free_one_page()).
- *
- * In order to try and keep some memory in the cache we have the loop
- * broken along max page order boundaries. This way we will not cause
- * any issues with the buddy page computation.
+ * Initialize the struct pages and then free them to the buddy allocator at
+ * most a max order block at a time because while we are freeing pages we can
+ * access pages that are ahead (computing buddy page in __free_one_page()).
+ * It's also cache friendly.
  */
 static unsigned long __init
-deferred_init_maxorder(u64 *i, struct zone *zone, unsigned long *start_pfn,
-  unsigned long *end_pfn)
+deferred_init_maxorder(struct zone *zone, unsigned long *start_pfn,
+  unsigned long end_pfn)
 {
-   unsigned long mo_pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
-   unsigned long spfn = *start_pfn, epfn = *end_pfn;
-   unsigned long nr_pages = 0;
-   u64 j = *i;
-
-   /* First we loop through and initialize the page values */
-   for_each_free_mem_pfn_range_in_zone_from(j, zone, start_pfn, end_pfn) {
-   unsigned long t;
-
-   if (mo_pfn <= *start_pfn)
-   break;
-
-   t = min(mo_pfn, *end_pfn);
-   nr_pages += deferred_init_pages(zone, *start_pfn, t);
-
-   if (mo_pfn < *end_pfn) {
-   *start_pfn = mo_pfn;
-   break;
-   }
-   }
-
-   /* Reset values and now loop through freeing pages as needed */
-   swap(j, *i);
-
-   for_each_free_mem_pfn_range_in_zone_from(j, zone, &spfn, &epfn) {
-   unsigned long t;
-
-   if (mo_pfn <= spfn)
-   break;
+   unsigned long nr_pages, pfn;
 
-   t = min(mo_pfn, epfn);
-   deferred_free_pages(spfn, t);
+   pfn = ALIGN(*start_pfn + 1, MAX_ORDER_NR_PAGES);
+   pfn = min(pfn, end_pfn);
 
-   if (mo_pfn <= epfn)
-   break;
-   }
+   nr_pages = deferred_init_pages(zone, *start_pfn, pfn);
+   deferred_free_pages(*start_pfn, pfn);
+   *start_pfn = pfn;
 
return nr_pages;
 }
@@ -1814,9 +1782,11 @@ static int __init deferred_init_memmap(void *data)
 * that we can avoid introducing any issues with the buddy
 * allocator.
 */
-   while (spfn < epfn) {
-   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
-   cond_resched();
+   for_each_free_mem_pfn_range_in_zone_from(i, zone, &spfn, &epfn) {
+   while (spfn < epfn) {
+   nr_pages += deferred_init_maxorder(zone, &spfn, epfn);
+   cond_resched();
+   }
}
 zone_empty:
/* Sanity check that the next zone really is unpopulated */
@@ -1883,22 +1853,18 @@ deferred_grow_zone(struct zone *zone, unsigned int order)
 * that we can avoid introducing any issues with the buddy
 * allocator.
 */
-   while (spfn < epfn) {
-   /* update our first deferred PFN for this section */
-   first_deferred_pfn = spfn;
-
-   nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
-   touch_nmi_watchdog();
-
-   /* We should only stop along section boundaries */
-   if ((first_deferred_pfn ^ spfn) < PAGES_PER_SECTION)
-   continue;
-
-   /* If our quota has been met we can stop her

[PATCH 0/7] padata: parallelize deferred page init

2020-04-30 Thread Daniel Jordan
Sometimes the kernel doesn't take full advantage of system memory
bandwidth, leading to a single CPU spending excessive time in
initialization paths where the data scales with memory size.

Multithreading naturally addresses this problem, and this series is the
first step.

It extends padata, a framework that handles many parallel singlethreaded
jobs, to handle multithreaded jobs as well by adding support for
splitting up the work evenly, specifying a minimum amount of work that's
appropriate for one helper thread to do, load balancing between helpers,
and coordinating them.  More documentation in patches 4 and 7.

The first user is deferred struct page init, a large bottleneck in
kernel boot--actually the largest for us and likely others too.  This
path doesn't require concurrency limits, resource control, or priority
adjustments like future users will (vfio, hugetlb fallocate, munmap)
because it happens during boot when the system is otherwise idle and
waiting on page init to finish.

This has been tested on a variety of x86 systems and speeds up kernel
boot by 6% to 49% by making deferred init 63% to 91% faster.  Patch 6
has detailed numbers.  Test results from other systems appreciated.

This series is based on v5.6 plus these three from mmotm:

  mm-call-touch_nmi_watchdog-on-max-order-boundaries-in-deferred-init.patch
  mm-initialize-deferred-pages-with-interrupts-enabled.patch
  mm-call-cond_resched-from-deferred_init_memmap.patch

All of the above can be found in this branch:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-definit-v1
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-definit-v1

The future users and related features are available as work-in-progress
here:

  git://oss.oracle.com/git/linux-dmjordan.git padata-mt-wip-v0.3
  
https://oss.oracle.com/git/gitweb.cgi?p=linux-dmjordan.git;a=shortlog;h=refs/heads/padata-mt-wip-v0.3

Thanks to everyone who commented on the last version of this[0],
including Alex Williamson, Jason Gunthorpe, Jonathan Corbet, Michal
Hocko, Pavel Machek, Peter Zijlstra, Randy Dunlap, Robert Elliott, Tejun
Heo, and Zi Yan.

RFC v4 -> padata v1:
 - merged with padata (Peter)
 - got rid of the 'task' nomenclature (Peter, Jon)

future work branch:
 - made lockdep-aware (Jason, Peter)
 - adjust workqueue worker priority with renice_or_cancel() (Tejun)
 - fixed undo problem in VFIO (Alex)

The remaining feedback, mainly resource control awareness (cgroup etc),
is TODO for later series.

[0] 
https://lore.kernel.org/linux-mm/20181105165558.11698-1-daniel.m.jor...@oracle.com/

Daniel Jordan (7):
  padata: remove exit routine
  padata: initialize earlier
  padata: allocate work structures for parallel jobs from a pool
  padata: add basic support for multithreaded jobs
  mm: move zone iterator outside of deferred_init_maxorder()
  mm: parallelize deferred_init_memmap()
  padata: document multithreaded jobs

 Documentation/core-api/padata.rst |  41 +++--
 include/linux/padata.h|  43 -
 init/main.c   |   2 +
 kernel/padata.c   | 277 --
 mm/Kconfig|   6 +-
 mm/page_alloc.c   | 118 ++---
 6 files changed, 355 insertions(+), 132 deletions(-)


base-commit: 7111951b8d4973bda27ff663f2cf18b663d15b48
prerequisite-patch-id: 4ad522141e1119a325a9799dad2bd982fbac8b7c
prerequisite-patch-id: 169273327e56f5461101a71dfbd6b4cfd4570cf0
prerequisite-patch-id: 0f34692c8a9673d4c4f6a3545cf8ec3a2abf8620
-- 
2.26.2



[PATCH 7/7] padata: document multithreaded jobs

2020-04-30 Thread Daniel Jordan
Add Documentation for multithreaded jobs.

Signed-off-by: Daniel Jordan 
---
 Documentation/core-api/padata.rst | 41 +++
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/Documentation/core-api/padata.rst b/Documentation/core-api/padata.rst
index 9a24c111781d9..b7e047af993e8 100644
--- a/Documentation/core-api/padata.rst
+++ b/Documentation/core-api/padata.rst
@@ -4,23 +4,26 @@
 The padata parallel execution mechanism
 ===
 
-:Date: December 2019
+:Date: April 2020
 
 Padata is a mechanism by which the kernel can farm jobs out to be done in
-parallel on multiple CPUs while retaining their ordering.  It was developed for
-use with the IPsec code, which needs to be able to perform encryption and
-decryption on large numbers of packets without reordering those packets.  The
-crypto developers made a point of writing padata in a sufficiently general
-fashion that it could be put to other uses as well.
+parallel on multiple CPUs while optionally retaining their ordering.
 
-Usage
-=
+It was originally developed for IPsec, which needs to perform encryption and
+decryption on large numbers of packets without reordering those packets.  This
+is currently the sole consumer of padata's serialized job support.
+
+Padata also supports multithreaded jobs, splitting up the job evenly while load
+balancing and coordinating between threads.
+
+Running Serialized Jobs
+=======================
 
 Initializing
 
 
-The first step in using padata is to set up a padata_instance structure for
-overall control of how jobs are to be run::
+The first step in using padata to run parallel jobs is to set up a
+padata_instance structure for overall control of how jobs are to be run::
 
 #include 
 
@@ -162,6 +165,24 @@ functions that correspond to the allocation in reverse::
 It is the user's responsibility to ensure all outstanding jobs are complete
 before any of the above are called.
 
+Running Multithreaded Jobs
+==========================
+
+A multithreaded job has a main thread and zero or more helper threads, with the
+main thread participating in the job and then waiting until all helpers have
+finished.  padata splits the job into units called chunks, where a chunk is a
+piece of the job that one thread completes in one call to the thread function.
+
+A user has to do three things to run a multithreaded job.  First, describe the
+job by defining a padata_mt_job structure, which is explained in the Interface
+section.  This includes a pointer to the thread function, which padata will
+call each time it assigns a job chunk to a thread.  Then, define the thread
+function, which accepts three arguments, ``start``, ``end``, and ``arg``, where
+the first two delimit the range that the thread operates on and the last is a
+pointer to the job's shared state, if any.  Prepare the shared state, which is
+typically a stack-allocated structure that wraps the required data.  Last, call
+padata_do_multithreaded(), which will return once the job is finished.
+
 Interface
 =
 
-- 
2.26.2
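
To make the three steps described in the documentation above concrete, a
minimal caller could look like the sketch below.  The names my_state,
my_thread_fn and my_run_job are purely illustrative; only the padata pieces
come from this series:

        struct my_state {
                atomic_long_t nr_done;          /* shared between helpers */
        };

        static void __init my_thread_fn(unsigned long start, unsigned long end,
                                        void *arg)
        {
                struct my_state *state = arg;

                /* do the work for [start, end) and record how much was done */
                atomic_long_add(end - start, &state->nr_done);
        }

        static void __init my_run_job(unsigned long start, unsigned long size)
        {
                struct my_state state = { .nr_done = ATOMIC_LONG_INIT(0) };
                struct padata_mt_job job = {
                        .thread_fn   = my_thread_fn,
                        .fn_arg      = &state,
                        .start       = start,
                        .size        = size,
                        .align       = 1,       /* no alignment requirement */
                        .min_chunk   = 1024,    /* smallest piece worth a thread */
                        .max_threads = num_online_cpus(),
                };

                /* returns once the main thread and all helpers have finished */
                padata_do_multithreaded(&job);
        }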



[PATCH 6/7] mm: parallelize deferred_init_memmap()

2020-04-30 Thread Daniel Jordan
Deferred struct page init uses one thread per node, which is a
significant bottleneck at boot for big machines--often the largest.
Parallelize to reduce system downtime.

The maximum number of threads is capped at the number of CPUs on the
node because speedups always improve with additional threads on every
system tested, and at this phase of boot, the system is otherwise idle
and waiting on page init to finish.

Helper threads operate on MAX_ORDER_NR_PAGES-aligned ranges to avoid
accessing uninitialized buddy pages, so set the job's alignment
accordingly.

The minimum chunk size is also MAX_ORDER_NR_PAGES because there was
benefit to using multiple threads even on relatively small memory (1G)
systems.

Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
  2 nodes * 26 cores * 2 threads = 104 CPUs
  384G/node = 768G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --   4056.7 (  5.5)          --   1763.3 (  4.2)
 test     39.9%   2436.7 (  2.1)       91.8%    144.3 (  5.9)

Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, bare metal)
  1 node * 16 cores * 2 threads = 32 CPUs
  192G/node = 192G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --   1957.3 ( 14.0)          --   1093.7 ( 12.9)
 test     49.1%    996.0 (  7.2)       88.4%    127.3 (  5.1)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
  2 nodes * 18 cores * 2 threads = 72 CPUs
  128G/node = 256G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --   1666.0 (  3.5)          --    618.0 (  3.5)
 test     31.3%   1145.3 (  1.5)       85.6%     89.0 (  1.7)

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 8 cores * 2 threads = 16 CPUs
  64G/node = 64G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --   1029.7 ( 42.3)          --    253.7 (  3.1)
 test     23.3%    789.3 ( 15.0)       76.3%     60.0 (  5.6)

Server-oriented distros that enable deferred page init sometimes run in
small VMs, and they still benefit even though the fraction of boot time
saved is smaller:

AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  16G/node = 16G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --    757.7 ( 17.1)          --     57.0 (  0.0)
 test      6.2%    710.3 ( 15.0)       63.2%     21.0 (  0.0)

Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
  1 node * 2 cores * 2 threads = 4 CPUs
  14G/node = 14G memory

             kernel boot                  deferred init
        ------------------------     ------------------------
        speedup  time_ms (stdev)     speedup  time_ms (stdev)
 base        --    656.3 (  7.1)          --     57.3 (  1.5)
 test      8.6%    599.7 (  5.9)       62.8%     21.3 (  1.2)

Signed-off-by: Daniel Jordan 
---
 mm/Kconfig  |  6 +++---
 mm/page_alloc.c | 46 ++
 2 files changed, 41 insertions(+), 11 deletions(-)

diff --git a/mm/Kconfig b/mm/Kconfig
index ab80933be65ff..e5007206c7601 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -622,13 +622,13 @@ config DEFERRED_STRUCT_PAGE_INIT
depends on SPARSEMEM
depends on !NEED_PER_CPU_KM
depends on 64BIT
+   select PADATA
help
  Ordinarily all struct pages are initialised during early boot in a
  single thread. On very large machines this can take a considerable
  amount of time. If this option is set, large machines will bring up
- a subset of memmap at boot and then initialise the rest in parallel
- by starting one-off "pgdatinitX" kernel thread for each node X. This
- has a potential performance impact on processes running early in the
+ a subset of memmap at boot and then initialise the rest in parallel.
+ This has a potential performance impact on tasks running early in the
  lifetime of the system until these kthr

[PATCH 2/7] padata: initialize earlier

2020-04-30 Thread Daniel Jordan
padata will soon initialize the system's struct pages in parallel, so it
needs to be ready by page_alloc_init_late().

The error return from padata_driver_init() triggers an initcall warning,
so add a warning to padata_init() to avoid silent failure.

Signed-off-by: Daniel Jordan 
---
 include/linux/padata.h |  6 ++
 init/main.c|  2 ++
 kernel/padata.c| 17 -
 3 files changed, 16 insertions(+), 9 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index a0d8b41850b25..476ecfa41f363 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -164,6 +164,12 @@ struct padata_instance {
 #define PADATA_INVALID  4
 };
 
+#ifdef CONFIG_PADATA
+extern void __init padata_init(void);
+#else
+static inline void __init padata_init(void) {}
+#endif
+
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern struct padata_shell *padata_alloc_shell(struct padata_instance *pinst);
diff --git a/init/main.c b/init/main.c
index ee4947af823f3..5451a80e43016 100644
--- a/init/main.c
+++ b/init/main.c
@@ -94,6 +94,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -1438,6 +1439,7 @@ static noinline void __init kernel_init_freeable(void)
smp_init();
sched_init_smp();
 
+   padata_init();
page_alloc_init_late();
/* Initialize page ext after all struct pages are initialized. */
page_ext_init();
diff --git a/kernel/padata.c b/kernel/padata.c
index 36a8e98741bb3..b05cd30f8905b 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -31,7 +31,6 @@
 #include 
 #include 
 #include 
-#include 
 
 #define MAX_OBJ_NUM 1000
 
@@ -1049,26 +1048,26 @@ void padata_free_shell(struct padata_shell *ps)
 }
 EXPORT_SYMBOL(padata_free_shell);
 
-#ifdef CONFIG_HOTPLUG_CPU
-
-static __init int padata_driver_init(void)
+void __init padata_init(void)
 {
+#ifdef CONFIG_HOTPLUG_CPU
int ret;
 
ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN, "padata:online",
  padata_cpu_online, NULL);
if (ret < 0)
-   return ret;
+   goto err;
hp_online = ret;
 
ret = cpuhp_setup_state_multi(CPUHP_PADATA_DEAD, "padata:dead",
  NULL, padata_cpu_dead);
if (ret < 0) {
cpuhp_remove_multi_state(hp_online);
-   return ret;
+   goto err;
}
-   return 0;
-}
-module_init(padata_driver_init);
 
+   return;
+err:
+   pr_warn("padata: initialization failed\n");
 #endif
+}
-- 
2.26.2



Re: [PATCH] hwrng: omap - Fix RNG wait loop timeout

2019-10-14 Thread Daniel Thompson
On Mon, Oct 14, 2019 at 05:32:45PM +0530, Sumit Garg wrote:
> Existing RNG data read timeout is 200us but it doesn't cover EIP76 RNG
> data rate which takes approx. 700us to produce 16 bytes of output data
> as per testing results. So configure the timeout as 1000us to also take
> account of lack of udelay()'s reliability.

What "lack of udelay()'s reliability" are you concerned about?


Daniel.

> 
> Fixes: 383212425c92 ("hwrng: omap - Add device variant for SafeXcel IP-76 
> found in Armada 8K")
> Cc: 
> Signed-off-by: Sumit Garg 
> ---
>  drivers/char/hw_random/omap-rng.c | 9 -
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/char/hw_random/omap-rng.c 
> b/drivers/char/hw_random/omap-rng.c
> index b27f396..e329f82 100644
> --- a/drivers/char/hw_random/omap-rng.c
> +++ b/drivers/char/hw_random/omap-rng.c
> @@ -66,6 +66,13 @@
>  #define OMAP4_RNG_OUTPUT_SIZE 0x8
>  #define EIP76_RNG_OUTPUT_SIZE 0x10
>  
> +/*
> + * EIP76 RNG takes approx. 700us to produce 16 bytes of output data
> + * as per testing results. And to account for the lack of udelay()'s
> + * reliability, we keep the timeout as 1000us.
> + */
> +#define RNG_DATA_FILL_TIMEOUT 100
> +
>  enum {
>   RNG_OUTPUT_0_REG = 0,
>   RNG_OUTPUT_1_REG,
> @@ -176,7 +183,7 @@ static int omap_rng_do_read(struct hwrng *rng, void 
> *data, size_t max,
>   if (max < priv->pdata->data_size)
>   return 0;
>  
> - for (i = 0; i < 20; i++) {
> + for (i = 0; i < RNG_DATA_FILL_TIMEOUT; i++) {
>   present = priv->pdata->data_present(priv);
>   if (present || !wait)
>   break;
> -- 
> 2.7.4
> 


Re: [PATCH v2 1/5] padata: make flushing work with async users

2019-09-18 Thread Daniel Jordan
On Thu, Sep 05, 2019 at 06:37:56PM -0400, Daniel Jordan wrote:
> On Thu, Sep 05, 2019 at 02:17:35PM +1000, Herbert Xu wrote:
> > I don't think waiting is an option.  In a pathological case the
> > hardware may not return at all.  We cannot and should not hold off
> > CPU hotplug for an arbitrary amount of time when the event we are
> > waiting for isn't even occurring on that CPU.
> 
> Ok, I hadn't considered hardware not returning.
> 
> > I don't think flushing is needed at all.  All we need to do is
> > maintain consistency before and after the CPU hotplug event.
> 
> I could imagine not flushing would work for replacing a pd.  The old pd could
> be freed by whatever drops the last reference and the new pd could be
> installed, all without flushing.
> 
> In the case of freeing an instance, though, padata needs to wait for all the
> jobs to complete so they don't use the instance's data after it's been freed.
> Holding the CPU hotplug lock isn't necessary for this, though, so I think 
> we're
> ok to wait here.

[FYI, I'm currently on leave until mid-October and will return to this series
then.]


Re: [PATCH v3 2/2] hwrng: npcm: add NPCM RNG driver

2019-09-13 Thread Daniel Thompson
On Thu, Sep 12, 2019 at 12:01:49PM +0300, Tomer Maimon wrote:
> Add Nuvoton NPCM BMC Random Number Generator(RNG) driver.
> 
> Signed-off-by: Tomer Maimon 

Reviewed-by: Daniel Thompson 

Note, you are welcome to preseve this if you have to respin and
change directory based on Vinod's feedback...


> ---
>  drivers/char/hw_random/Kconfig|  13 +++
>  drivers/char/hw_random/Makefile   |   1 +
>  drivers/char/hw_random/npcm-rng.c | 186 ++
>  3 files changed, 200 insertions(+)
>  create mode 100644 drivers/char/hw_random/npcm-rng.c
> 
> diff --git a/drivers/char/hw_random/Kconfig b/drivers/char/hw_random/Kconfig
> index 59f25286befe..87a1c30e7958 100644
> --- a/drivers/char/hw_random/Kconfig
> +++ b/drivers/char/hw_random/Kconfig
> @@ -440,6 +440,19 @@ config HW_RANDOM_OPTEE
>  
> If unsure, say Y.
>  
> +config HW_RANDOM_NPCM
> + tristate "NPCM Random Number Generator support"
> + depends on ARCH_NPCM || COMPILE_TEST
> + default HW_RANDOM
> + help
> +   This driver provides support for the Random Number
> +   Generator hardware available in Nuvoton NPCM SoCs.
> +
> +   To compile this driver as a module, choose M here: the
> +   module will be called npcm-rng.
> +
> +   If unsure, say Y.
> +
>  endif # HW_RANDOM
>  
>  config UML_RANDOM
> diff --git a/drivers/char/hw_random/Makefile b/drivers/char/hw_random/Makefile
> index 7c9ef4a7667f..17b6d4e6d591 100644
> --- a/drivers/char/hw_random/Makefile
> +++ b/drivers/char/hw_random/Makefile
> @@ -39,3 +39,4 @@ obj-$(CONFIG_HW_RANDOM_MTK) += mtk-rng.o
>  obj-$(CONFIG_HW_RANDOM_S390) += s390-trng.o
>  obj-$(CONFIG_HW_RANDOM_KEYSTONE) += ks-sa-rng.o
>  obj-$(CONFIG_HW_RANDOM_OPTEE) += optee-rng.o
> +obj-$(CONFIG_HW_RANDOM_NPCM) += npcm-rng.o
> diff --git a/drivers/char/hw_random/npcm-rng.c 
> b/drivers/char/hw_random/npcm-rng.c
> new file mode 100644
> index ..b7c8c7e13a49
> --- /dev/null
> +++ b/drivers/char/hw_random/npcm-rng.c
> @@ -0,0 +1,186 @@
> +// SPDX-License-Identifier: GPL-2.0
> +// Copyright (c) 2019 Nuvoton Technology corporation.
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +#define NPCM_RNGCS_REG   0x00 /* Control and status register */
> +#define NPCM_RNGD_REG    0x04 /* Data register */
> +#define NPCM_RNGMODE_REG 0x08/* Mode register */
> +
> +#define NPCM_RNG_CLK_SET_25MHZ   GENMASK(4, 3) /* 20-25 MHz */
> +#define NPCM_RNG_DATA_VALID  BIT(1)
> +#define NPCM_RNG_ENABLE  BIT(0)
> +#define NPCM_RNG_M1ROSEL BIT(1)
> +
> +#define NPCM_RNG_TIMEOUT_USEC2
> +#define NPCM_RNG_POLL_USEC   1000
> +
> +#define to_npcm_rng(p)   container_of(p, struct npcm_rng, rng)
> +
> +struct npcm_rng {
> + void __iomem *base;
> + struct hwrng rng;
> +};
> +
> +static int npcm_rng_init(struct hwrng *rng)
> +{
> + struct npcm_rng *priv = to_npcm_rng(rng);
> +
> + writel(NPCM_RNG_CLK_SET_25MHZ | NPCM_RNG_ENABLE,
> +priv->base + NPCM_RNGCS_REG);
> +
> + return 0;
> +}
> +
> +static void npcm_rng_cleanup(struct hwrng *rng)
> +{
> + struct npcm_rng *priv = to_npcm_rng(rng);
> +
> + writel(NPCM_RNG_CLK_SET_25MHZ, priv->base + NPCM_RNGCS_REG);
> +}
> +
> +static int npcm_rng_read(struct hwrng *rng, void *buf, size_t max, bool wait)
> +{
> + struct npcm_rng *priv = to_npcm_rng(rng);
> + int retval = 0;
> + int ready;
> +
> + pm_runtime_get_sync((struct device *)priv->rng.priv);
> +
> + while (max >= sizeof(u32)) {
> + if (wait) {
> + if (readl_poll_timeout(priv->base + NPCM_RNGCS_REG,
> +ready,
> +ready & NPCM_RNG_DATA_VALID,
> +NPCM_RNG_POLL_USEC,
> +NPCM_RNG_TIMEOUT_USEC))
> + break;
> + } else {
> + if ((readl(priv->base + NPCM_RNGCS_REG) &
> + NPCM_RNG_DATA_VALID) == 0)
> + break;
> + }
> +
> + *(u32 *)buf = readl(priv->base + NPCM_RNGD_REG);
> + retval += sizeof(u32);
> + buf += sizeof(u32);
> + max -= sizeof(u32);
> + }
> +
> + pm_runtime_mark_last_busy((struct device *)priv->rng.priv);
> + pm_runtim

Re: [PATCH v2 2/2] hwrng: npcm: add NPCM RNG driver

2019-09-11 Thread Daniel Thompson
er(&pdev->dev, &priv->rng);
> >+if (ret) {
> >+dev_err(&pdev->dev, "Failed to register rng device: %d\n",
> >+ret);
> 
> need to disable if CONFIG_PM ?
> 
> >+return ret;
> >+}
> >+
> >+dev_set_drvdata(&pdev->dev, priv);
> 
> This should probably be before the register.
> 
> >+pm_runtime_set_autosuspend_delay(&pdev->dev, 100);
> 
> So every 100ms power off, and if userspace does a read we
> will poll every 1ms for upto 20ms.
> 
> If userspace says try once a second with -ENODELAY so no wait,
> it never gets data.

I didn't follow this.

In the time before the device is suspended it should have generated
data that can be sent to userspace. Provided the suspend delay is longer
than the time the hardware needs to refill its buffer, there won't
necessarily be performance problems because the device is "full" when
it is suspended.

Of course if the hardware loses state when it is suspended then the
driver would need extra code on the PM paths to preserve the data...


Daniel.


Re: [PATCH v2 1/2] dt-binding: hwrng: add NPCM RNG documentation

2019-09-10 Thread Daniel Thompson
On Tue, Sep 10, 2019 at 02:55:44PM +0300, Tomer Maimon wrote:
> Hi Daniel,
> 
> Sorry, I probably missed it; thanks a lot for your comment
> 
> On Tue, 10 Sep 2019 at 13:25, Daniel Thompson 
> wrote:
> 
> > On Mon, Sep 09, 2019 at 03:38:39PM +0300, Tomer Maimon wrote:
> > > Added device tree binding documentation for Nuvoton BMC
> > > NPCM Random Number Generator (RNG).
> > >
> > > Signed-off-by: Tomer Maimon 
> > > ---
> > >  .../bindings/rng/nuvoton,npcm-rng.txt   | 17 +
> > >  1 file changed, 17 insertions(+)
> > >  create mode 100644
> > Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> > >
> > > diff --git a/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> > b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> > > new file mode 100644
> > > index ..a697b4425fb3
> > > --- /dev/null
> > > +++ b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> > > @@ -0,0 +1,17 @@
> > > +NPCM SoC Random Number Generator
> > > +
> > > +Required properties:
> > > +- compatible  : "nuvoton,npcm750-rng" for the NPCM7XX BMC.
> > > +- reg : Specifies physical base address and size of the
> > registers.
> > > +
> > > +Optional property:
> > > +- quality : estimated number of bits of true entropy per 1024 bits
> > > + read from the rng.
> > > + If this property is not defined, it defaults to
> > 1000.
> >
> > There are pending unreplied review comments about this property (my own
> > as it happens):
> > https://patchwork.kernel.org/patch/9371/
> >
> > No, there aren't different SoCs.
> We checked the quality of the hwrng and the results we got are set as the
> default.
> We were asked by one of our clients to have a dynamic quality; they would
> like to be more strict when using the hwrng.
> Is it problematic to add it?

It's a slightly grey area but in general the role of devicetree is to
describe the hardware. This parameter is not doing that.

If you view the quality assessment of this RNG to be a user preference
it is better to set the quality to zero, which is what the vast majority of
hwrng devices do. When the driver sets the quality to zero then the
kernel does not stir the entropy pool automatically... instead it
relies on the userspace rngd to do that. If the user wants the kernel
to stir the pool automatically then the quality can be set using the
default_quality kernel parameter.
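
(For example, booting with something along the lines of

        rng_core.default_quality=1000

on the kernel command line -- the parameter name assumes the stock rng-core
module, and the value is only an illustration.)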


Daniel.

> 
> Having a controllable quality implies that the numeric quality of the
> peripheral changes when it is stamped out on different SoCs (otherwise
> the driver can confidently set the quality without needing any hint
> from the DT). Is that really true here?
> 
> 
> > Daniel.
> >
> > > +
> > > +Example:
> > > +
> > > +rng: rng@f000b000 {
> > > + compatible = "nuvoton,npcm750-rng";
> > > + reg = <0xf000b000 0x8>;
> > > +};
> > > --
> > > 2.18.0
> > >
> >


Re: [PATCH v1 2/2] hwrng: npcm: add NPCM RNG driver

2019-09-10 Thread Daniel Thompson
On Tue, Sep 10, 2019 at 01:52:35PM +0300, Tomer Maimon wrote:
> Hi Daniel,
> 
> Thanks for your prompt reply,
> 
> 
> 
> On Mon, 9 Sep 2019 at 18:10, Daniel Thompson 
> wrote:
> 
> > On Mon, Sep 09, 2019 at 05:31:30PM +0300, Tomer Maimon wrote:
> > > Hi Daniel,
> > >
> > > appreciate your comments and sorry for the late reply
> > >
> > > On Thu, 29 Aug 2019 at 13:47, Daniel Thompson <
> > daniel.thomp...@linaro.org>
> > > wrote:
> > >
> > > > On Wed, Aug 28, 2019 at 07:26:17PM +0300, Tomer Maimon wrote:
> > > > > Add Nuvoton NPCM BMC Random Number Generator(RNG) driver.
> > > > >
> > > > > Signed-off-by: Tomer Maimon 
> > > > > ---
> > > > >  drivers/char/hw_random/Kconfig|  13 ++
> > > > >  drivers/char/hw_random/Makefile   |   1 +
> > > > >  drivers/char/hw_random/npcm-rng.c | 207
> > ++
> > > > >  3 files changed, 221 insertions(+)
> > > > >  create mode 100644 drivers/char/hw_random/npcm-rng.c
> > > > >
> > > > > diff --git a/drivers/char/hw_random/npcm-rng.c
> > > > b/drivers/char/hw_random/npcm-rng.c
> > > > > new file mode 100644
> > > > > index ..5b4b1b6cb362
> > > > > --- /dev/null
> > > > > +++ b/drivers/char/hw_random/npcm-rng.c
> > > > > @@ -0,0 +1,207 @@
> > > > > +// SPDX-License-Identifier: GPL-2.0
> > > > > +// Copyright (c) 2019 Nuvoton Technology corporation.
> > > > > +
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +#include 
> > > > > +
> > > > > +#define NPCM_RNGCS_REG   0x00/* Control and status
> > > > register */
> > > > > +#define NPCM_RNGD_REG0x04/* Data register */
> > > > > +#define NPCM_RNGMODE_REG 0x08/* Mode register */
> > > > > +
> > > > > +#define NPCM_RNG_CLK_SET_25MHZ   GENMASK(4, 3) /* 20-25 MHz */
> > > > > +#define NPCM_RNG_DATA_VALID  BIT(1)
> > > > > +#define NPCM_RNG_ENABLE  BIT(0)
> > > > > +#define NPCM_RNG_M1ROSEL BIT(1)
> > > > > +
> > > > > +#define NPCM_RNG_TIMEOUT_POLL 20
> > > >
> > > > Might be better to define this in real-world units (such as
> > > > milliseconds) since the timeout is effectively the longest time the
> > > > hardware can take to generate 4 bytes.
> > > >
> > > > > +
> > > > > +#define to_npcm_rng(p)   container_of(p, struct npcm_rng, rng)
> > > > > +
> > > > > +struct npcm_rng {
> > > > > + void __iomem *base;
> > > > > + struct hwrng rng;
> > > > > +};
> > > > > +
> > > > > +static int npcm_rng_init(struct hwrng *rng)
> > > > > +{
> > > > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > > > + u32 val;
> > > > > +
> > > > > + val = readl(priv->base + NPCM_RNGCS_REG);
> > > > > + val |= NPCM_RNG_ENABLE;
> > > > > + writel(val, priv->base + NPCM_RNGCS_REG);
> > > > > +
> > > > > + return 0;
> > > > > +}
> > > > > +
> > > > > +static void npcm_rng_cleanup(struct hwrng *rng)
> > > > > +{
> > > > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > > > + u32 val;
> > > > > +
> > > > > + val = readl(priv->base + NPCM_RNGCS_REG);
> > > > > + val &= ~NPCM_RNG_ENABLE;
> > > > > + writel(val, priv->base + NPCM_RNGCS_REG);
> > > > > +}
> > > > > +
> > > > > +static bool npcm_rng_wait_ready(struct hwrng *rng, bool wait)
> > > > > +{
> > > > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > > > + int timeout_cnt = 0;
> > > > > + int ready;
> > > > > +
> > > > > + ready = readl

Re: [PATCH v2 2/2] hwrng: npcm: add NPCM RNG driver

2019-09-10 Thread Daniel Thompson
  break;

> + }
> +
> + *(u32 *)buf = readl(priv->base + NPCM_RNGD_REG);
> + retval += sizeof(u32);
> + buf += sizeof(u32);
> + max -= sizeof(u32);
> + }
> +
> + pm_runtime_mark_last_busy((struct device *)priv->rng.priv);
> + pm_runtime_put_sync_autosuspend((struct device *)priv->rng.priv);
> +
> + return retval || !wait ? retval : -EIO;
> +}
> +
> +static int npcm_rng_probe(struct platform_device *pdev)
> +{
> + struct npcm_rng *priv;
> + struct resource *res;
> + bool pm_dis = false;
> + u32 quality;
> + int ret;
> +
> + priv = devm_kzalloc(&pdev->dev, sizeof(*priv), GFP_KERNEL);
> + if (!priv)
> + return -ENOMEM;
> +
> + res = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> + priv->base = devm_ioremap_resource(&pdev->dev, res);
> + if (IS_ERR(priv->base))
> + return PTR_ERR(priv->base);
> +
> + priv->rng.name = pdev->name;
> +#ifndef CONFIG_PM
> + pm_dis = true;
> + priv->rng.init = npcm_rng_init;
> + priv->rng.cleanup = npcm_rng_cleanup;
> +#endif
> + priv->rng.read = npcm_rng_read;
> + priv->rng.priv = (unsigned long)&pdev->dev;
> + if (of_property_read_u32(pdev->dev.of_node, "quality", &quality))
> + priv->rng.quality = 1000;
> + else
> + priv->rng.quality = quality;
> +
> + writel(NPCM_RNG_M1ROSEL, priv->base + NPCM_RNGMODE_REG);
> + if (pm_dis)
> + writel(NPCM_RNG_CLK_SET_25MHZ, priv->base + NPCM_RNGCS_REG);
> + else
> + writel(NPCM_RNG_CLK_SET_25MHZ | NPCM_RNG_ENABLE,
> +priv->base + NPCM_RNGCS_REG);

This still doesn't seem right, and it's not simply because pm_dis is an
obfuscated way to write !IS_ENABLED(CONFIG_PM).

I'd like to understand why the call to pm_runtime_get_sync() isn't
resulting in the device resume callback running... is it simply
because hwrng_register() happens before pm_runtime_enable()?
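
(If that turns out to be the cause, a rough, untested sketch of the
reordered probe tail would be to enable runtime PM first and only then
register:

	dev_set_drvdata(&pdev->dev, priv);
	pm_runtime_set_autosuspend_delay(&pdev->dev, 100);
	pm_runtime_use_autosuspend(&pdev->dev);
	pm_runtime_enable(&pdev->dev);

	ret = devm_hwrng_register(&pdev->dev, &priv->rng);
	if (ret) {
		dev_err(&pdev->dev, "Failed to register rng device: %d\n",
			ret);
		pm_runtime_disable(&pdev->dev);
		return ret;
	}

	return 0;

so the first read's pm_runtime_get_sync() has an enabled device to
resume.)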


Daniel.

> +
> + ret = devm_hwrng_register(&pdev->dev, &priv->rng);
> + if (ret) {
> + dev_err(&pdev->dev, "Failed to register rng device: %d\n",
> + ret);
> + return ret;
> + }
> +
> + dev_set_drvdata(&pdev->dev, priv);
> + pm_runtime_set_autosuspend_delay(&pdev->dev, 100);
> + pm_runtime_use_autosuspend(&pdev->dev);
> + pm_runtime_enable(&pdev->dev);
> +
> + return 0;
> +}
> +
> +static int npcm_rng_remove(struct platform_device *pdev)
> +{
> + struct npcm_rng *priv = platform_get_drvdata(pdev);
> +
> + hwrng_unregister(&priv->rng);
> + pm_runtime_disable(&pdev->dev);
> + pm_runtime_set_suspended(&pdev->dev);
> +
> + return 0;
> +}
> +
> +#ifdef CONFIG_PM
> +static int npcm_rng_runtime_suspend(struct device *dev)
> +{
> + struct npcm_rng *priv = dev_get_drvdata(dev);
> +
> + npcm_rng_cleanup(&priv->rng);
> +
> + return 0;
> +}
> +
> +static int npcm_rng_runtime_resume(struct device *dev)
> +{
> + struct npcm_rng *priv = dev_get_drvdata(dev);
> +
> + return npcm_rng_init(&priv->rng);
> +}
> +#endif
> +
> +static const struct dev_pm_ops npcm_rng_pm_ops = {
> + SET_RUNTIME_PM_OPS(npcm_rng_runtime_suspend,
> +npcm_rng_runtime_resume, NULL)
> + SET_SYSTEM_SLEEP_PM_OPS(pm_runtime_force_suspend,
> + pm_runtime_force_resume)
> +};
> +
> +static const struct of_device_id rng_dt_id[] = {
> + { .compatible = "nuvoton,npcm750-rng",  },
> + {},
> +};
> +MODULE_DEVICE_TABLE(of, rng_dt_id);
> +
> +static struct platform_driver npcm_rng_driver = {
> + .driver = {
> + .name   = "npcm-rng",
> + .pm = &npcm_rng_pm_ops,
> + .owner  = THIS_MODULE,
> + .of_match_table = of_match_ptr(rng_dt_id),
> + },
> + .probe  = npcm_rng_probe,
> + .remove = npcm_rng_remove,
> +};
> +
> +module_platform_driver(npcm_rng_driver);
> +
> +MODULE_DESCRIPTION("Nuvoton NPCM Random Number Generator Driver");
> +MODULE_AUTHOR("Tomer Maimon ");
> +MODULE_LICENSE("GPL v2");
> -- 
> 2.18.0
> 


Re: [PATCH v2 1/2] dt-binding: hwrng: add NPCM RNG documentation

2019-09-10 Thread Daniel Thompson
On Mon, Sep 09, 2019 at 03:38:39PM +0300, Tomer Maimon wrote:
> Added device tree binding documentation for Nuvoton BMC
> NPCM Random Number Generator (RNG).
> 
> Signed-off-by: Tomer Maimon 
> ---
>  .../bindings/rng/nuvoton,npcm-rng.txt   | 17 +
>  1 file changed, 17 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> 
> diff --git a/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt 
> b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> new file mode 100644
> index ..a697b4425fb3
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/rng/nuvoton,npcm-rng.txt
> @@ -0,0 +1,17 @@
> +NPCM SoC Random Number Generator
> +
> +Required properties:
> +- compatible  : "nuvoton,npcm750-rng" for the NPCM7XX BMC.
> +- reg : Specifies physical base address and size of the registers.
> +
> +Optional property:
> +- quality : estimated number of bits of true entropy per 1024 bits
> + read from the rng.
> + If this property is not defined, it defaults to 1000.

There are pending unreplied review comments about this property (my own
as it happens):
https://patchwork.kernel.org/patch/9371/


Daniel.

> +
> +Example:
> +
> +rng: rng@f000b000 {
> + compatible = "nuvoton,npcm750-rng";
> + reg = <0xf000b000 0x8>;
> +};
> -- 
> 2.18.0
> 


Re: [PATCH v1 2/2] hwrng: npcm: add NPCM RNG driver

2019-09-09 Thread Daniel Thompson
On Mon, Sep 09, 2019 at 05:31:30PM +0300, Tomer Maimon wrote:
> Hi Daniel,
> 
> appreciate your comments and sorry for the late reply
> 
> On Thu, 29 Aug 2019 at 13:47, Daniel Thompson 
> wrote:
> 
> > On Wed, Aug 28, 2019 at 07:26:17PM +0300, Tomer Maimon wrote:
> > > Add Nuvoton NPCM BMC Random Number Generator(RNG) driver.
> > >
> > > Signed-off-by: Tomer Maimon 
> > > ---
> > >  drivers/char/hw_random/Kconfig|  13 ++
> > >  drivers/char/hw_random/Makefile   |   1 +
> > >  drivers/char/hw_random/npcm-rng.c | 207 ++
> > >  3 files changed, 221 insertions(+)
> > >  create mode 100644 drivers/char/hw_random/npcm-rng.c
> > >
> > > diff --git a/drivers/char/hw_random/npcm-rng.c
> > b/drivers/char/hw_random/npcm-rng.c
> > > new file mode 100644
> > > index ..5b4b1b6cb362
> > > --- /dev/null
> > > +++ b/drivers/char/hw_random/npcm-rng.c
> > > @@ -0,0 +1,207 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +// Copyright (c) 2019 Nuvoton Technology corporation.
> > > +
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +#include 
> > > +
> > > +#define NPCM_RNGCS_REG   0x00/* Control and status
> > register */
> > > +#define NPCM_RNGD_REG0x04/* Data register */
> > > +#define NPCM_RNGMODE_REG 0x08/* Mode register */
> > > +
> > > +#define NPCM_RNG_CLK_SET_25MHZ   GENMASK(4, 3) /* 20-25 MHz */
> > > +#define NPCM_RNG_DATA_VALID  BIT(1)
> > > +#define NPCM_RNG_ENABLE  BIT(0)
> > > +#define NPCM_RNG_M1ROSEL BIT(1)
> > > +
> > > +#define NPCM_RNG_TIMEOUT_POLL20
> >
> > Might be better to define this in real-world units (such as
> > milliseconds) since the timeout is effectively the longest time the
> > hardware can take to generate 4 bytes.
> >
> > > +
> > > +#define to_npcm_rng(p)   container_of(p, struct npcm_rng, rng)
> > > +
> > > +struct npcm_rng {
> > > + void __iomem *base;
> > > + struct hwrng rng;
> > > +};
> > > +
> > > +static int npcm_rng_init(struct hwrng *rng)
> > > +{
> > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > + u32 val;
> > > +
> > > + val = readl(priv->base + NPCM_RNGCS_REG);
> > > + val |= NPCM_RNG_ENABLE;
> > > + writel(val, priv->base + NPCM_RNGCS_REG);
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static void npcm_rng_cleanup(struct hwrng *rng)
> > > +{
> > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > + u32 val;
> > > +
> > > + val = readl(priv->base + NPCM_RNGCS_REG);
> > > + val &= ~NPCM_RNG_ENABLE;
> > > + writel(val, priv->base + NPCM_RNGCS_REG);
> > > +}
> > > +
> > > +static bool npcm_rng_wait_ready(struct hwrng *rng, bool wait)
> > > +{
> > > + struct npcm_rng *priv = to_npcm_rng(rng);
> > > + int timeout_cnt = 0;
> > > + int ready;
> > > +
> > > + ready = readl(priv->base + NPCM_RNGCS_REG) & NPCM_RNG_DATA_VALID;
> > > + while ((ready == 0) && (timeout_cnt < NPCM_RNG_TIMEOUT_POLL)) {
> > > + usleep_range(500, 1000);
> > > + ready = readl(priv->base + NPCM_RNGCS_REG) &
> > > + NPCM_RNG_DATA_VALID;
> > > + timeout_cnt++;
> > > + }
> > > +
> > > + return !!ready;
> > > +}
> >
> > This looks like an open-coded version of readl_poll_timeout()... better
> > to use the library function.
> >
> > Also the sleep looks a bit long to me. What is the generation rate of
> > the peripheral? Most RNG drivers have short intervals between data
> > generation so they use delays rather than sleeps (a.k.a.
> > readl_poll_timeout_atomic() ).
>
> the HWRNG generates a byte of random data in a few milliseconds, so it
> is better to sleep while polling.

That's fine, just use readl_poll_timeout() then.
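
Something along these lines (untested sketch):

	#include <linux/iopoll.h>

	static bool npcm_rng_wait_ready(struct hwrng *rng, bool wait)
	{
		struct npcm_rng *priv = to_npcm_rng(rng);
		u32 val;

		/* Poll the data-valid bit, sleeping between reads, ~20ms max. */
		return !readl_poll_timeout(priv->base + NPCM_RNGCS_REG, val,
					   val & NPCM_RNG_DATA_VALID,
					   1000, 20 * 1000);
	}

which also keeps the timeout expressed in real-world units rather than
as a loop count.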


> > > +
> > > +static int npcm_rng_read(struct hwrng *rng, void *buf, size_t max, bool
> > wait)
> > > +{
> > 

[PATCH v3 9/9] padata: remove cpu_index from the parallel_queue

2019-09-05 Thread Daniel Jordan
With the removal of the ENODATA case from padata_get_next, the cpu_index
field is no longer useful, so it can go away.

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/padata.h |  2 --
 kernel/padata.c| 13 ++---
 2 files changed, 2 insertions(+), 13 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index 43d3fd9d17fc..23717eeaad23 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -75,14 +75,12 @@ struct padata_serial_queue {
  * @swork: work struct for serialization.
  * @work: work struct for parallelization.
  * @num_obj: Number of objects that are processed by this cpu.
- * @cpu_index: Index of the cpu.
  */
 struct padata_parallel_queue {
struct padata_listparallel;
struct padata_listreorder;
struct work_structwork;
atomic_t  num_obj;
-   int   cpu_index;
 };
 
 /**
diff --git a/kernel/padata.c b/kernel/padata.c
index 832224dcf2e1..c3fec1413295 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -400,21 +400,12 @@ static void padata_init_squeues(struct parallel_data *pd)
 /* Initialize all percpu queues used by parallel workers */
 static void padata_init_pqueues(struct parallel_data *pd)
 {
-   int cpu_index, cpu;
+   int cpu;
struct padata_parallel_queue *pqueue;
 
-   cpu_index = 0;
-   for_each_possible_cpu(cpu) {
+   for_each_cpu(cpu, pd->cpumask.pcpu) {
pqueue = per_cpu_ptr(pd->pqueue, cpu);
 
-   if (!cpumask_test_cpu(cpu, pd->cpumask.pcpu)) {
-   pqueue->cpu_index = -1;
-   continue;
-   }
-
-   pqueue->cpu_index = cpu_index;
-   cpu_index++;
-
__padata_list_init(&pqueue->reorder);
__padata_list_init(&pqueue->parallel);
INIT_WORK(&pqueue->work, padata_parallel_worker);
-- 
2.23.0



[PATCH v3 5/9] pcrypt: remove padata cpumask notifier

2019-09-05 Thread Daniel Jordan
Now that padata_do_parallel takes care of finding an alternate callback
CPU, there's no need for pcrypt's callback cpumask, so remove it and the
notifier callback that keeps it in sync.

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 crypto/pcrypt.c | 125 +++-
 1 file changed, 18 insertions(+), 107 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index efca962ab12a..2ec36e6a132f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -18,33 +18,8 @@
 #include 
 #include 
 
-struct padata_pcrypt {
-   struct padata_instance *pinst;
-
-   /*
-* Cpumask for callback CPUs. It should be
-* equal to serial cpumask of corresponding padata instance,
-* so it is updated when padata notifies us about serial
-* cpumask change.
-*
-* cb_cpumask is protected by RCU. This fact prevents us from
-* using cpumask_var_t directly because the actual type of
-* cpumsak_var_t depends on kernel configuration(particularly on
-* CONFIG_CPUMASK_OFFSTACK macro). Depending on the configuration
-* cpumask_var_t may be either a pointer to the struct cpumask
-* or a variable allocated on the stack. Thus we can not safely use
-* cpumask_var_t with RCU operations such as rcu_assign_pointer or
-* rcu_dereference. So cpumask_var_t is wrapped with struct
-* pcrypt_cpumask which makes possible to use it with RCU.
-*/
-   struct pcrypt_cpumask {
-   cpumask_var_t mask;
-   } *cb_cpumask;
-   struct notifier_block nblock;
-};
-
-static struct padata_pcrypt pencrypt;
-static struct padata_pcrypt pdecrypt;
+static struct padata_instance *pencrypt;
+static struct padata_instance *pdecrypt;
 static struct kset   *pcrypt_kset;
 
 struct pcrypt_instance_ctx {
@@ -128,7 +103,7 @@ static int pcrypt_aead_encrypt(struct aead_request *req)
   req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);
 
-   err = padata_do_parallel(pencrypt.pinst, padata, &ctx->cb_cpu);
+   err = padata_do_parallel(pencrypt, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;
 
@@ -170,7 +145,7 @@ static int pcrypt_aead_decrypt(struct aead_request *req)
   req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);
 
-   err = padata_do_parallel(pdecrypt.pinst, padata, &ctx->cb_cpu);
+   err = padata_do_parallel(pdecrypt, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;
 
@@ -317,36 +292,6 @@ static int pcrypt_create(struct crypto_template *tmpl, 
struct rtattr **tb)
return -EINVAL;
 }
 
-static int pcrypt_cpumask_change_notify(struct notifier_block *self,
-   unsigned long val, void *data)
-{
-   struct padata_pcrypt *pcrypt;
-   struct pcrypt_cpumask *new_mask, *old_mask;
-   struct padata_cpumask *cpumask = (struct padata_cpumask *)data;
-
-   if (!(val & PADATA_CPU_SERIAL))
-   return 0;
-
-   pcrypt = container_of(self, struct padata_pcrypt, nblock);
-   new_mask = kmalloc(sizeof(*new_mask), GFP_KERNEL);
-   if (!new_mask)
-   return -ENOMEM;
-   if (!alloc_cpumask_var(&new_mask->mask, GFP_KERNEL)) {
-   kfree(new_mask);
-   return -ENOMEM;
-   }
-
-   old_mask = pcrypt->cb_cpumask;
-
-   cpumask_copy(new_mask->mask, cpumask->cbcpu);
-   rcu_assign_pointer(pcrypt->cb_cpumask, new_mask);
-   synchronize_rcu();
-
-   free_cpumask_var(old_mask->mask);
-   kfree(old_mask);
-   return 0;
-}
-
 static int pcrypt_sysfs_add(struct padata_instance *pinst, const char *name)
 {
int ret;
@@ -359,63 +304,29 @@ static int pcrypt_sysfs_add(struct padata_instance 
*pinst, const char *name)
return ret;
 }
 
-static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
- const char *name)
+static int pcrypt_init_padata(struct padata_instance **pinst, const char *name)
 {
int ret = -ENOMEM;
-   struct pcrypt_cpumask *mask;
 
get_online_cpus();
 
-   pcrypt->pinst = padata_alloc_possible(name);
-   if (!pcrypt->pinst)
-   goto err;
-
-   mask = kmalloc(sizeof(*mask), GFP_KERNEL);
-   if (!mask)
-   goto err_free_padata;
-   if (!alloc_cpumask_var(&mask->mask, GFP_KERNEL)) {
-   kfree(mask);
-   goto err_free_padata;
-   }
-
-   cpumask_and(mask->mask, cpu_possible_mask, cpu_online_mask);
-   rcu_assign_pointer(pcrypt->cb_cpumask, mask);
-
-   pcrypt->nblock.notifier_ca

[PATCH v3 4/9] padata: make padata_do_parallel find alternate callback CPU

2019-09-05 Thread Daniel Jordan
padata_do_parallel currently returns -EINVAL if the callback CPU isn't
in the callback cpumask.

pcrypt tries to prevent this situation by keeping its own callback
cpumask in sync with padata's and checks that the callback CPU it passes
to padata is valid.  Make padata handle this instead.

padata_do_parallel now takes a pointer to the callback CPU and updates
it for the caller if an alternate CPU is used.  Overall behavior in
terms of which callback CPUs are chosen stays the same.

Prepares for removal of the padata cpumask notifier in pcrypt, which
will fix a lockdep complaint about nested acquisition of the CPU hotplug
lock later in the series.
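
The fallback selection keeps pcrypt's scheme, just inside padata and
against the serial cpumask; roughly (sketch only, see the patch body
for the real code):

	/* *cb_cpu not in cpumask.cbcpu: pick a fallback by index. */
	cpu_index = *cb_cpu % cpumask_weight(pd->cpumask.cbcpu);

	cpu = cpumask_first(pd->cpumask.cbcpu);
	for (i = 0; i < cpu_index; i++)
		cpu = cpumask_next(cpu, pd->cpumask.cbcpu);

	*cb_cpu = cpu;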

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 crypto/pcrypt.c| 33 ++---
 include/linux/padata.h |  2 +-
 kernel/padata.c| 27 ---
 3 files changed, 23 insertions(+), 39 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index d67293063c7f..efca962ab12a 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -57,35 +57,6 @@ struct pcrypt_aead_ctx {
unsigned int cb_cpu;
 };
 
-static int pcrypt_do_parallel(struct padata_priv *padata, unsigned int *cb_cpu,
- struct padata_pcrypt *pcrypt)
-{
-   unsigned int cpu_index, cpu, i;
-   struct pcrypt_cpumask *cpumask;
-
-   cpu = *cb_cpu;
-
-   rcu_read_lock_bh();
-   cpumask = rcu_dereference_bh(pcrypt->cb_cpumask);
-   if (cpumask_test_cpu(cpu, cpumask->mask))
-   goto out;
-
-   if (!cpumask_weight(cpumask->mask))
-   goto out;
-
-   cpu_index = cpu % cpumask_weight(cpumask->mask);
-
-   cpu = cpumask_first(cpumask->mask);
-   for (i = 0; i < cpu_index; i++)
-   cpu = cpumask_next(cpu, cpumask->mask);
-
-   *cb_cpu = cpu;
-
-out:
-   rcu_read_unlock_bh();
-   return padata_do_parallel(pcrypt->pinst, padata, cpu);
-}
-
 static int pcrypt_aead_setkey(struct crypto_aead *parent,
  const u8 *key, unsigned int keylen)
 {
@@ -157,7 +128,7 @@ static int pcrypt_aead_encrypt(struct aead_request *req)
   req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);
 
-   err = pcrypt_do_parallel(padata, &ctx->cb_cpu, &pencrypt);
+   err = padata_do_parallel(pencrypt.pinst, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;
 
@@ -199,7 +170,7 @@ static int pcrypt_aead_decrypt(struct aead_request *req)
   req->cryptlen, req->iv);
aead_request_set_ad(creq, req->assoclen);
 
-   err = pcrypt_do_parallel(padata, &ctx->cb_cpu, &pdecrypt);
+   err = padata_do_parallel(pdecrypt.pinst, padata, &ctx->cb_cpu);
if (!err)
return -EINPROGRESS;
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 839d9319920a..f7851f8e2190 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -154,7 +154,7 @@ struct padata_instance {
 extern struct padata_instance *padata_alloc_possible(const char *name);
 extern void padata_free(struct padata_instance *pinst);
 extern int padata_do_parallel(struct padata_instance *pinst,
- struct padata_priv *padata, int cb_cpu);
+ struct padata_priv *padata, int *cb_cpu);
 extern void padata_do_serial(struct padata_priv *padata);
 extern int padata_set_cpumask(struct padata_instance *pinst, int cpumask_type,
  cpumask_var_t cpumask);
diff --git a/kernel/padata.c b/kernel/padata.c
index 58728cd7f40c..9a17922ec436 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -94,17 +94,19 @@ static void padata_parallel_worker(struct work_struct 
*parallel_work)
  *
  * @pinst: padata instance
  * @padata: object to be parallelized
- * @cb_cpu: cpu the serialization callback function will run on,
- *  must be in the serial cpumask of padata(i.e. cpumask.cbcpu).
+ * @cb_cpu: pointer to the CPU that the serialization callback function should
+ *  run on.  If it's not in the serial cpumask of @pinst
+ *  (i.e. cpumask.cbcpu), this function selects a fallback CPU and if
+ *  none found, returns -EINVAL.
  *
  * The parallelization callback function will run with BHs off.
  * Note: Every object which is parallelized by padata_do_parallel
  * must be seen by padata_do_serial.
  */
 int padata_do_parallel(struct padata_instance *pinst,
-  struct padata_priv *padata, int cb_cpu)
+  struct padata_priv *padata, int *cb_cpu)
 {
-   int target_cpu, err;
+   int i, cpu, cpu_index, target_cpu, err;
struct padata_parallel_queue *

[PATCH v3 1/9] padata: allocate workqueue internally

2019-09-05 Thread Daniel Jordan
Move workqueue allocation inside of padata to prepare for further
changes to how padata uses workqueues.

Guarantees the workqueue is created with max_active=1, which padata
relies on to work correctly.  No functional change.

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Jonathan Corbet 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-...@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 Documentation/padata.txt | 12 ++--
 crypto/pcrypt.c  | 13 ++---
 include/linux/padata.h   |  3 +--
 kernel/padata.c  | 24 +++-
 4 files changed, 24 insertions(+), 28 deletions(-)

diff --git a/Documentation/padata.txt b/Documentation/padata.txt
index b103d0c82000..b37ba1eaace3 100644
--- a/Documentation/padata.txt
+++ b/Documentation/padata.txt
@@ -16,10 +16,12 @@ overall control of how tasks are to be run::
 
 #include 
 
-struct padata_instance *padata_alloc(struct workqueue_struct *wq,
+struct padata_instance *padata_alloc(const char *name,
 const struct cpumask *pcpumask,
 const struct cpumask *cbcpumask);
 
+'name' simply identifies the instance.
+
 The pcpumask describes which processors will be used to execute work
 submitted to this instance in parallel. The cbcpumask defines which
 processors are allowed to be used as the serialization callback processor.
@@ -128,8 +130,7 @@ in that CPU mask or about a not running instance.
 
 Each task submitted to padata_do_parallel() will, in turn, be passed to
 exactly one call to the above-mentioned parallel() function, on one CPU, so
-true parallelism is achieved by submitting multiple tasks.  Despite the
-fact that the workqueue is used to make these calls, parallel() is run with
+true parallelism is achieved by submitting multiple tasks.  parallel() runs 
with
 software interrupts disabled and thus cannot sleep.  The parallel()
 function gets the padata_priv structure pointer as its lone parameter;
 information about the actual work to be done is probably obtained by using
@@ -148,7 +149,7 @@ fact with a call to::
 At some point in the future, padata_do_serial() will trigger a call to the
 serial() function in the padata_priv structure.  That call will happen on
 the CPU requested in the initial call to padata_do_parallel(); it, too, is
-done through the workqueue, but with local software interrupts disabled.
+run with local software interrupts disabled.
 Note that this call may be deferred for a while since the padata code takes
 pains to ensure that tasks are completed in the order in which they were
 submitted.
@@ -159,5 +160,4 @@ when a padata instance is no longer needed::
 void padata_free(struct padata_instance *pinst);
 
 This function will busy-wait while any remaining tasks are completed, so it
-might be best not to call it while there is work outstanding.  Shutting
-down the workqueue, if necessary, should be done separately.
+might be best not to call it while there is work outstanding.
diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 0edf5b54fc77..d67293063c7f 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -20,7 +20,6 @@
 
 struct padata_pcrypt {
struct padata_instance *pinst;
-   struct workqueue_struct *wq;
 
/*
 * Cpumask for callback CPUs. It should be
@@ -397,14 +396,9 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
 
get_online_cpus();
 
-   pcrypt->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
-1, name);
-   if (!pcrypt->wq)
-   goto err;
-
-   pcrypt->pinst = padata_alloc_possible(pcrypt->wq);
+   pcrypt->pinst = padata_alloc_possible(name);
if (!pcrypt->pinst)
-   goto err_destroy_workqueue;
+   goto err;
 
mask = kmalloc(sizeof(*mask), GFP_KERNEL);
if (!mask)
@@ -437,8 +431,6 @@ static int pcrypt_init_padata(struct padata_pcrypt *pcrypt,
kfree(mask);
 err_free_padata:
padata_free(pcrypt->pinst);
-err_destroy_workqueue:
-   destroy_workqueue(pcrypt->wq);
 err:
put_online_cpus();
 
@@ -452,7 +444,6 @@ static void pcrypt_fini_padata(struct padata_pcrypt *pcrypt)
 
padata_stop(pcrypt->pinst);
padata_unregister_cpumask_notifier(pcrypt->pinst, &pcrypt->nblock);
-   destroy_workqueue(pcrypt->wq);
padata_free(pcrypt->pinst);
 }
 
diff --git a/include/linux/padata.h b/include/linux/padata.h
index 8da673861d99..839d9319920a 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -151,8 +151,7 @@ struct padata_instance {
 #definePADATA_INVALID  4
 };
 
-extern struct padata_instance *padata_alloc_possible(
-   struct workqueue_struct *wq);
+extern struct padata_

[PATCH v3 0/9] padata: use unbound workqueues for parallel jobs

2019-09-05 Thread Daniel Jordan
48  24653  218 143756 508
   5.6x   1604096  24333   20 136752 548
   5.0x   1608192  23310   15 117660 481

(pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=211)

   2.4x   160  16  5347148279 128047   31328
   3.4x   160  64  3771220855 128187   31074
   4.5x   160 256  27911 4378 126430   31084
   4.9x   160 512  25346  175 123870   29099
   3.1x   1601024  3845223118 120817   26846
   4.7x   1602048  24612  187 115036   23942
   4.5x   1604096  24217  114 109583   21559
   4.2x   1608192  23144  108  96850   16686

multibuffer (pcrypt(rfc4106-gcm-aesni)) encryption (tcrypt mode=215)

   1.0x   160  16 412157 3855 4269731591
   1.0x   160  64 412600 4410 4319204224
   1.1x   160 256 410352 3254 453691   17831
   1.2x   160 512 406293 4948 473491   39818
   1.2x   1601024 395123 7804 478539   27660
   1.2x   1602048 385144 7601 453720   17579
   1.2x   1604096 371989 3631 449923   15331
   1.2x   1608192 346723 1617 399824   18559

multibuffer (pcrypt(rfc4106-gcm-aesni)) decryption (tcrypt mode=215)

   1.1x   160  16 407317 1487 452619   14404
   1.1x   160  64 411821 4261 464059   23541
   1.2x   160 256 408941 4945 477483   36576
   1.2x   160 512 406451  611 472661   11038
   1.2x   1601024 394813 2667 456357   11452
   1.2x   1602048 390291 4175 4489288957
   1.2x   1604096 371904 1068 449344   14225
   1.2x   1608192 344227 1973 404397   19540

Testing
---

In addition to the bare metal performance runs above, this series was
tested in a kvm guest with the tcrypt module (mode=215).  All
combinations of CPUs among parallel_cpumask, serial_cpumask, and CPU
hotplug online/offline were run with 3 possible CPUs, and over 2000
random combinations of these were run with 8 possible CPUs.  Workqueue
events were used throughout to verify that all parallel and serial
workers executed on only the CPUs allowed by the cpumask sysfs files.

Finally, tcrypt mode=215 was run at each patch in the series when built
with and without CONFIG_PADATA/CONFIG_CRYPTO_PCRYPT.

v2:  
https://lore.kernel.org/linux-crypto/20190829173038.21040-1-daniel.m.jor...@oracle.com/
v1:  
https://lore.kernel.org/linux-crypto/20190813005224.30779-1-daniel.m.jor...@oracle.com/
RFC: 
https://lore.kernel.org/lkml/20190725212505.15055-1-daniel.m.jor...@oracle.com/

Daniel Jordan (9):
  padata: allocate workqueue internally
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  workqueue: require CPU hotplug read exclusion for
apply_workqueue_attrs
  padata: make padata_do_parallel find alternate callback CPU
  pcrypt: remove padata cpumask notifier
  padata, pcrypt: take CPU hotplug lock internally in
padata_alloc_possible
  padata: use separate workqueues for parallel and serial work
  padata: unbind parallel jobs from specific CPUs
  padata: remove cpu_index from the parallel_queue

 Documentation/padata.txt  |  12 +--
 crypto/pcrypt.c   | 167 ---
 include/linux/padata.h|  16 +--
 include/linux/workqueue.h |   4 +
 kernel/padata.c   | 201 ++
 kernel/workqueue.c|  25 +++--
 6 files changed, 170 insertions(+), 255 deletions(-)

-- 
2.23.0



[PATCH v3 3/9] workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs

2019-09-05 Thread Daniel Jordan
Change the calling convention for apply_workqueue_attrs to require CPU
hotplug read exclusion.

Avoids lockdep complaints about nested calls to get_online_cpus in a
future patch where padata calls apply_workqueue_attrs when changing
other CPU-hotplug-sensitive data structures with the CPU read lock
already held.
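
Callers now wrap the call in CPU hotplug read protection themselves,
e.g. (sketch):

	get_online_cpus();
	ret = apply_workqueue_attrs(wq, attrs);
	put_online_cpus();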

Signed-off-by: Daniel Jordan 
Acked-by: Tejun Heo 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 kernel/workqueue.c | 19 ++-
 1 file changed, 14 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index f53705ff3ff1..bc2e09a8ea61 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4030,6 +4030,8 @@ static int apply_workqueue_attrs_locked(struct 
workqueue_struct *wq,
  *
  * Performs GFP_KERNEL allocations.
  *
+ * Assumes caller has CPU hotplug read exclusion, i.e. get_online_cpus().
+ *
  * Return: 0 on success and -errno on failure.
  */
 int apply_workqueue_attrs(struct workqueue_struct *wq,
@@ -4037,9 +4039,11 @@ int apply_workqueue_attrs(struct workqueue_struct *wq,
 {
int ret;
 
-   apply_wqattrs_lock();
+   lockdep_assert_cpus_held();
+
+   mutex_lock(&wq_pool_mutex);
ret = apply_workqueue_attrs_locked(wq, attrs);
-   apply_wqattrs_unlock();
+   mutex_unlock(&wq_pool_mutex);
 
return ret;
 }
@@ -4152,16 +4156,21 @@ static int alloc_and_link_pwqs(struct workqueue_struct 
*wq)
mutex_unlock(&wq->mutex);
}
return 0;
-   } else if (wq->flags & __WQ_ORDERED) {
+   }
+
+   get_online_cpus();
+   if (wq->flags & __WQ_ORDERED) {
ret = apply_workqueue_attrs(wq, ordered_wq_attrs[highpri]);
/* there should only be single pwq for ordering guarantee */
WARN(!ret && (wq->pwqs.next != &wq->dfl_pwq->pwqs_node ||
  wq->pwqs.prev != &wq->dfl_pwq->pwqs_node),
 "ordering guarantee broken for workqueue %s\n", wq->name);
-   return ret;
} else {
-   return apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
+   ret = apply_workqueue_attrs(wq, unbound_std_wq_attrs[highpri]);
}
+   put_online_cpus();
+
+   return ret;
 }
 
 static int wq_clamp_max_active(int max_active, unsigned int flags,
-- 
2.23.0



[PATCH v3 6/9] padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible

2019-09-05 Thread Daniel Jordan
With pcrypt's cpumask no longer used, take the CPU hotplug lock inside
padata_alloc_possible.

Useful later in the series for avoiding nested acquisition of the CPU
hotplug lock in padata when padata_alloc_possible is allocating an
unbound workqueue.

Without this patch, this nested acquisition would happen later in the
series:

  pcrypt_init_padata
get_online_cpus
alloc_padata_possible
  alloc_padata
alloc_workqueue(WQ_UNBOUND)   // later in the series
  alloc_and_link_pwqs
apply_wqattrs_lock
  get_online_cpus // recursive rwsem acquisition

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 crypto/pcrypt.c |  4 
 kernel/padata.c | 17 +
 2 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/crypto/pcrypt.c b/crypto/pcrypt.c
index 2ec36e6a132f..543792e0ebf0 100644
--- a/crypto/pcrypt.c
+++ b/crypto/pcrypt.c
@@ -308,8 +308,6 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
 {
int ret = -ENOMEM;
 
-   get_online_cpus();
-
*pinst = padata_alloc_possible(name);
if (!*pinst)
return ret;
@@ -318,8 +316,6 @@ static int pcrypt_init_padata(struct padata_instance 
**pinst, const char *name)
if (ret)
padata_free(*pinst);
 
-   put_online_cpus();
-
return ret;
 }
 
diff --git a/kernel/padata.c b/kernel/padata.c
index 9a17922ec436..8a362923c488 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -955,8 +955,6 @@ static struct kobj_type padata_attr_type = {
  * @name: used to identify the instance
  * @pcpumask: cpumask that will be used for padata parallelization
  * @cbcpumask: cpumask that will be used for padata serialization
- *
- * Must be called from a cpus_read_lock() protected region
  */
 static struct padata_instance *padata_alloc(const char *name,
const struct cpumask *pcpumask,
@@ -974,11 +972,13 @@ static struct padata_instance *padata_alloc(const char 
*name,
if (!pinst->wq)
goto err_free_inst;
 
+   get_online_cpus();
+
if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
-   goto err_free_wq;
+   goto err_put_cpus;
if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
free_cpumask_var(pinst->cpumask.pcpu);
-   goto err_free_wq;
+   goto err_put_cpus;
}
if (!padata_validate_cpumask(pinst, pcpumask) ||
!padata_validate_cpumask(pinst, cbcpumask))
@@ -1002,12 +1002,16 @@ static struct padata_instance *padata_alloc(const char 
*name,
 #ifdef CONFIG_HOTPLUG_CPU
cpuhp_state_add_instance_nocalls_cpuslocked(hp_online, &pinst->node);
 #endif
+
+   put_online_cpus();
+
return pinst;
 
 err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
-err_free_wq:
+err_put_cpus:
+   put_online_cpus();
destroy_workqueue(pinst->wq);
 err_free_inst:
kfree(pinst);
@@ -1021,12 +1025,9 @@ static struct padata_instance *padata_alloc(const char 
*name,
  * parallel workers.
  *
  * @name: used to identify the instance
- *
- * Must be called from a cpus_read_lock() protected region
  */
 struct padata_instance *padata_alloc_possible(const char *name)
 {
-   lockdep_assert_cpus_held();
return padata_alloc(name, cpu_possible_mask, cpu_possible_mask);
 }
 EXPORT_SYMBOL(padata_alloc_possible);
-- 
2.23.0



[PATCH v3 8/9] padata: unbind parallel jobs from specific CPUs

2019-09-05 Thread Daniel Jordan
Padata binds the parallel part of a job to a single CPU and round-robins
over all CPUs in the system for each successive job.  Though the serial
parts rely on per-CPU queues for correct ordering, they're not necessary
for parallel work, and it improves performance to run the job locally on
NUMA machines and let the scheduler pick the CPU within a node on a busy
system.

So, make the parallel workqueue unbound.

Update the parallel workqueue's cpumask when the instance's parallel
cpumask changes.

Now that parallel jobs no longer run on max_active=1 workqueues, two or
more parallel works that hash to the same CPU may run simultaneously,
finish out of order, and so be serialized out of order.  Prevent this by
keeping the works sorted on the reorder list by sequence number and
checking that in the reordering logic.

padata_get_next becomes padata_find_next so it can be reused for the end
of padata_reorder, where it's used to avoid uselessly queueing work when
the next job by sequence number isn't finished yet but a later job that
hashed to the same CPU has.

The ENODATA case in padata_find_next no longer makes sense because
parallel jobs aren't bound to specific CPUs.  The EINPROGRESS case takes
care of the scenario where a parallel job is potentially running on the
same CPU as padata_find_next, and with only one error code left, just
use NULL instead.
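
The sorted insert is roughly (illustrative sketch; names as used in
this series):

	struct padata_priv *cur;

	/* Keep the per-CPU reorder list sorted by sequence number so
	 * works that finish out of order still get serialized in order. */
	list_for_each_entry_reverse(cur, &pqueue->reorder.list, list)
		if (cur->seq_nr < padata->seq_nr)
			break;
	list_add(&padata->list, &cur->list);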

Signed-off-by: Daniel Jordan 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Steffen Klassert 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/padata.h |   3 ++
 kernel/padata.c| 118 +++--
 2 files changed, 68 insertions(+), 53 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index e7978f8942ca..43d3fd9d17fc 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -35,6 +35,7 @@ struct padata_priv {
struct parallel_data*pd;
int cb_cpu;
int cpu;
+   unsigned intseq_nr;
int info;
void(*parallel)(struct padata_priv *padata);
void(*serial)(struct padata_priv *padata);
@@ -105,6 +106,7 @@ struct padata_cpumask {
  * @reorder_objects: Number of objects waiting in the reorder queues.
  * @refcnt: Number of objects holding a reference on this parallel_data.
  * @max_seq_nr:  Maximal used sequence number.
+ * @processed: Number of already processed objects.
  * @cpu: Next CPU to be processed.
  * @cpumask: The cpumasks in use for parallel and serial workers.
  * @reorder_work: work struct for reordering.
@@ -117,6 +119,7 @@ struct parallel_data {
atomic_treorder_objects;
atomic_trefcnt;
atomic_tseq_nr;
+   unsigned intprocessed;
int cpu;
struct padata_cpumask   cpumask;
struct work_struct  reorder_work;
diff --git a/kernel/padata.c b/kernel/padata.c
index 669f5d53d357..832224dcf2e1 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -46,18 +46,13 @@ static int padata_index_to_cpu(struct parallel_data *pd, 
int cpu_index)
return target_cpu;
 }
 
-static int padata_cpu_hash(struct parallel_data *pd)
+static int padata_cpu_hash(struct parallel_data *pd, unsigned int seq_nr)
 {
-   unsigned int seq_nr;
-   int cpu_index;
-
/*
 * Hash the sequence numbers to the cpus by taking
 * seq_nr mod. number of cpus in use.
 */
-
-   seq_nr = atomic_inc_return(&pd->seq_nr);
-   cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
+   int cpu_index = seq_nr % cpumask_weight(pd->cpumask.pcpu);
 
return padata_index_to_cpu(pd, cpu_index);
 }
@@ -144,7 +139,8 @@ int padata_do_parallel(struct padata_instance *pinst,
padata->pd = pd;
padata->cb_cpu = *cb_cpu;
 
-   target_cpu = padata_cpu_hash(pd);
+   padata->seq_nr = atomic_inc_return(&pd->seq_nr);
+   target_cpu = padata_cpu_hash(pd, padata->seq_nr);
padata->cpu = target_cpu;
queue = per_cpu_ptr(pd->pqueue, target_cpu);
 
@@ -152,7 +148,7 @@ int padata_do_parallel(struct padata_instance *pinst,
list_add_tail(&padata->list, &queue->parallel.list);
spin_unlock(&queue->parallel.lock);
 
-   queue_work_on(target_cpu, pinst->parallel_wq, &queue->work);
+   queue_work(pinst->parallel_wq, &queue->work);
 
 out:
rcu_read_unlock_bh();
@@ -162,21 +158,19 @@ int padata_do_parallel(struct padata_instance *pinst,
 EXPORT_SYMBOL(padata_do_parallel);
 
 /*
- * padata_get_next - Get the next object that needs serialization.
+ * padata_find_next - Find the next object that needs serialization.
  *
  * R

[PATCH v3 7/9] padata: use separate workqueues for parallel and serial work

2019-09-05 Thread Daniel Jordan
padata currently uses one per-CPU workqueue per instance for all work.

Prepare for running parallel jobs on an unbound workqueue by introducing
dedicated workqueues for parallel and serial work.

Signed-off-by: Daniel Jordan 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/padata.h |  6 --
 kernel/padata.c| 28 ++--
 2 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/include/linux/padata.h b/include/linux/padata.h
index f7851f8e2190..e7978f8942ca 100644
--- a/include/linux/padata.h
+++ b/include/linux/padata.h
@@ -127,7 +127,8 @@ struct parallel_data {
  * struct padata_instance - The overall control structure.
  *
  * @cpu_notifier: cpu hotplug notifier.
- * @wq: The workqueue in use.
+ * @parallel_wq: The workqueue used for parallel work.
+ * @serial_wq: The workqueue used for serial work.
  * @pd: The internal control structure.
  * @cpumask: User supplied cpumasks for parallel and serial works.
  * @cpumask_change_notifier: Notifiers chain for user-defined notify
@@ -139,7 +140,8 @@ struct parallel_data {
  */
 struct padata_instance {
struct hlist_nodenode;
-   struct workqueue_struct *wq;
+   struct workqueue_struct *parallel_wq;
+   struct workqueue_struct *serial_wq;
struct parallel_data*pd;
struct padata_cpumask   cpumask;
struct blocking_notifier_headcpumask_change_notifier;
diff --git a/kernel/padata.c b/kernel/padata.c
index 8a362923c488..669f5d53d357 100644
--- a/kernel/padata.c
+++ b/kernel/padata.c
@@ -152,7 +152,7 @@ int padata_do_parallel(struct padata_instance *pinst,
list_add_tail(&padata->list, &queue->parallel.list);
spin_unlock(&queue->parallel.lock);
 
-   queue_work_on(target_cpu, pinst->wq, &queue->work);
+   queue_work_on(target_cpu, pinst->parallel_wq, &queue->work);
 
 out:
rcu_read_unlock_bh();
@@ -261,7 +261,7 @@ static void padata_reorder(struct parallel_data *pd)
list_add_tail(&padata->list, &squeue->serial.list);
spin_unlock(&squeue->serial.lock);
 
-   queue_work_on(cb_cpu, pinst->wq, &squeue->work);
+   queue_work_on(cb_cpu, pinst->serial_wq, &squeue->work);
}
 
spin_unlock_bh(&pd->lock);
@@ -278,7 +278,7 @@ static void padata_reorder(struct parallel_data *pd)
 
next_queue = per_cpu_ptr(pd->pqueue, pd->cpu);
if (!list_empty(&next_queue->reorder.list))
-   queue_work(pinst->wq, &pd->reorder_work);
+   queue_work(pinst->serial_wq, &pd->reorder_work);
 }
 
 static void invoke_padata_reorder(struct work_struct *work)
@@ -818,7 +818,8 @@ static void __padata_free(struct padata_instance *pinst)
padata_free_pd(pinst->pd);
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
-   destroy_workqueue(pinst->wq);
+   destroy_workqueue(pinst->serial_wq);
+   destroy_workqueue(pinst->parallel_wq);
kfree(pinst);
 }
 
@@ -967,18 +968,23 @@ static struct padata_instance *padata_alloc(const char 
*name,
if (!pinst)
goto err;
 
-   pinst->wq = alloc_workqueue("%s", WQ_MEM_RECLAIM | WQ_CPU_INTENSIVE,
-   1, name);
-   if (!pinst->wq)
+   pinst->parallel_wq = alloc_workqueue("%s_parallel", WQ_MEM_RECLAIM |
+WQ_CPU_INTENSIVE, 1, name);
+   if (!pinst->parallel_wq)
goto err_free_inst;
 
get_online_cpus();
 
-   if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+   pinst->serial_wq = alloc_workqueue("%s_serial", WQ_MEM_RECLAIM |
+  WQ_CPU_INTENSIVE, 1, name);
+   if (!pinst->serial_wq)
goto err_put_cpus;
+
+   if (!alloc_cpumask_var(&pinst->cpumask.pcpu, GFP_KERNEL))
+   goto err_free_serial_wq;
if (!alloc_cpumask_var(&pinst->cpumask.cbcpu, GFP_KERNEL)) {
free_cpumask_var(pinst->cpumask.pcpu);
-   goto err_put_cpus;
+   goto err_free_serial_wq;
}
if (!padata_validate_cpumask(pinst, pcpumask) ||
!padata_validate_cpumask(pinst, cbcpumask))
@@ -1010,9 +1016,11 @@ static struct padata_instance *padata_alloc(const char 
*name,
 err_free_masks:
free_cpumask_var(pinst->cpumask.pcpu);
free_cpumask_var(pinst->cpumask.cbcpu);
+err_free_serial_wq:
+   destroy_workqueue(pinst->serial_wq);
 err_put_cpus:
put_online_cpus();
-   destroy_workqueue(pinst->wq);
+   destroy_workqueue(pinst->parallel_wq);
 err_free_inst:
kfree(pinst);
 err:
-- 
2.23.0



[PATCH v3 2/9] workqueue: unconfine alloc/apply/free_workqueue_attrs()

2019-09-05 Thread Daniel Jordan
padata will use these interfaces in a later patch, so unconfine them.

Signed-off-by: Daniel Jordan 
Acked-by: Tejun Heo 
Acked-by: Steffen Klassert 
Cc: Herbert Xu 
Cc: Lai Jiangshan 
Cc: Peter Zijlstra 
Cc: linux-crypto@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
---
 include/linux/workqueue.h | 4 
 kernel/workqueue.c| 6 +++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/linux/workqueue.h b/include/linux/workqueue.h
index b7c585b5ec1c..4261d1c6e87b 100644
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -435,6 +435,10 @@ struct workqueue_struct *alloc_workqueue(const char *fmt,
 
 extern void destroy_workqueue(struct workqueue_struct *wq);
 
+struct workqueue_attrs *alloc_workqueue_attrs(void);
+void free_workqueue_attrs(struct workqueue_attrs *attrs);
+int apply_workqueue_attrs(struct workqueue_struct *wq,
+ const struct workqueue_attrs *attrs);
 int workqueue_set_unbound_cpumask(cpumask_var_t cpumask);
 
 extern bool queue_work_on(int cpu, struct workqueue_struct *wq,
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 601d61150b65..f53705ff3ff1 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -3329,7 +3329,7 @@ EXPORT_SYMBOL_GPL(execute_in_process_context);
  *
  * Undo alloc_workqueue_attrs().
  */
-static void free_workqueue_attrs(struct workqueue_attrs *attrs)
+void free_workqueue_attrs(struct workqueue_attrs *attrs)
 {
if (attrs) {
free_cpumask_var(attrs->cpumask);
@@ -3345,7 +3345,7 @@ static void free_workqueue_attrs(struct workqueue_attrs 
*attrs)
  *
  * Return: The allocated new workqueue_attr on success. %NULL on failure.
  */
-static struct workqueue_attrs *alloc_workqueue_attrs(void)
+struct workqueue_attrs *alloc_workqueue_attrs(void)
 {
struct workqueue_attrs *attrs;
 
@@ -4032,7 +4032,7 @@ static int apply_workqueue_attrs_locked(struct 
workqueue_struct *wq,
  *
  * Return: 0 on success and -errno on failure.
  */
-static int apply_workqueue_attrs(struct workqueue_struct *wq,
+int apply_workqueue_attrs(struct workqueue_struct *wq,
  const struct workqueue_attrs *attrs)
 {
int ret;
-- 
2.23.0



Re: [PATCH v2 0/9] padata: use unbound workqueues for parallel jobs

2019-09-05 Thread Daniel Jordan
On Thu, Sep 05, 2019 at 02:35:48PM +1000, Herbert Xu wrote:
> On Thu, Aug 29, 2019 at 01:30:29PM -0400, Daniel Jordan wrote:
> > Hello,
> > 
> > Everything in the Testing section has been rerun after the suggestion
> > from Herbert last round.  Thanks again to Steffen for giving this a run.
> > 
> > Any comments welcome.
> > 
> > Daniel
> > 
> > v1[*]  -> v2:
> >  - Updated patch 8 to avoid queueing the reorder work if the next object
> >by sequence number isn't ready yet (Herbert)
> >  - Added Steffen's ack to all but patch 8 since that one changed.
> 
> This doesn't apply against cryptodev.  Perhaps it depends on the
> flushing patch series? If that's the case please combine both into
> one series.

I had developed this on top of the flushing series, but this doesn't depend on
it, so I've rebased this onto today's cryptodev and will send it soon.


Re: [PATCH v2 1/5] padata: make flushing work with async users

2019-09-05 Thread Daniel Jordan
On Thu, Sep 05, 2019 at 02:17:35PM +1000, Herbert Xu wrote:
> On Wed, Aug 28, 2019 at 06:14:21PM -0400, Daniel Jordan wrote:
> >
> > @@ -453,24 +456,15 @@ static void padata_free_pd(struct parallel_data *pd)
> >  /* Flush all objects out of the padata queues. */
> >  static void padata_flush_queues(struct parallel_data *pd)
> >  {
> > -   int cpu;
> > -   struct padata_parallel_queue *pqueue;
> > -   struct padata_serial_queue *squeue;
> > -
> > -   for_each_cpu(cpu, pd->cpumask.pcpu) {
> > -   pqueue = per_cpu_ptr(pd->pqueue, cpu);
> > -   flush_work(&pqueue->work);
> > -   }
> > -
> > -   if (atomic_read(&pd->reorder_objects))
> > -   padata_reorder(pd);
> > +   if (!(pd->pinst->flags & PADATA_INIT))
> > +   return;
> >  
> > -   for_each_cpu(cpu, pd->cpumask.cbcpu) {
> > -   squeue = per_cpu_ptr(pd->squeue, cpu);
> > -   flush_work(&squeue->work);
> > -   }
> > +   if (atomic_dec_return(&pd->refcnt) == 0)
> > +   complete(&pd->flushing_done);
> >  
> > -   BUG_ON(atomic_read(&pd->refcnt) != 0);
> > +   wait_for_completion(&pd->flushing_done);
> > +   reinit_completion(&pd->flushing_done);
> > +   atomic_set(&pd->refcnt, 1);
> >  }
> 
> I don't think waiting is an option.  In a pathological case the
> hardware may not return at all.  We cannot and should not hold off
> CPU hotplug for an arbitrary amount of time when the event we are
> waiting for isn't even occurring on that CPU.

Ok, I hadn't considered hardware not returning.

> I don't think flushing is needed at all.  All we need to do is
> maintain consistency before and after the CPU hotplug event.

I could imagine not flushing would work for replacing a pd.  The old pd could
be freed by whatever drops the last reference and the new pd could be
installed, all without flushing.
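
Roughly (just a sketch of the idea, helper name made up):

	/* Drop a reference on a parallel_data; the last one frees it. */
	static void padata_put_pd(struct parallel_data *pd)
	{
		if (atomic_dec_and_test(&pd->refcnt))
			padata_free_pd(pd);
	}

with each outstanding job holding a reference until it's serialized.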

In the case of freeing an instance, though, padata needs to wait for all the
jobs to complete so they don't use the instance's data after it's been freed.
Holding the CPU hotplug lock isn't necessary for that, so I think we're ok to
wait here.

