Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2023-01-19 Thread Phil Auld
Hi Greg, et alia,

On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:

> > > 
> > > The idea seems good, the implementation might need a bit of work :)
> > 
> > More than the one comment below? Let me know.
> 
> No idea, resubmit a working patch and I'll review it properly :)
> 

I finally got this posted, twice :(. Sorry for the delay; I ran into
what turned out to be an unrelated issue while testing it, plus the
end-of-year holidays and whatnot. 

https://lore.kernel.org/lkml/20230119150758.880189-1-pa...@redhat.com/T/#u


Cheers,
Phil

-- 



Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Wed, Dec 14, 2022 at 10:41:25AM +1100 Michael Ellerman wrote:
> Phil Auld  writes:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> >> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> >> > Hi,
> >> > 
> >> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> >> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> >> > > > 
> >> > > > Thanks Greg & Peter for your direction. 
> >> > > > 
> >> > > > While we pursue the idea of having debugfs based on kernfs, we 
> >> > > > thought about
> >> > > > having a boot time parameter which would disable creating and 
> >> > > > updating of the
> >> > > > sched_domain debugfs files and this would also be useful even when 
> >> > > > the kernfs
> >> > > > solution kicks in, as users who may not care about these debugfs 
> >> > > > files would
> >> > > > benefit from a faster CPU hotplug operation.
> >> > > 
> >> > > Ick, no, you would be adding a new user/kernel api that you will be
> >> > > required to support for the next 20+ years.  Just to get over a
> >> > > short-term issue before you solve the problem properly.
> >> > 
> >> > I'm not convinced moving these files from debugfs to kernfs is the right
> >> > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> >> > I don't think either of those numbers is reasonable.
> >> > 
> >> > The issue as I see it is the full rebuild for every change with no way to
> >> > batch the changes. How about something like the below?
> >> > 
> >> > This puts the domains/* files under the sched_verbose flag. About the 
> >> > only
> >> > thing under that flag now are the detailed topology discovery printks 
> >> > anyway
> >> > so this fits together nicely.
> >> > 
> >> > This way the files would be off by default (assuming you don't boot with
> >> > sched_verbose) and can be created at runtime by enabling verbose. 
> >> > Multiple
> >> > changes could also be batched by disabling/making changes/re-enabling.
> >> > 
> >> > It does not create a new API; it uses one that is already there.
> >> 
> >> The idea seems good, the implementation might need a bit of work :)
> >
> > More than the one comment below? Let me know.
> >
> >> 
> >> > > If you really do not want these debugfs files, just disable debugfs 
> >> > > from
> >> > > your system.  That should be a better short-term solution, right?
> >> > 
> >> > We do find these files useful at times for debugging issues and looking
> >> > at what's going on on the system.
> >> > 
> >> > > 
> >> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> >> > 
> >> > Same with this... it provides useful information at a small cost (modulo
> >> > issues like this). There are also tuning knobs that are only available
> >> > with SCHED_DEBUG. 
> >> > 
> >> > 
> >> > Cheers,
> >> > Phil
> >> > 
> >> > ---
> >> > 
> >> > sched/debug: Put sched/domains files under verbose flag
> >> > 
> >> > The debug files under sched/domains can take a long time to regenerate,
> >> > especially when updates are done one at a time. Move these files under
> >> > the verbose debug flag. Allow changes to verbose to trigger generation
> >> > of the files. This lets a user batch the updates but still have the
> >> > information available.  The detailed topology printk messages are also
> >> > under verbose.
> >> > 
> >> > Signed-off-by: Phil Auld 
> >> > ---
> >> >  kernel/sched/debug.c | 68 ++--
> >> >  1 file changed, 66 insertions(+), 2 deletions(-)
> >> > 
> >> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> >> > index 1637b65ba07a..2eb51ee3ccab 100644
> >> > --- a/kernel/sched/debug.c
> >> > +++ b/kernel/sched/debug.c
> >> > @@ -280,6 +280,31 @@ static const struct file_operations 
> >> > sched_dynamic_fops = {

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> > > On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > > > Hi,
> > > > 
> > > > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > > > 
> > > > > > Thanks Greg & Peter for your direction. 
> > > > > > 
> > > > > > While we pursue the idea of having debugfs based on kernfs, we 
> > > > > > thought about
> > > > > > having a boot time parameter which would disable creating and 
> > > > > > updating of the
> > > > > > sched_domain debugfs files and this would also be useful even when 
> > > > > > the kernfs
> > > > > > solution kicks in, as users who may not care about these debugfs 
> > > > > > files would
> > > > > > benefit from a faster CPU hotplug operation.
> > > > > 
> > > > > Ick, no, you would be adding a new user/kernel api that you will be
> > > > > required to support for the next 20+ years.  Just to get over a
> > > > > short-term issue before you solve the problem properly.
> > > > 
> > > > I'm not convinced moving these files from debugfs to kernfs is the right
> > > > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> > > > I don't think either of those numbers is reasonable.
> > > > 
> > > > The issue as I see it is the full rebuild for every change with no way 
> > > > to
> > > > batch the changes. How about something like the below?
> > > > 
> > > > This puts the domains/* files under the sched_verbose flag. About the 
> > > > only
> > > > thing under that flag now are the detailed topology discovery printks 
> > > > anyway
> > > > so this fits together nicely.
> > > > 
> > > > This way the files would be off by default (assuming you don't boot with
> > > > sched_verbose) and can be created at runtime by enabling verbose. 
> > > > Multiple
> > > > changes could also be batched by disabling/making changes/re-enabling.
> > > > 
> > > > It does not create a new API; it uses one that is already there.
> > > 
> > > The idea seems good, the implementation might need a bit of work :)
> > 
> > More than the one comment below? Let me know.
> 
> No idea, resubmit a working patch and I'll review it properly :)
>

Will do. 


Thanks,
Phil


-- 



Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > Hi,
> > 
> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > 
> > > > Thanks Greg & Peter for your direction. 
> > > > 
> > > > While we pursue the idea of having debugfs based on kernfs, we thought 
> > > > about
> > > > having a boot time parameter which would disable creating and updating 
> > > > of the
> > > > sched_domain debugfs files and this would also be useful even when the 
> > > > kernfs
> > > > solution kicks in, as users who may not care about these debugfs files 
> > > > would
> > > > benefit from a faster CPU hotplug operation.
> > > 
> > > Ick, no, you would be adding a new user/kernel api that you will be
> > > required to support for the next 20+ years.  Just to get over a
> > > short-term issue before you solve the problem properly.
> > 
> > I'm not convinced moving these files from debugfs to kernfs is the right
> > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> > I don't think either of those numbers is reasonable.
> > 
> > The issue as I see it is the full rebuild for every change with no way to
> > batch the changes. How about something like the below?
> > 
> > This puts the domains/* files under the sched_verbose flag. About the only
> > thing under that flag now are the detailed topology discovery printks anyway
> > so this fits together nicely.
> > 
> > This way the files would be off by default (assuming you don't boot with
> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > changes could also be batched by disabling/making changes/re-enabling.
> > 
> > It does not create a new API; it uses one that is already there.
> 
> The idea seems good, the implementation might need a bit of work :)

More than the one comment below? Let me know.

> 
> > > If you really do not want these debugfs files, just disable debugfs from
> > > your system.  That should be a better short-term solution, right?
> > 
> > We do find these files useful at times for debugging issues and looking
> > at what's going on on the system.
> > 
> > > 
> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> > 
> > Same with this... it provides useful information at a small cost (modulo
> > issues like this). There are also tuning knobs that are only available
> > with SCHED_DEBUG. 
> > 
> > 
> > Cheers,
> > Phil
> > 
> > ---
> > 
> > sched/debug: Put sched/domains files under verbose flag
> > 
> > The debug files under sched/domains can take a long time to regenerate,
> > especially when updates are done one at a time. Move these files under
> > the verbose debug flag. Allow changes to verbose to trigger generation
> > of the files. This lets a user batch the updates but still have the
> > information available.  The detailed topology printk messages are also
> > under verbose.
> > 
> > Signed-off-by: Phil Auld 
> > ---
> >  kernel/sched/debug.c | 68 ++--
> >  1 file changed, 66 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 1637b65ba07a..2eb51ee3ccab 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops 
> > = {
> >  
> >  __read_mostly bool sched_debug_verbose;
> >  
> > +static ssize_t sched_verbose_write(struct file *filp, const char __user 
> > *ubuf,
> > +  size_t cnt, loff_t *ppos);
> > +
> > +static int sched_verbose_show(struct seq_file *m, void *v)
> > +{
> > +   if (sched_debug_verbose)
> > +   seq_puts(m,"Y\n");
> > +   else
> > +   seq_puts(m,"N\n");
> > +   return 0;
> > +}
> > +
> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
> > +{
> > +   return single_open(filp, sched_verbose_show, NULL);
> > +}
> > +
> > +static const struct file_operations sched_verbose_fops = {
> > +   .open   = sched_verbose_open,
> > +   .write  = sched_verbose_write,
> > +   .read   = seq_read,

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-12 Thread Phil Auld
Hi,

On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > 
> > Thanks Greg & Peter for your direction. 
> > 
> > While we pursue the idea of having debugfs based on kernfs, we thought about
> > having a boot time parameter which would disable creating and updating of 
> > the
> > sched_domain debugfs files and this would also be useful even when the 
> > kernfs
> > solution kicks in, as users who may not care about these debugfs files would
> > benefit from a faster CPU hotplug operation.
> 
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years.  Just to get over a
> short-term issue before you solve the problem properly.

I'm not convinced moving these files from debugfs to kernfs is the right
fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
I don't think either of those numbers is reasonable.

The issue as I see it is the full rebuild for every change with no way to
batch the changes. How about something like the below?

This puts the domains/* files under the sched_verbose flag. About the only
thing under that flag now are the detailed topology discovery printks anyway
so this fits together nicely.

This way the files would be off by default (assuming you don't boot with
sched_verbose) and can be created at runtime by enabling verbose. Multiple
changes could also be batched by disabling/making changes/re-enabling.

It does not create a new API; it uses one that is already there.
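To make the batching concrete: the patch below declares a write handler for
the verbose file, but its body is cut off in the archived copy. A minimal
sketch of what such a handler could look like (an illustration, not the
posted patch; it assumes the usual debugfs Y/N/1/0 bool syntax and the
update_sched_domain_debugfs() helper visible in the quoted hunks):

static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
				   size_t cnt, loff_t *ppos)
{
	bool orig = sched_debug_verbose;
	int err;

	/* Accept the usual debugfs bool spellings: Y/N/1/0. */
	err = kstrtobool_from_user(ubuf, cnt, &sched_debug_verbose);
	if (err)
		return err;

	/* Regenerate the domains/* files only when the value actually changes. */
	if (sched_debug_verbose != orig)
		update_sched_domain_debugfs();

	return cnt;
}

With something along these lines a user can write N to the verbose file
(typically /sys/kernel/debug/sched/verbose), perform a batch of hotplug
operations, and then write Y once to regenerate all of the domains/* files.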

> 
> If you really do not want these debugfs files, just disable debugfs from
> your system.  That should be a better short-term solution, right?

We do find these files useful at times for debugging issues and looking
at what's going on on the system.

> 
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Same with this... it provides useful information at a small cost (modulo
issues like this). There are also tuning knobs that are only available
with SCHED_DEBUG. 


Cheers,
Phil

---

sched/debug: Put sched/domains files under verbose flag

The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the verbose debug flag. Allow changes to verbose to trigger generation
of the files. This lets a user batch the updates but still have the
information available.  The detailed topology printk messages are also
under verbose.

Signed-off-by: Phil Auld 
---
 kernel/sched/debug.c | 68 ++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..2eb51ee3ccab 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
 
 __read_mostly bool sched_debug_verbose;
 
+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+  size_t cnt, loff_t *ppos);
+
+static int sched_verbose_show(struct seq_file *m, void *v)
+{
+   if (sched_debug_verbose)
+   seq_puts(m,"Y\n");
+   else
+   seq_puts(m,"N\n");
+   return 0;
+}
+
+static int sched_verbose_open(struct inode *inode, struct file *filp)
+{
+   return single_open(filp, sched_verbose_show, NULL);
+}
+
+static const struct file_operations sched_verbose_fops = {
+   .open   = sched_verbose_open,
+   .write  = sched_verbose_write,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
 static const struct seq_operations sched_debug_sops;
 
 static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
debugfs_sched = debugfs_create_dir("sched", NULL);
 
debugfs_create_file("features", 0644, debugfs_sched, NULL, 
&sched_feat_fops);
-   debugfs_create_bool("verbose", 0644, debugfs_sched, 
&sched_debug_verbose);
+   debugfs_create_file("verbose", 0644, debugfs_sched, NULL, 
&sched_verbose_fops);
 #ifdef CONFIG_PREEMPT_DYNAMIC
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, 
&sched_dynamic_fops);
 #endif
@@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
if (!debugfs_sched)
return;
 
+   if (!sched_debug_verbose)
+   return;
+
if (!cpumask_available(sd_sysctl_cpus)) {
if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
return;
cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
}
 
-   if (!sd_dentry)
+   if (!sd_dentry) 

Re: [PATCH v3 1/2] powerpc/vcpu: Assume dedicated processors as non-preempt

2019-12-05 Thread Phil Auld
On Thu, Dec 05, 2019 at 02:02:17PM +0530 Srikar Dronamraju wrote:
> With commit 247f2f6f3c70 ("sched/core: Don't schedule threads on pre-empted
> vCPUs"), the scheduler avoids scheduling tasks on preempted vCPUs at wakeup.
> This leads to a wrong choice of CPU, which in turn leads to larger wakeup
> latencies and, eventually, to performance regressions in latency-sensitive
> benchmarks like soltp, schbench etc.
> 
> On Powerpc, vcpu_is_preempted only looks at yield_count. If the
> yield_count is odd, the vCPU is assumed to be preempted. However
> yield_count is increased whenever LPAR enters CEDE state. So any CPU
> that has entered CEDE state is assumed to be preempted.
> 
> Even if the vCPU of a dedicated LPAR is preempted/donated, it should have
> the right of first use, since the LPAR is supposed to own the vCPU.
> 
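For reference, the pre-patch behaviour described above amounts to a parity
check on the lppaca yield_count; a simplified sketch (not the exact pseries
source) of that check:

/* Sketch: an odd yield_count means the vCPU is not currently dispatched. */
static inline bool vcpu_is_preempted(int cpu)
{
	return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
}

Since CEDE (idle) also bumps yield_count, an idle vCPU of a dedicated LPAR
looks "preempted" to this check, which is the mis-steering the patch addresses.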
> On a Power9 System with 32 cores
>  # lscpu
> Architecture:ppc64le
> Byte Order:  Little Endian
> CPU(s):  128
> On-line CPU(s) list: 0-127
> Thread(s) per core:  8
> Core(s) per socket:  1
> Socket(s):   16
> NUMA node(s):2
> Model:   2.2 (pvr 004e 0202)
> Model name:  POWER9 (architected), altivec supported
> Hypervisor vendor:   pHyp
> Virtualization type: para
> L1d cache:   32K
> L1i cache:   32K
> L2 cache:512K
> L3 cache:10240K
> NUMA node0 CPU(s):   0-63
> NUMA node1 CPU(s):   64-127
> 
>   # perf stat -a -r 5 ./schbench
> v5.4                                  v5.4 + patch
> Latency percentiles (usec)            Latency percentiles (usec)
>   50.0000th: 45                         50.0000th: 39
>   75.0000th: 62                         75.0000th: 53
>   90.0000th: 71                         90.0000th: 67
>   95.0000th: 77                         95.0000th: 76
>   *99.0000th: 91                        *99.0000th: 89
>   99.5000th: 707                        99.5000th: 93
>   99.9000th: 6920                       99.9000th: 118
>   min=0, max=10048                      min=0, max=211
> Latency percentiles (usec)            Latency percentiles (usec)
>   50.0000th: 45                         50.0000th: 34
>   75.0000th: 61                         75.0000th: 45
>   90.0000th: 72                         90.0000th: 53
>   95.0000th: 79                         95.0000th: 56
>   *99.0000th: 691                       *99.0000th: 61
>   99.5000th: 3972                       99.5000th: 63
>   99.9000th: 8368                       99.9000th: 78
>   min=0, max=16606                      min=0, max=228
> Latency percentiles (usec)            Latency percentiles (usec)
>   50.0000th: 45                         50.0000th: 34
>   75.0000th: 61                         75.0000th: 45
>   90.0000th: 71                         90.0000th: 53
>   95.0000th: 77                         95.0000th: 57
>   *99.0000th: 106                       *99.0000th: 63
>   99.5000th: 2364                       99.5000th: 68
>   99.9000th: 7480                       99.9000th: 100
>   min=0, max=10001                      min=0, max=134
> Latency percentiles (usec)            Latency percentiles (usec)
>   50.0000th: 45                         50.0000th: 34
>   75.0000th: 62                         75.0000th: 46
>   90.0000th: 72                         90.0000th: 53
>   95.0000th: 78                         95.0000th: 56
>   *99.0000th: 93                        *99.0000th: 61
>   99.5000th: 108                        99.5000th: 64
>   99.9000th: 6792                       99.9000th: 85
>   min=0, max=17681                      min=0, max=121
> Latency percentiles (usec)            Latency percentiles (usec)
>   50.0000th: 46                         50.0000th: 33
>   75.0000th: 62                         75.0000th: 44
>   90.0000th: 73                         90.0000th: 51
>   95.0000th: 79                         95.0000th: 54
>   *99.0000th: 113                       *99.0000th: 61
>   99.5000th: 2724                       99.5000th: 64
>   99.9000th: 6184                       99.9000th: 82
>   min=0, max=9887                       min=0, max=121
> 
>  Performance counter stats for 'system wide' (5 runs):
> 
> context-switches        43,373  ( +-  0.40% )      44,597  ( +-  0.55% )
> cpu-migrations           1,211  ( +-  5.04% )         220  ( +-  6.23% )
> page-faults             15,983  ( +-  5.21% )      15,360  ( +-  3.38% )
> 
> Waiman Long suggested using static_keys.
> 
> Reported-by: Parth 

Re: [PATCH v3 2/2] powerpc/shared: Use static key to detect shared processor

2019-12-05 Thread Phil Auld
On Thu, Dec 05, 2019 at 02:02:18PM +0530 Srikar Dronamraju wrote:
> With the shared_processor static key available, is_shared_processor()
> can return without having to query the lppaca structure.
> 
> Cc: Parth Shah 
> Cc: Ihor Pasichnyk 
> Cc: Juri Lelli 
> Cc: Phil Auld 
> Cc: Waiman Long 
> Signed-off-by: Srikar Dronamraju 
> ---
> Changelog v1 (https://patchwork.ozlabs.org/patch/1204192/) ->v2:
> Now that we no longer refer to the lppaca, remove the comment.
> 
> Changelog v2->v3:
> Code is now under CONFIG_PPC_SPLPAR as it depends on CONFIG_PPC_PSERIES.
> This was suggested by Waiman Long.
> 
>  arch/powerpc/include/asm/spinlock.h | 9 ++---
>  1 file changed, 2 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/spinlock.h 
> b/arch/powerpc/include/asm/spinlock.h
> index de817c25deff..e83d57f27566 100644
> --- a/arch/powerpc/include/asm/spinlock.h
> +++ b/arch/powerpc/include/asm/spinlock.h
> @@ -111,13 +111,8 @@ static inline void splpar_rw_yield(arch_rwlock_t *lock) 
> {};
>  
>  static inline bool is_shared_processor(void)
>  {
> -/*
> - * LPPACA is only available on Pseries so guard anything LPPACA related to
> - * allow other platforms (which include this common header) to compile.
> - */
> -#ifdef CONFIG_PPC_PSERIES
> - return (IS_ENABLED(CONFIG_PPC_SPLPAR) &&
> - lppaca_shared_proc(local_paca->lppaca_ptr));
> +#ifdef CONFIG_PPC_SPLPAR
> +     return static_branch_unlikely(&shared_processor);
>  #else
>   return false;
>  #endif
> -- 
> 2.18.1
> 
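For context, a sketch of the shared_processor key this change relies on, as
patch 1/2 is expected to provide it (assumed here for illustration; that
patch is not quoted in this message):

#include <linux/jump_label.h>
#include <asm/lppaca.h>

DEFINE_STATIC_KEY_FALSE(shared_processor);
EXPORT_SYMBOL_GPL(shared_processor);

/* Illustrative helper: flip the key once, early in boot, on shared LPARs. */
static void __init detect_shared_processor(void)
{
	if (lppaca_shared_proc(get_lppaca()))
		static_branch_enable(&shared_processor);
}

With the key in place, static_branch_unlikely(&shared_processor) compiles to
a patched nop/branch in the spinlock paths instead of a load from the lppaca.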

Fwiw,

Acked-by: Phil Auld 
--