Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2023-01-19 Thread Phil Auld
Hi Greg, et alia,

On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:

> > > 
> > > The idea seems good, the implementation might need a bit of work :)
> > 
> > More than the one comment below? Let me know.
> 
> No idea, resubmit a working patch and I'll review it properly :)
> 

I finally got this posted, twice :(. Sorry for the delay, I ran into
what turned out to be an unrelated issue while testing it, plus end of
the year holidays and what not. 

https://lore.kernel.org/lkml/20230119150758.880189-1-pa...@redhat.com/T/#u


Cheers,
Phil

-- 



Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Wed, Dec 14, 2022 at 10:41:25AM +1100 Michael Ellerman wrote:
> Phil Auld  writes:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> >> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> >> > Hi,
> >> > 
> >> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> >> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> >> > > > 
> >> > > > Thanks Greg & Peter for your direction. 
> >> > > > 
> >> > > > While we pursue the idea of having debugfs based on kernfs, we 
> >> > > > thought about
> >> > > > having a boot time parameter which would disable creating and 
> >> > > > updating of the
> >> > > > sched_domain debugfs files and this would also be useful even when 
> >> > > > the kernfs
> >> > > > solution kicks in, as users who may not care about these debugfs 
> >> > > > files would
> >> > > > benefit from a faster CPU hotplug operation.
> >> > > 
> >> > > Ick, no, you would be adding a new user/kernel api that you will be
> >> > > required to support for the next 20+ years.  Just to get over a
> >> > > short-term issue before you solve the problem properly.
> >> > 
> >> > I'm not convinced moving these files from debugfs to kernfs is the right
> >> > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> >> > I don't think either of those numbers is reasonable.
> >> > 
> >> > The issue as I see it is the full rebuild for every change with no way to
> >> > batch the changes. How about something like the below?
> >> > 
> >> > This puts the domains/* files under the sched_verbose flag. About the 
> >> > only
> >> > thing under that flag now are the detailed topology discovery printks 
> >> > anyway
> >> > so this fits together nicely.
> >> > 
> >> > This way the files would be off by default (assuming you don't boot with
> >> > sched_verbose) and can be created at runtime by enabling verbose. 
> >> > Multiple
> >> > changes could also be batched by disabling/making changes/re-enabling.
> >> > 
> >> > It does not create a new API, uses one that is already there.
> >> 
> >> The idea seems good, the implementation might need a bit of work :)
> >
> > More than the one comment below? Let me know.
> >
> >> 
> >> > > If you really do not want these debugfs files, just disable debugfs 
> >> > > from
> >> > > your system.  That should be a better short-term solution, right?
> >> > 
> >> > We do find these files useful at times for debugging issues and looking
> >> > at what's going on on the system.
> >> > 
> >> > > 
> >> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> >> > 
> >> > Same with this... useful information with (modulo issues like this)
> >> > small cost. There are also tuning knobs that are only available
> >> > with SCHED_DEBUG. 
> >> > 
> >> > 
> >> > Cheers,
> >> > Phil
> >> > 
> >> > ---
> >> > 
> >> > sched/debug: Put sched/domains files under verbose flag
> >> > 
> >> > The debug files under sched/domains can take a long time to regenerate,
> >> > especially when updates are done one at a time. Move these files under
> >> > the verbose debug flag. Allow changes to verbose to trigger generation
> >> > of the files. This lets a user batch the updates but still have the
> >> > information available.  The detailed topology printk messages are also
> >> > under verbose.
> >> > 
> >> > Signed-off-by: Phil Auld 
> >> > ---
> >> >  kernel/sched/debug.c | 68 ++--
> >> >  1 file changed, 66 insertions(+), 2 deletions(-)
> >> > 
> >> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> >> > index 1637b65ba07a..2eb51ee3ccab 100644
> >> > --- a/kernel/sched/debug.c
> >> > +++ b/kernel/sched/debug.c
> >> > @@ -280,6 +280,31 @@ static const struct file_operations 
> >> > sched_dynamic_fops = {
> >> >  
> >> >  __read_mostly bool sched_debug_verbose;
> >> >  
> >> > +static ssize_t sched_verbose_write(struct file *filp, const char __user 
> >> > *ubuf,
> >> > +   size_t cnt, loff_t *ppos);
> >> > +
> >> > +static int sched_verbose_show(struct seq_file *m, void *v)
> >> > +{
> >> > +if (sched_debug_verbose)
> >> > +seq_puts(m,"Y\n");
> >> > +else
> >> > +seq_puts(m,"N\n");
> >> > +return 0;
> >> > +}
> >> > +
> >> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
> >> > +{
> >> > +return single_open(filp, sched_verbose_show, NULL);
> >> > +}
> >> > +
> >> > +static const struct file_operations sched_verbose_fops = {
> >> > +.open   = sched_verbose_open,
> >> > +.write  = sched_verbose_write,
> >> > +.read   = seq_read,
> >> > +.llseek = seq_lseek,
> >> > +.release= seq_release,
> >> > +};
> >> > +
> >> >  static const struct seq_operations sched_debug_sops;
> >> >  
> >> >  static int sched_debug_open(struct inode *inode, struct file *filp)
> >> > @@ 

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Michael Ellerman
Phil Auld  writes:
> On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
>> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
>> > Hi,
>> > 
>> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
>> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
>> > > > 
>> > > > Thanks Greg & Peter for your direction. 
>> > > > 
>> > > > While we pursue the idea of having debugfs based on kernfs, we thought 
>> > > > about
>> > > > having a boot time parameter which would disable creating and updating 
>> > > > of the
>> > > > sched_domain debugfs files and this would also be useful even when the 
>> > > > kernfs
>> > > > solution kicks in, as users who may not care about these debugfs files 
>> > > > would
>> > > > benefit from a faster CPU hotplug operation.
>> > > 
>> > > Ick, no, you would be adding a new user/kernel api that you will be
>> > > required to support for the next 20+ years.  Just to get over a
>> > > short-term issue before you solve the problem properly.
>> > 
>> > I'm not convinced moving these files from debugfs to kernfs is the right
>> > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
>> > I don't think either of those numbers is reasonable.
>> > 
>> > The issue as I see it is the full rebuild for every change with no way to
>> > batch the changes. How about something like the below?
>> > 
>> > This puts the domains/* files under the sched_verbose flag. About the only
>> > thing under that flag now are the detailed topology discovery printks 
>> > anyway
>> > so this fits together nicely.
>> > 
>> > This way the files would be off by default (assuming you don't boot with
>> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
>> > changes could also be batched by disabling/making changes/re-enabling.
>> > 
>> > It does not create a new API, uses one that is already there.
>> 
>> The idea seems good, the implementation might need a bit of work :)
>
> More than the one comment below? Let me know.
>
>> 
>> > > If you really do not want these debugfs files, just disable debugfs from
>> > > your system.  That should be a better short-term solution, right?
>> > 
>> > We do find these files useful at times for debugging issues and looking
>> > at what's going on on the system.
>> > 
>> > > 
>> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
>> > 
>> > Same with this... useful information with (modulo issues like this)
>> > small cost. There are also tuning knobs that are only available
>> > with SCHED_DEBUG. 
>> > 
>> > 
>> > Cheers,
>> > Phil
>> > 
>> > ---
>> > 
>> > sched/debug: Put sched/domains files under verbose flag
>> > 
>> > The debug files under sched/domains can take a long time to regenerate,
>> > especially when updates are done one at a time. Move these files under
>> > the verbose debug flag. Allow changes to verbose to trigger generation
>> > of the files. This lets a user batch the updates but still have the
>> > information available.  The detailed topology printk messages are also
>> > under verbose.
>> > 
>> > Signed-off-by: Phil Auld 
>> > ---
>> >  kernel/sched/debug.c | 68 ++--
>> >  1 file changed, 66 insertions(+), 2 deletions(-)
>> > 
>> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
>> > index 1637b65ba07a..2eb51ee3ccab 100644
>> > --- a/kernel/sched/debug.c
>> > +++ b/kernel/sched/debug.c
>> > @@ -280,6 +280,31 @@ static const struct file_operations 
>> > sched_dynamic_fops = {
>> >  
>> >  __read_mostly bool sched_debug_verbose;
>> >  
>> > +static ssize_t sched_verbose_write(struct file *filp, const char __user 
>> > *ubuf,
>> > + size_t cnt, loff_t *ppos);
>> > +
>> > +static int sched_verbose_show(struct seq_file *m, void *v)
>> > +{
>> > +  if (sched_debug_verbose)
>> > +  seq_puts(m,"Y\n");
>> > +  else
>> > +  seq_puts(m,"N\n");
>> > +  return 0;
>> > +}
>> > +
>> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
>> > +{
>> > +  return single_open(filp, sched_verbose_show, NULL);
>> > +}
>> > +
>> > +static const struct file_operations sched_verbose_fops = {
>> > +  .open   = sched_verbose_open,
>> > +  .write  = sched_verbose_write,
>> > +  .read   = seq_read,
>> > +  .llseek = seq_lseek,
>> > +  .release= seq_release,
>> > +};
>> > +
>> >  static const struct seq_operations sched_debug_sops;
>> >  
>> >  static int sched_debug_open(struct inode *inode, struct file *filp)
>> > @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
>> >debugfs_sched = debugfs_create_dir("sched", NULL);
>> >  
>> >debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
>> > -  debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
>> > +  debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
>> >  #ifdef 

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Tue, Dec 13, 2022 at 03:31:06PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:
> > On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> > > On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > > > Hi,
> > > > 
> > > > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > > > 
> > > > > > Thanks Greg & Peter for your direction. 
> > > > > > 
> > > > > > While we pursue the idea of having debugfs based on kernfs, we 
> > > > > > thought about
> > > > > > having a boot time parameter which would disable creating and 
> > > > > > updating of the
> > > > > > sched_domain debugfs files and this would also be useful even when 
> > > > > > the kernfs
> > > > > > solution kicks in, as users who may not care about these debugfs 
> > > > > > files would
> > > > > > benefit from a faster CPU hotplug operation.
> > > > > 
> > > > > Ick, no, you would be adding a new user/kernel api that you will be
> > > > > required to support for the next 20+ years.  Just to get over a
> > > > > short-term issue before you solve the problem properly.
> > > > 
> > > > I'm not convinced moving these files from debugfs to kernfs is the right
> > > > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> > > > I don't think either of those numbers is reasonable.
> > > > 
> > > > The issue as I see it is the full rebuild for every change with no way 
> > > > to
> > > > batch the changes. How about something like the below?
> > > > 
> > > > This puts the domains/* files under the sched_verbose flag. About the 
> > > > only
> > > > thing under that flag now are the detailed topology discovery printks 
> > > > anyway
> > > > so this fits together nicely.
> > > > 
> > > > This way the files would be off by default (assuming you don't boot with
> > > > sched_verbose) and can be created at runtime by enabling verbose. 
> > > > Multiple
> > > > changes could also be batched by disabling/making changes/re-enabling.
> > > > 
> > > > It does not create a new API, uses one that is already there.
> > > 
> > > The idea seems good, the implementation might need a bit of work :)
> > 
> > More than the one comment below? Let me know.
> 
> No idea, resubmit a working patch and I'll review it properly :)
>

Will do. 


Thanks,
Phil


-- 



Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Greg Kroah-Hartman
On Tue, Dec 13, 2022 at 08:22:58AM -0500, Phil Auld wrote:
> On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> > On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > > Hi,
> > > 
> > > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > > 
> > > > > Thanks Greg & Peter for your direction. 
> > > > > 
> > > > > While we pursue the idea of having debugfs based on kernfs, we 
> > > > > thought about
> > > > > having a boot time parameter which would disable creating and 
> > > > > updating of the
> > > > > sched_domain debugfs files and this would also be useful even when 
> > > > > the kernfs
> > > > > solution kicks in, as users who may not care about these debugfs 
> > > > > files would
> > > > > benefit from a faster CPU hotplug operation.
> > > > 
> > > > Ick, no, you would be adding a new user/kernel api that you will be
> > > > required to support for the next 20+ years.  Just to get over a
> > > > short-term issue before you solve the problem properly.
> > > 
> > > I'm not convinced moving these files from debugfs to kernfs is the right
> > > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> > > I don't think either of those numbers is reasonable.
> > > 
> > > The issue as I see it is the full rebuild for every change with no way to
> > > batch the changes. How about something like the below?
> > > 
> > > This puts the domains/* files under the sched_verbose flag. About the only
> > > thing under that flag now are the detailed topology discovery printks 
> > > anyway
> > > so this fits together nicely.
> > > 
> > > This way the files would be off by default (assuming you don't boot with
> > > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > > changes could also be batched by disabling/making changes/re-enabling.
> > > 
> > > It does not create a new API, uses one that is already there.
> > 
> > The idea seems good, the implementation might need a bit of work :)
> 
> More than the one comment below? Let me know.

No idea, resubmit a working patch and I'll review it properly :)

> > > + r = kstrtobool_from_user(ubuf, cnt, &bv);
> > > + if (!r) {
> > > + mutex_lock(&sched_domains_mutex);
> > > + r = debugfs_file_get(dentry);
> > > + if (unlikely(r))
> > > + return r;
> > > + sched_debug_verbose = bv;
> > > + debugfs_file_put(dentry);
> > 
> > Why the get/put of the debugfs dentry? for just this single value?
> 
> That's what debugfs_file_write_bool() does, which is where I got that since
> that's really what this is doing. I couldn't see a good way to make this
> just call that.
> 
> I suppose the get/put may not be needed since the only way this should
> go away is under that mutex too.

Yes, it should not be needed.

> ... erm, yeah, that return is a problem ... I'll fix that.
> 
> Also, this was originally on v6.1-rc7. I can rebase when I repost but I
> didn't want to do it on a random commit so I picked (at the time) the latest
> tag.  Should I just use the head of Linux? 

Yes, or linux-next.

thanks,

greg k-h
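The write handler being reviewed here reduces to: parse a boolean from the user buffer, take sched_domains_mutex, update the flag, and regenerate the domain files. Below is a rough userspace model of that control flow, not kernel code: `parse_bool` only approximates the kernel's `kstrtobool()` (y/Y/1 and n/N/0; the real helper also accepts forms such as "on"/"off"), the mutex is elided to comments, and skipping redundant rebuilds is a choice of this sketch rather than something the thread confirms about the actual patch.

```c
#include <assert.h>
#include <stdbool.h>

static bool debug_verbose;
static int  rebuild_count;  /* counts stand-in rebuilds of the domains/* files */

/* Rough approximation of kstrtobool(): y/Y/1 -> true, n/N/0 -> false.
 * (The real kernel helper also accepts forms such as "on"/"off".) */
static int parse_bool(const char *buf, bool *res)
{
	if (!buf || !buf[0])
		return -1;
	switch (buf[0]) {
	case 'y': case 'Y': case '1': *res = true;  return 0;
	case 'n': case 'N': case '0': *res = false; return 0;
	default: return -1;
	}
}

static void update_domain_files(void)  /* stand-in for the debugfs rebuild */
{
	rebuild_count++;
}

/* Models the simplified write path: parse, take the domains mutex
 * (shown as comments here), flip the flag, rebuild only on change. */
static int verbose_write(const char *ubuf)
{
	bool bv;

	if (parse_bool(ubuf, &bv))
		return -1;
	/* mutex_lock(&sched_domains_mutex); */
	if (bv != debug_verbose) {
		debug_verbose = bv;
		update_domain_files();
	}
	/* mutex_unlock(&sched_domains_mutex); */
	return 0;
}
```

This mirrors Greg's point above: since the files are created and torn down under the same mutex, the extra debugfs_file_get()/debugfs_file_put() reference is unnecessary, and dropping it also removes the early return taken while holding the lock.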


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-13 Thread Phil Auld
On Tue, Dec 13, 2022 at 07:23:54AM +0100 Greg Kroah-Hartman wrote:
> On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> > Hi,
> > 
> > On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > > 
> > > > Thanks Greg & Peter for your direction. 
> > > > 
> > > > While we pursue the idea of having debugfs based on kernfs, we thought 
> > > > about
> > > > having a boot time parameter which would disable creating and updating 
> > > > of the
> > > > sched_domain debugfs files and this would also be useful even when the 
> > > > kernfs
> > > > solution kicks in, as users who may not care about these debugfs files 
> > > > would
> > > > benefit from a faster CPU hotplug operation.
> > > 
> > > Ick, no, you would be adding a new user/kernel api that you will be
> > > required to support for the next 20+ years.  Just to get over a
> > > short-term issue before you solve the problem properly.
> > 
> > I'm not convinced moving these files from debugfs to kernfs is the right
> > fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> > I don't think either of those numbers is reasonable.
> > 
> > The issue as I see it is the full rebuild for every change with no way to
> > batch the changes. How about something like the below?
> > 
> > This puts the domains/* files under the sched_verbose flag. About the only
> > thing under that flag now are the detailed topology discovery printks anyway
> > so this fits together nicely.
> > 
> > This way the files would be off by default (assuming you don't boot with
> > sched_verbose) and can be created at runtime by enabling verbose. Multiple
> > changes could also be batched by disabling/making changes/re-enabling.
> > 
> > It does not create a new API, uses one that is already there.
> 
> The idea seems good, the implementation might need a bit of work :)

More than the one comment below? Let me know.

> 
> > > If you really do not want these debugfs files, just disable debugfs from
> > > your system.  That should be a better short-term solution, right?
> > 
> > We do find these files useful at times for debugging issues and looking
> > at what's going on on the system.
> > 
> > > 
> > > Or better yet, disable SCHED_DEBUG, why can't you do that?
> > 
> > Same with this... useful information with (modulo issues like this)
> > small cost. There are also tuning knobs that are only available
> > with SCHED_DEBUG. 
> > 
> > 
> > Cheers,
> > Phil
> > 
> > ---
> > 
> > sched/debug: Put sched/domains files under verbose flag
> > 
> > The debug files under sched/domains can take a long time to regenerate,
> > especially when updates are done one at a time. Move these files under
> > the verbose debug flag. Allow changes to verbose to trigger generation
> > of the files. This lets a user batch the updates but still have the
> > information available.  The detailed topology printk messages are also
> > under verbose.
> > 
> > Signed-off-by: Phil Auld 
> > ---
> >  kernel/sched/debug.c | 68 ++--
> >  1 file changed, 66 insertions(+), 2 deletions(-)
> > 
> > diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> > index 1637b65ba07a..2eb51ee3ccab 100644
> > --- a/kernel/sched/debug.c
> > +++ b/kernel/sched/debug.c
> > @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops 
> > = {
> >  
> >  __read_mostly bool sched_debug_verbose;
> >  
> > +static ssize_t sched_verbose_write(struct file *filp, const char __user 
> > *ubuf,
> > +  size_t cnt, loff_t *ppos);
> > +
> > +static int sched_verbose_show(struct seq_file *m, void *v)
> > +{
> > +   if (sched_debug_verbose)
> > +   seq_puts(m,"Y\n");
> > +   else
> > +   seq_puts(m,"N\n");
> > +   return 0;
> > +}
> > +
> > +static int sched_verbose_open(struct inode *inode, struct file *filp)
> > +{
> > +   return single_open(filp, sched_verbose_show, NULL);
> > +}
> > +
> > +static const struct file_operations sched_verbose_fops = {
> > +   .open   = sched_verbose_open,
> > +   .write  = sched_verbose_write,
> > +   .read   = seq_read,
> > +   .llseek = seq_lseek,
> > +   .release= seq_release,
> > +};
> > +
> >  static const struct seq_operations sched_debug_sops;
> >  
> >  static int sched_debug_open(struct inode *inode, struct file *filp)
> > @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
> > debugfs_sched = debugfs_create_dir("sched", NULL);
> >  
> > debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> > -   debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> > +   debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
> >  #ifdef CONFIG_PREEMPT_DYNAMIC
> > debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
> >  #endif
> > @@ -402,15 +427,23 

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-12 Thread Greg Kroah-Hartman
On Mon, Dec 12, 2022 at 02:17:58PM -0500, Phil Auld wrote:
> Hi,
> 
> On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > > 
> > > Thanks Greg & Peter for your direction. 
> > > 
> > > While we pursue the idea of having debugfs based on kernfs, we thought 
> > > about
> > > having a boot time parameter which would disable creating and updating of 
> > > the
> > > sched_domain debugfs files and this would also be useful even when the 
> > > kernfs
> > > solution kicks in, as users who may not care about these debugfs files 
> > > would
> > > benefit from a faster CPU hotplug operation.
> > 
> > Ick, no, you would be adding a new user/kernel api that you will be
> > required to support for the next 20+ years.  Just to get over a
> > short-term issue before you solve the problem properly.
> 
> I'm not convinced moving these files from debugfs to kernfs is the right
> fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
> I don't think either of those numbers is reasonable.
> 
> The issue as I see it is the full rebuild for every change with no way to
> batch the changes. How about something like the below?
> 
> This puts the domains/* files under the sched_verbose flag. About the only
> thing under that flag now are the detailed topology discovery printks anyway
> so this fits together nicely.
> 
> This way the files would be off by default (assuming you don't boot with
> sched_verbose) and can be created at runtime by enabling verbose. Multiple
> changes could also be batched by disabling/making changes/re-enabling.
> 
> It does not create a new API, uses one that is already there.

The idea seems good, the implementation might need a bit of work :)

> > If you really do not want these debugfs files, just disable debugfs from
> > your system.  That should be a better short-term solution, right?
> 
> We do find these files useful at times for debugging issues and looking
> at what's going on on the system.
> 
> > 
> > Or better yet, disable SCHED_DEBUG, why can't you do that?
> 
> Same with this... useful information with (modulo issues like this)
> small cost. There are also tuning knobs that are only available
> with SCHED_DEBUG. 
> 
> 
> Cheers,
> Phil
> 
> ---
> 
> sched/debug: Put sched/domains files under verbose flag
> 
> The debug files under sched/domains can take a long time to regenerate,
> especially when updates are done one at a time. Move these files under
> the verbose debug flag. Allow changes to verbose to trigger generation
> of the files. This lets a user batch the updates but still have the
> information available.  The detailed topology printk messages are also
> under verbose.
> 
> Signed-off-by: Phil Auld 
> ---
>  kernel/sched/debug.c | 68 ++--
>  1 file changed, 66 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
> index 1637b65ba07a..2eb51ee3ccab 100644
> --- a/kernel/sched/debug.c
> +++ b/kernel/sched/debug.c
> @@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = 
> {
>  
>  __read_mostly bool sched_debug_verbose;
>  
> +static ssize_t sched_verbose_write(struct file *filp, const char __user 
> *ubuf,
> +size_t cnt, loff_t *ppos);
> +
> +static int sched_verbose_show(struct seq_file *m, void *v)
> +{
> + if (sched_debug_verbose)
> + seq_puts(m,"Y\n");
> + else
> + seq_puts(m,"N\n");
> + return 0;
> +}
> +
> +static int sched_verbose_open(struct inode *inode, struct file *filp)
> +{
> + return single_open(filp, sched_verbose_show, NULL);
> +}
> +
> +static const struct file_operations sched_verbose_fops = {
> + .open   = sched_verbose_open,
> + .write  = sched_verbose_write,
> + .read   = seq_read,
> + .llseek = seq_lseek,
> + .release= seq_release,
> +};
> +
>  static const struct seq_operations sched_debug_sops;
>  
>  static int sched_debug_open(struct inode *inode, struct file *filp)
> @@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
>   debugfs_sched = debugfs_create_dir("sched", NULL);
>  
> >   debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
> > - debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
> > + debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
> >  #ifdef CONFIG_PREEMPT_DYNAMIC
> >   debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
>  #endif
> @@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
>   if (!debugfs_sched)
>   return;
>  
> + if (!sched_debug_verbose)
> + return;
> +
>   if (!cpumask_available(sd_sysctl_cpus)) {
> >   if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
>   return;
>   

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-12-12 Thread Phil Auld
Hi,

On Tue, Nov 08, 2022 at 01:24:39PM +0100 Greg Kroah-Hartman wrote:
> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> > 
> > Thanks Greg & Peter for your direction. 
> > 
> > While we pursue the idea of having debugfs based on kernfs, we thought about
> > having a boot time parameter which would disable creating and updating of 
> > the
> > sched_domain debugfs files and this would also be useful even when the 
> > kernfs
> > solution kicks in, as users who may not care about these debugfs files would
> > benefit from a faster CPU hotplug operation.
> 
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years.  Just to get over a
> short-term issue before you solve the problem properly.

I'm not convinced moving these files from debugfs to kernfs is the right
fix.  That will take it from ~50 back to ~20 _minutes_ on these systems.
I don't think either of those numbers is reasonable.

The issue as I see it is the full rebuild for every change with no way to
batch the changes. How about something like the below?

This puts the domains/* files under the sched_verbose flag. About the only
thing under that flag now are the detailed topology discovery printks anyway
so this fits together nicely.

This way the files would be off by default (assuming you don't boot with
sched_verbose) and can be created at runtime by enabling verbose. Multiple
changes could also be batched by disabling/making changes/re-enabling.

It does not create a new API, uses one that is already there.

> 
> If you really do not want these debugfs files, just disable debugfs from
> your system.  That should be a better short-term solution, right?

We do find these files useful at times for debugging issues and looking
at what's going on on the system.

> 
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Same with this... useful information with (modulo issues like this)
small cost. There are also tuning knobs that are only available
with SCHED_DEBUG. 


Cheers,
Phil

---

sched/debug: Put sched/domains files under verbose flag

The debug files under sched/domains can take a long time to regenerate,
especially when updates are done one at a time. Move these files under
the verbose debug flag. Allow changes to verbose to trigger generation
of the files. This lets a user batch the updates but still have the
information available.  The detailed topology printk messages are also
under verbose.

Signed-off-by: Phil Auld 
---
 kernel/sched/debug.c | 68 ++--
 1 file changed, 66 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 1637b65ba07a..2eb51ee3ccab 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -280,6 +280,31 @@ static const struct file_operations sched_dynamic_fops = {
 
 __read_mostly bool sched_debug_verbose;
 
+static ssize_t sched_verbose_write(struct file *filp, const char __user *ubuf,
+  size_t cnt, loff_t *ppos);
+
+static int sched_verbose_show(struct seq_file *m, void *v)
+{
+   if (sched_debug_verbose)
+   seq_puts(m,"Y\n");
+   else
+   seq_puts(m,"N\n");
+   return 0;
+}
+
+static int sched_verbose_open(struct inode *inode, struct file *filp)
+{
+   return single_open(filp, sched_verbose_show, NULL);
+}
+
+static const struct file_operations sched_verbose_fops = {
+   .open   = sched_verbose_open,
+   .write  = sched_verbose_write,
+   .read   = seq_read,
+   .llseek = seq_lseek,
+   .release= seq_release,
+};
+
 static const struct seq_operations sched_debug_sops;
 
 static int sched_debug_open(struct inode *inode, struct file *filp)
@@ -303,7 +328,7 @@ static __init int sched_init_debug(void)
debugfs_sched = debugfs_create_dir("sched", NULL);
 
debugfs_create_file("features", 0644, debugfs_sched, NULL, &sched_feat_fops);
-   debugfs_create_bool("verbose", 0644, debugfs_sched, &sched_debug_verbose);
+   debugfs_create_file("verbose", 0644, debugfs_sched, NULL, &sched_verbose_fops);
 #ifdef CONFIG_PREEMPT_DYNAMIC
debugfs_create_file("preempt", 0644, debugfs_sched, NULL, &sched_dynamic_fops);
 #endif
@@ -402,15 +427,23 @@ void update_sched_domain_debugfs(void)
if (!debugfs_sched)
return;
 
+   if (!sched_debug_verbose)
+   return;
+
if (!cpumask_available(sd_sysctl_cpus)) {
if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
return;
cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
}
 
-   if (!sd_dentry)
+   if (!sd_dentry) {
sd_dentry = debugfs_create_dir("domains", debugfs_sched);
 
+   /* rebuild sd_sysctl_cpus if empty since it gets cleared below */
+   if (cpumask_first(sd_sysctl_cpus) >= nr_cpu_ids)
+   
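The batching win described in this message (regenerate the files once after many hotplug events, instead of once per event) can be sketched with a toy userspace model. This is hypothetical illustration code, not the kernel implementation: the CPU count, function names, and cost accounting are all invented for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 16

static bool verbose_enabled = true;
static long files_regenerated;  /* proxy for per-CPU debugfs file churn */

/* Stand-in for update_sched_domain_debugfs(): with verbose off it
 * returns immediately (the early return the patch adds); otherwise it
 * regenerates the domains/* files for every remaining online CPU. */
static void update_domain_debugfs(int online_cpus)
{
	if (!verbose_enabled)
		return;
	files_regenerated += online_cpus;
}

/* Offline CPUs one at a time, as e.g. "ppc64_cpu --smt=1" effectively
 * does; each individual offline triggers a full rebuild. */
static void offline_down_to(int from, int to)
{
	for (int online = from; online > to; online--)
		update_domain_debugfs(online - 1);
}
```

In this model, taking 16 CPUs down to 1 with verbose on costs 15 + 14 + ... + 1 = 120 per-CPU regenerations; toggling verbose off first, offlining, then re-enabling leaves only one final rebuild, which is why the cost grows so quickly on the ~2000-CPU systems discussed below.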

Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-11-08 Thread Greg Kroah-Hartman
On Tue, Nov 08, 2022 at 08:21:00PM +0530, Srikar Dronamraju wrote:
> * Greg Kroah-Hartman  [2022-11-08 13:24:39]:
> 
> > On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> 
> Hi Greg, 
> 
> > > 
> > > Thanks Greg & Peter for your direction. 
> > > 
> > > While we pursue the idea of having debugfs based on kernfs, we thought 
> > > about
> > > having a boot time parameter which would disable creating and updating of 
> > > the
> > > sched_domain debugfs files and this would also be useful even when the 
> > > kernfs
> > > solution kicks in, as users who may not care about these debugfs files 
> > > would
> > > benefit from a faster CPU hotplug operation.
> > 
> > Ick, no, you would be adding a new user/kernel api that you will be
> > required to support for the next 20+ years.  Just to get over a
> > short-term issue before you solve the problem properly.
> > 
> > If you really do not want these debugfs files, just disable debugfs from
> > your system.  That should be a better short-term solution, right?
> > 
> > Or better yet, disable SCHED_DEBUG, why can't you do that?
> 
> Thanks a lot for your quick inputs.
> 
> CONFIG_SCHED_DEBUG disables a lot more than just the updating of debugfs
> files. Information like /sys/kernel/debug/sched/debug and system-wide and
> per-process information would be lost when that config is disabled.
> 
> Most users would still be using distribution kernels and most distribution
> kernels that I know of seem to have CONFIG_SCHED_DEBUG enabled.

Then work with the distros to remove that option if it doesn't do well
on very large systems.

Odds are they really do not want that enabled either, but that's not our
issue, that's theirs :)

> In a large system, let's say close to 2000 CPUs, we may be offlining around
> 1750 CPUs, for example with ppc64_cpu --smt=1 on powerpc. Even if we move to
> a lower-overhead kernfs-based implementation, we would still be creating and
> deleting files for every CPU offline. Most users may not even be aware of
> these files. However, for a few users who may use these files once in a
> while, we end up creating and deleting these files for all users. The
> overhead grows steeply with the number of CPUs, and I would assume the
> maximum number of CPUs will only increase in the future.

I understand the issue, you don't have to explain it again.  The
scheduler developers like to see these files, and for them it's useful.
Perhaps for distros that is not a useful thing to have around, that's
up to them.

> Hence our approach was to reduce the overhead for those users who are sure
> they don't depend on these files. We still keep the creating of the files as
> the default approach so that others who depend on it are not going to be
> impacted.

No, you are adding a new user/kernel api to the kernel that you then
have to support for the next 20+ years because you haven't fixed the
real issue here.

I think you could have done the kernfs conversion already, it shouldn't
be that complex, right?

Note, when you do it, you might want to move away from returning a raw
dentry from debugfs calls, and instead use an opaque type "debugfs_file"
or something like that, instead, which might make this easier over time.

thanks,

greg k-h


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-11-08 Thread Srikar Dronamraju
* Greg Kroah-Hartman  [2022-11-08 13:24:39]:

> On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:

Hi Greg, 

> > 
> > Thanks Greg & Peter for your direction. 
> > 
> > While we pursue the idea of having debugfs based on kernfs, we thought
> > about having a boot time parameter which would disable creating and
> > updating of the sched_domain debugfs files and this would also be useful
> > even when the kernfs solution kicks in, as users who may not care about
> > these debugfs files would benefit from a faster CPU hotplug operation.
> 
> Ick, no, you would be adding a new user/kernel api that you will be
> required to support for the next 20+ years.  Just to get over a
> short-term issue before you solve the problem properly.
> 
> If you really do not want these debugfs files, just disable debugfs from
> your system.  That should be a better short-term solution, right?
> 
> Or better yet, disable SCHED_DEBUG, why can't you do that?

Thanks a lot for your quick inputs.

CONFIG_SCHED_DEBUG disables a lot more than just the updating of debugfs
files. Information like /sys/kernel/debug/sched/debug and system-wide and
per-process information would be lost when that config is disabled.

Most users would still be using distribution kernels and most distribution
kernels that I know of seem to have CONFIG_SCHED_DEBUG enabled.

In a large system, let's say close to 2000 CPUs, we may be offlining around
1750 CPUs, for example with ppc64_cpu --smt=1 on powerpc. Even if we move to a
lower-overhead kernfs-based implementation, we would still be creating and
deleting files for every CPU offline. Most users may not even be aware of
these files. However, for a few users who may use these files once in a
while, we end up creating and deleting these files for all users. The
overhead grows steeply with the number of CPUs, and I would assume the
maximum number of CPUs will only increase in the future.

Hence our approach was to reduce the overhead for those users who are sure
they don't depend on these files. We still keep the creating of the files as
the default approach so that others who depend on it are not going to be
impacted.

> 
> thanks,
> 
> greg k-h

-- 
Thanks and Regards
Srikar Dronamraju


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-11-08 Thread Greg Kroah-Hartman
On Tue, Nov 08, 2022 at 03:30:46PM +0530, Vishal Chourasia wrote:
> 
> Thanks Greg & Peter for your direction. 
> 
> While we pursue the idea of having debugfs based on kernfs, we thought about
> having a boot time parameter which would disable creating and updating of the
> sched_domain debugfs files and this would also be useful even when the kernfs
> solution kicks in, as users who may not care about these debugfs files would
> benefit from a faster CPU hotplug operation.

Ick, no, you would be adding a new user/kernel api that you will be
required to support for the next 20+ years.  Just to get over a
short-term issue before you solve the problem properly.

If you really do not want these debugfs files, just disable debugfs from
your system.  That should be a better short-term solution, right?

Or better yet, disable SCHED_DEBUG, why can't you do that?

thanks,

greg k-h


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-11-08 Thread Vishal Chourasia

Thanks Greg & Peter for your direction. 

While we pursue the idea of having debugfs based on kernfs, we thought about
having a boot time parameter which would disable creating and updating of the
sched_domain debugfs files and this would also be useful even when the kernfs
solution kicks in, as users who may not care about these debugfs files would
benefit from a faster CPU hotplug operation.

However, these sched_domain debugfs files are created by default.

-- vishal.c

-->8-8<--

From f66f66ee05a9f719b58822d13e501d65391dd9d3 Mon Sep 17 00:00:00 2001
From: Vishal Chourasia 
Date: Tue, 8 Nov 2022 14:21:15 +0530
Subject: [PATCH] Add kernel parameter to disable creation of sched_domain
 files

On large systems, creation of the sched_domain debug files takes an
unusually long time. In that case, sched_sd_export can be passed as a
kernel command line parameter at boot time to prevent the kernel from
creating the sched_domain files.

This commit adds a kernel command line parameter, sched_sd_export, which
can be used to optionally disable the creation of sched_domain debug files.
---
 kernel/sched/debug.c    |  9 ++++++---
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 11 ++++++++++-
 3 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index bb3d63bdf4ae..bd307847b76a 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -279,6 +279,7 @@ static const struct file_operations sched_dynamic_fops = {
 #endif /* CONFIG_PREEMPT_DYNAMIC */
 
 __read_mostly bool sched_debug_verbose;
+__read_mostly int sched_debug_export = 1;
 
 static const struct seq_operations sched_debug_sops;
 
@@ -321,9 +322,11 @@ static __init int sched_init_debug(void)
	debugfs_create_u32("migration_cost_ns", 0644, debugfs_sched, &sysctl_sched_migration_cost);
	debugfs_create_u32("nr_migrate", 0644, debugfs_sched, &sysctl_sched_nr_migrate);
 
-   mutex_lock(&sched_domains_mutex);
-   update_sched_domain_debugfs();
-   mutex_unlock(&sched_domains_mutex);
+   if (likely(sched_debug_export)) {
+   mutex_lock(&sched_domains_mutex);
+   update_sched_domain_debugfs();
+   mutex_unlock(&sched_domains_mutex);
+   }
 #endif
 
 #ifdef CONFIG_NUMA_BALANCING
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e26688d387ae..a4d06588d876 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2738,6 +2738,7 @@ extern struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq);
 
 #ifdef CONFIG_SCHED_DEBUG
 extern bool sched_debug_verbose;
+extern int sched_debug_export;
 
 extern void print_cfs_stats(struct seq_file *m, int cpu);
 extern void print_rt_stats(struct seq_file *m, int cpu);
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8739c2a5a54e..7bcdbc2f856d 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -19,6 +19,13 @@ static int __init sched_debug_setup(char *str)
 }
 early_param("sched_verbose", sched_debug_setup);
 
+static int __init sched_debug_disable_export(char *str)
+{
+   sched_debug_export = 0;
+   return 0;
+}
+early_param("sched_sd_export", sched_debug_disable_export);
+
 static inline bool sched_debug(void)
 {
return sched_debug_verbose;
@@ -152,6 +159,7 @@ static void sched_domain_debug(struct sched_domain *sd, int cpu)
 #else /* !CONFIG_SCHED_DEBUG */
 
 # define sched_debug_verbose 0
+# define sched_debug_export 1
 # define sched_domain_debug(sd, cpu) do { } while (0)
 static inline bool sched_debug(void)
 {
@@ -2632,7 +2640,8 @@ void partition_sched_domains_locked(int ndoms_new, cpumask_var_t doms_new[],
dattr_cur = dattr_new;
ndoms_cur = ndoms_new;
 
-   update_sched_domain_debugfs();
+   if (likely(sched_debug_export))
+   update_sched_domain_debugfs();
 }
 
 /*

base-commit: 7e18e42e4b280c85b76967a9106a13ca61c16179
-- 
2.31.1

  




Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-10-26 Thread Vishal Chourasia
On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:

> Why do you need to?  What tools require these debugfs files to be
> present?

We are not entirely sure what applications (if any) might be using this 
interface.

> And if you only have 7-8 files per CPU, that does not seem like a lot of
> files overall (14000-16000)?  If you only offline 1 cpu, how is removing
> 7 or 8 files a bottleneck?  Do you really offline 1999 cpus for a 2k
> system?

It's 7-8 files per domain per CPU, so in a system with approximately 2k CPUs
and five domains, the total file count goes above 70k-80k files. And when we
offline 1 CPU, the entire directory is rebuilt, resulting in the creation of
all the files again.

Thanks

-- vishal.c 





Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-10-26 Thread Peter Zijlstra
On Wed, Oct 26, 2022 at 09:02:28AM +0200, Greg Kroah-Hartman wrote:
> On Wed, Oct 26, 2022 at 12:07:01PM +0530, Vishal Chourasia wrote:
> > On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:
> > 
> > > Why do you need to?  What tools require these debugfs files to be
> > > present?
> > 
> > We are not entirely sure what applications (if any) might be using this 
> > interface.
> 
> Then just disable it and see what happens :)

It's mostly a debug interface for developers. A lot of people complained
when I moved things to debugfs, and I told them their program was broken
for a SCHED_DEBUG=n build anyway, but nobody complained about
this particular thing IIRC.

It's mostly affected by things like hotplug and cpusets, you can
discover the resulting topology by looking at these files.

Also; while we generally try and keep SCHED_DEBUG impact low, it is
still measurable; there are a number of people that run SCHED_DEBUG=n
kernels for the extra little gain.

> > > And if you only have 7-8 files per CPU, that does not seem like a lot of
> > > files overall (14000-16000)?  If you only offline 1 cpu, how is removing
> > > 7 or 8 files a bottleneck?  Do you really offline 1999 cpus for a 2k
> > > system?
> > 
> > It's 7-8 files per domain per cpu, so, in a system with approx 2k cpus
> > and five domains, the total file count goes above 70k-80k files. And,
> > when we offline 1 CPU, the entire directory is rebuilt, resulting in
> > creation of all the files again.
> 
> Perhaps change the logic to not rebuild the whole thing and instead just
> remove the required files?

Unplugging a single cpu can change the topology and the other cpus might
need to be updated too.

Simplest example would be the SMT case: if you reduce from SMT>1 to SMT1,
the SMT domain goes away (because a single CPU domain is as pointless as
it sounds) and that affects the CPU that remains.

Tracking all that is a pain. Simply rebuilding the whole thing is by
*far* the simplest option. And given this all is debug code, simple is
good.

> Or as I mentioned before, you can move debugfs to use kernfs, which
> should resolve most of these issues automatically.  Why not take the
> time to do that which will solve the problem no matter what gets added
> in the future in other subsystems?

This sounds like a good approach.


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-10-26 Thread Greg Kroah-Hartman
On Wed, Oct 26, 2022 at 12:07:01PM +0530, Vishal Chourasia wrote:
> On Tue, Oct 18, 2022 at 01:04:40PM +0200, Greg Kroah-Hartman wrote:
> 
> > Why do you need to?  What tools require these debugfs files to be
> > present?
> 
> We are not entirely sure what applications (if any) might be using this 
> interface.

Then just disable it and see what happens :)

> > And if you only have 7-8 files per CPU, that does not seem like a lot of
> > files overall (14000-16000)?  If you only offline 1 cpu, how is removing
> > 7 or 8 files a bottleneck?  Do you really offline 1999 cpus for a 2k
> > system?
> 
> It's 7-8 files per domain per cpu, so, in a system with approx 2k cpus and
> five domains, the total file count goes above 70k-80k files. And, when we
> offline 1 CPU, the entire directory is rebuilt, resulting in creation of
> all the files again.

Perhaps change the logic to not rebuild the whole thing and instead just
remove the required files?

Or as I mentioned before, you can move debugfs to use kernfs, which
should resolve most of these issues automatically.  Why not take the
time to do that which will solve the problem no matter what gets added
in the future in other subsystems?

thanks,

greg k-h


Re: sched/debug: CPU hotplug operation suffers in a large cpu systems

2022-10-18 Thread Greg Kroah-Hartman
On Tue, Oct 18, 2022 at 04:07:06PM +0530, Vishal Chourasia wrote:
> On Mon, Oct 17, 2022 at 04:54:11PM +0200, Greg Kroah-Hartman wrote:
> > On Mon, Oct 17, 2022 at 04:19:31PM +0200, Peter Zijlstra wrote:
> > > 
> > > +GregKH who actually knows about debugfs.
> > > 
> > > On Mon, Oct 17, 2022 at 06:40:49PM +0530, Vishal Chourasia wrote:
> > > > smt=off operation on a system with 1920 CPUs is taking approx 59 mins
> > > > on v5.14 versus 29 mins on v5.11, measured using:
> > > > # time ppc64_cpu --smt=off
> > > > 
> > > > 
> > > > |--------------------------------+--------------+--------------|
> > > > | method                         | sysctl       | debugfs      |
> > > > |--------------------------------+--------------+--------------|
> > > > | unregister_sysctl_table        | 0.020050 s   | NA           |
> > > > | build_sched_domains            | 3.090563 s   | 3.119130 s   |
> > > > | register_sched_domain_sysctl   | 0.065487 s   | NA           |
> > > > | update_sched_domain_debugfs    | NA           | 2.791232 s   |
> > > > | partition_sched_domains_locked | 3.195958 s   | 5.933254 s   |
> > > > |--------------------------------+--------------+--------------|
> > > > 
> > > > Note: partition_sched_domains_locked internally calls build_sched_domains
> > > >   and calls other functions depending on what's currently being used to
> > > >   export information, i.e. sysctl or debugfs
> > > > 
> > > > The above numbers are quoted from the case where we tried offlining
> > > > 1 CPU in a system with 1920 online CPUs.
> > > > 
> > > > From the above table, register_sched_domain_sysctl and
> > > > unregister_sysctl_table collectively took ~0.085 secs, whereas
> > > > update_sched_domain_debugfs took ~2.79 secs. 
> > > > 
> > > > Root cause:
> > > > 
> > > > The observed regression stems from the way these two pseudo-filesystems 
> > > > handle
> > > > creation and deletion of files and directories internally.  
> > 
> > Yes, debugfs is not optimized for speed or memory usage at all.  This
> > happens to be the first code path I have seen that cares about this for
> > debugfs files.
> > 
> > You can either work on not creating so many debugfs files (do you really
> > really need all of them all the time?)  Or you can work on moving
> > debugfs to use kernfs as the backend logic, which will save you both
> > speed and memory usage overall as kernfs is used to being used on
> > semi-fast paths.
> > 
> > Maybe do both?
> > 
> > hope this helps,
> > 
> > greg k-h
> 
> Yes, we need to create 7-8 files per domain per CPU, eventually ending up
> creating a lot of files. 

Why do you need to?  What tools require these debugfs files to be
present?

And if you only have 7-8 files per CPU, that does not seem like a lot of
files overall (14000-16000)?  If you only offline 1 cpu, how is removing
7 or 8 files a bottleneck?  Do you really offline 1999 cpus for a 2k
system?

> Is there a possibility of reverting back to /proc/sys/kernel/sched_domain/?

No, these are debugging-only things, they do not belong in /proc/

If you rely on them for real functionality, that's a different story,
but I want to know what tool uses them and for what functionality as
debugfs should never be relied on for normal operation of a system.

thanks,

greg k-h