Re: [PATCH] sched/rt: Clean up usage of rt_task()

2024-05-15 Thread Phil Auld
On Wed, May 15, 2024 at 01:06:13PM +0100 Qais Yousef wrote:
> On 05/15/24 07:20, Phil Auld wrote:
> > On Wed, May 15, 2024 at 10:32:38AM +0200 Peter Zijlstra wrote:
> > > On Tue, May 14, 2024 at 07:58:51PM -0400, Phil Auld wrote:
> > > > 
> > > > Hi Qais,
> > > > 
> > > > On Wed, May 15, 2024 at 12:41:12AM +0100 Qais Yousef wrote:
> > > > > rt_task() checks if a task has RT priority. But depends on your
> > > > > dictionary, this could mean it belongs to RT class, or is a 'realtime'
> > > > > task, which includes RT and DL classes.
> > > > > 
> > > > > Since this has caused some confusion already on discussion [1], it
> > > > > seemed a clean up is due.
> > > > > 
> > > > > I define the usage of rt_task() to be tasks that belong to RT class.
> > > > > Make sure that it returns true only for RT class and audit the users and
> > > > > replace them with the new realtime_task() which returns true for RT and
> > > > > DL classes - the old behavior. Introduce similar realtime_prio() to
> > > > > create similar distinction to rt_prio() and update the users.
> > > > 
> > > > I think making the difference clear is good. However, I think rt_task() is
> > > > a better name. We have dl_task() still.  And rt tasks are things managed
> > > > by rt.c, basically. Not realtime.c :)  I know that doesn't work for deadline.c
> > > > and dl_ but this change would be the reverse of that pattern.
> > > 
> > > It's going to be a mess either way around, but I think rt_task() and
> > > dl_task() being distinct is more sensible than the current overlap.
> > >
> > 
> > Yes, indeed.
> > 
> > My point was just to call it rt_task() still. 
> 
> It is called rt_task() still. I just added a new realtime_task() to return true
> for RT and DL. rt_task() will return true only for RT now.
> 
> How do you see this should be done instead? I'm not seeing the problem.
>

Right, sorry. I misread your commit message completely and then all the
places where you changed rt_task() to realtime_task() fit my misreading.

rt_task() means rt class and realtime_task does what rt_task() used to do.
That's how I would do it, too :)
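
For anyone skimming the thread, a condensed sketch of the split being agreed on
here, derived only from the helpers in the quoted patch (illustrative names;
these are prio-based checks, so PI-boosted tasks count, as the patch comments
note):

/* Illustrative sketch, not the patch itself: prio < MAX_DL_PRIO (0) is DL,
 * [MAX_DL_PRIO, MAX_RT_PRIO) is RT, and "realtime" covers both. */
static inline int dl_task_sketch(struct task_struct *p)
{
	return p->prio < MAX_DL_PRIO;				/* DL class only */
}

static inline int rt_task_sketch(struct task_struct *p)
{
	return p->prio >= MAX_DL_PRIO && p->prio < MAX_RT_PRIO;	/* RT class only */
}

static inline int realtime_task_sketch(struct task_struct *p)
{
	return p->prio < MAX_RT_PRIO;		/* RT or DL: the old rt_task() behavior */
}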



Reviewed-by: Phil Auld 

Cheers,
Phil

-- 




Re: [PATCH] sched/rt: Clean up usage of rt_task()

2024-05-15 Thread Phil Auld
On Wed, May 15, 2024 at 10:32:38AM +0200 Peter Zijlstra wrote:
> On Tue, May 14, 2024 at 07:58:51PM -0400, Phil Auld wrote:
> > 
> > Hi Qais,
> > 
> > On Wed, May 15, 2024 at 12:41:12AM +0100 Qais Yousef wrote:
> > > rt_task() checks if a task has RT priority. But depends on your
> > > dictionary, this could mean it belongs to RT class, or is a 'realtime'
> > > task, which includes RT and DL classes.
> > > 
> > > Since this has caused some confusion already on discussion [1], it
> > > seemed a clean up is due.
> > > 
> > > I define the usage of rt_task() to be tasks that belong to RT class.
> > > Make sure that it returns true only for RT class and audit the users and
> > > replace them with the new realtime_task() which returns true for RT and
> > > DL classes - the old behavior. Introduce similar realtime_prio() to
> > > create similar distinction to rt_prio() and update the users.
> > 
> > I think making the difference clear is good. However, I think rt_task() is
> > a better name. We have dl_task() still.  And rt tasks are things managed
> > by rt.c, basically. Not realtime.c :)  I know that doesn't work for deadline.c
> > and dl_ but this change would be the reverse of that pattern.
> 
> It's going to be a mess either way around, but I think rt_task() and
> dl_task() being distinct is more sensible than the current overlap.
>

Yes, indeed.

My point was just to call it rt_task() still. 


Cheers,
Phil

> > > Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> > > 
> > > Document the functions to make it more obvious what is the difference
> > > between them. PI-boosted tasks is a factor that must be taken into
> > > account when choosing which function to use.
> > > 
> > > Rename task_is_realtime() to task_has_realtime_policy() as the old name
> > > is confusing against the new realtime_task().
> 
> realtime_task_policy() perhaps?
> 

-- 




Re: [PATCH] sched/rt: Clean up usage of rt_task()

2024-05-14 Thread Phil Auld


Hi Qais,

On Wed, May 15, 2024 at 12:41:12AM +0100 Qais Yousef wrote:
> rt_task() checks if a task has RT priority. But depends on your
> dictionary, this could mean it belongs to RT class, or is a 'realtime'
> task, which includes RT and DL classes.
> 
> Since this has caused some confusion already on discussion [1], it
> seemed a clean up is due.
> 
> I define the usage of rt_task() to be tasks that belong to RT class.
> Make sure that it returns true only for RT class and audit the users and
> replace them with the new realtime_task() which returns true for RT and
> DL classes - the old behavior. Introduce similar realtime_prio() to
> create similar distinction to rt_prio() and update the users.

I think making the difference clear is good. However, I think rt_task() is
a better name. We have dl_task() still.  And rt tasks are things managed
by rt.c, basically. Not realtime.c :)  I know that doesn't work for deadline.c
and dl_ but this change would be the reverse of that pattern.

> 
> Move MAX_DL_PRIO to prio.h so it can be used in the new definitions.
> 
> Document the functions to make it more obvious what is the difference
> between them. PI-boosted tasks is a factor that must be taken into
> account when choosing which function to use.
> 
> Rename task_is_realtime() to task_has_realtime_policy() as the old name
> is confusing against the new realtime_task().

Keeping it rt_task() above could mean this stays as it was but this
change makes sense as you have written it too. 



Cheers,
Phil

> 
> No functional changes were intended.
> 
> [1] 
> https://lore.kernel.org/lkml/20240506100509.gl40...@noisy.programming.kicks-ass.net/
> 
> Signed-off-by: Qais Yousef 
> ---
>  fs/select.c   |  2 +-
>  include/linux/ioprio.h|  2 +-
>  include/linux/sched/deadline.h|  6 --
>  include/linux/sched/prio.h|  1 +
>  include/linux/sched/rt.h  | 27 ++-
>  kernel/locking/rtmutex.c  |  4 ++--
>  kernel/locking/rwsem.c|  4 ++--
>  kernel/locking/ww_mutex.h |  2 +-
>  kernel/sched/core.c   |  6 +++---
>  kernel/time/hrtimer.c |  6 +++---
>  kernel/trace/trace_sched_wakeup.c |  2 +-
>  mm/page-writeback.c   |  4 ++--
>  mm/page_alloc.c   |  2 +-
>  13 files changed, 48 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/select.c b/fs/select.c
> index 9515c3fa1a03..8d5c1419416c 100644
> --- a/fs/select.c
> +++ b/fs/select.c
> @@ -82,7 +82,7 @@ u64 select_estimate_accuracy(struct timespec64 *tv)
>* Realtime tasks get a slack of 0 for obvious reasons.
>*/
>  
> - if (rt_task(current))
> + if (realtime_task(current))
>   return 0;
>  
>   ktime_get_ts64(&now);
> diff --git a/include/linux/ioprio.h b/include/linux/ioprio.h
> index db1249cd9692..6c00342b6166 100644
> --- a/include/linux/ioprio.h
> +++ b/include/linux/ioprio.h
> @@ -40,7 +40,7 @@ static inline int task_nice_ioclass(struct task_struct 
> *task)
>  {
>   if (task->policy == SCHED_IDLE)
>   return IOPRIO_CLASS_IDLE;
> - else if (task_is_realtime(task))
> + else if (task_has_realtime_policy(task))
>   return IOPRIO_CLASS_RT;
>   else
>   return IOPRIO_CLASS_BE;
> diff --git a/include/linux/sched/deadline.h b/include/linux/sched/deadline.h
> index df3aca89d4f5..5cb88b748ad6 100644
> --- a/include/linux/sched/deadline.h
> +++ b/include/linux/sched/deadline.h
> @@ -10,8 +10,6 @@
>  
>  #include 
>  
> -#define MAX_DL_PRIO  0
> -
>  static inline int dl_prio(int prio)
>  {
>   if (unlikely(prio < MAX_DL_PRIO))
> @@ -19,6 +17,10 @@ static inline int dl_prio(int prio)
>   return 0;
>  }
>  
> +/*
> + * Returns true if a task has a priority that belongs to DL class. PI-boosted
> + * tasks will return true. Use dl_policy() to ignore PI-boosted tasks.
> + */
>  static inline int dl_task(struct task_struct *p)
>  {
>   return dl_prio(p->prio);
> diff --git a/include/linux/sched/prio.h b/include/linux/sched/prio.h
> index ab83d85e1183..6ab43b4f72f9 100644
> --- a/include/linux/sched/prio.h
> +++ b/include/linux/sched/prio.h
> @@ -14,6 +14,7 @@
>   */
>  
>  #define MAX_RT_PRIO  100
> +#define MAX_DL_PRIO  0
>  
>  #define MAX_PRIO (MAX_RT_PRIO + NICE_WIDTH)
>  #define DEFAULT_PRIO (MAX_RT_PRIO + NICE_WIDTH / 2)
> diff --git a/include/linux/sched/rt.h b/include/linux/sched/rt.h
> index b2b9e6eb9683..b31be3c50152 100644
> --- a/include/linux/sched/rt.h
> +++ b/include/linux/sched/rt.h
> @@ -7,18 +7,43 @@
>  struct task_struct;
>  
>  static inline int rt_prio(int prio)
> +{
> + if (unlikely(prio < MAX_RT_PRIO && prio >= MAX_DL_PRIO))
> + return 1;
> + return 0;
> +}
> +
> +static inline int realtime_prio(int prio)
>  {
>   if (unlikely(prio < MAX_RT_PRIO))
>   return 1;
>   return 0;
>  }
>  
> +/*
> + * Returns true if a task has a 

Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks

2021-04-19 Thread Phil Auld
On Mon, Apr 19, 2021 at 06:17:47PM +0100 Valentin Schneider wrote:
> On 19/04/21 08:59, Phil Auld wrote:
> > On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> >> On 15/04/21 16:39, Rik van Riel wrote:
> >> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> >> struct lb_env *env)
> >> >>  if (tsk_cache_hot == -1)
> >> >>  tsk_cache_hot = task_hot(p, env);
> >> >>
> >> >> +   /*
> >> >> +* On a (sane) asymmetric CPU capacity system, the increase in
> >> >> compute
> >> >> +* capacity should offset any potential performance hit caused
> >> >> by a
> >> >> +* migration.
> >> >> +*/
> >> >> +   if ((env->dst_grp_type == group_has_spare) &&
> >> >> +   !migrate_degrades_capacity(p, env))
> >> >> +   tsk_cache_hot = 0;
> >> >
> >> > ... I'm starting to wonder if we should not rename the
> >> > tsk_cache_hot variable to something else to make this
> >> > code more readable. Probably in another patch :)
> >> >
> >>
> >> I'd tend to agree, but naming is hard. "migration_harmful" ?
> >
> > I thought Rik meant tsk_cache_hot, for which I'd suggest at least
> > buying a vowel and putting an 'a' in there :)
> >
> 
> That's the one I was eyeing: s/tsk_cache_hot/migration_harmful/ or
> somesuch. Right now we're feeding it:
>

Fair enough, my bad, the migration part immediately drew me to
migrate_degrades_capacity().

> o migrate_degrades_locality()
> o task_hot()
> 
> and I'm adding another one, so that's 2/3 which don't actually care about
> cache hotness, but rather "does doing this migration degrade/improve
> $criterion?"
> 

prefer_no_migrate? 


Cheers,
Phil
-- 



Re: [PATCH 2/2] sched/fair: Relax task_hot() for misfit tasks

2021-04-19 Thread Phil Auld
On Fri, Apr 16, 2021 at 10:43:38AM +0100 Valentin Schneider wrote:
> On 15/04/21 16:39, Rik van Riel wrote:
> > On Thu, 2021-04-15 at 18:58 +0100, Valentin Schneider wrote:
> >> Consider the following topology:
> >>
> >> Long story short, preempted misfit tasks are affected by task_hot(),
> >> while
> >> currently running misfit tasks are intentionally preempted by the
> >> stopper
> >> task to migrate them over to a higher-capacity CPU.
> >>
> >> Align detach_tasks() with the active-balance logic and let it pick a
> >> cache-hot misfit task when the destination CPU can provide a capacity
> >> uplift.
> >>
> >> Signed-off-by: Valentin Schneider 
> >
> > Reviewed-by: Rik van Riel 
> >
> 
> Thanks!
> 
> >
> > This patch looks good, but...
> >
> >> @@ -7672,6 +7698,15 @@ int can_migrate_task(struct task_struct *p,
> >> struct lb_env *env)
> >>  if (tsk_cache_hot == -1)
> >>  tsk_cache_hot = task_hot(p, env);
> >>
> >> +  /*
> >> +   * On a (sane) asymmetric CPU capacity system, the increase in
> >> compute
> >> +   * capacity should offset any potential performance hit caused
> >> by a
> >> +   * migration.
> >> +   */
> >> +  if ((env->dst_grp_type == group_has_spare) &&
> >> +  !migrate_degrades_capacity(p, env))
> >> +  tsk_cache_hot = 0;
> >
> > ... I'm starting to wonder if we should not rename the
> > tsk_cache_hot variable to something else to make this
> > code more readable. Probably in another patch :)
> >
> 
> I'd tend to agree, but naming is hard. "migration_harmful" ?

I thought Rik meant tsk_cache_hot, for which I'd suggest at least
buying a vowel and putting an 'a' in there :) 


Cheers,
Phil

> 
> > --
> > All Rights Reversed.
> 

-- 



Re: [PATCH v4 1/4] sched/fair: Introduce primitives for CFS bandwidth burst

2021-03-18 Thread Phil Auld
On Thu, Mar 18, 2021 at 09:26:58AM +0800 changhuaixin wrote:
> 
> 
> > On Mar 17, 2021, at 4:06 PM, Peter Zijlstra  wrote:
> > 
> > On Wed, Mar 17, 2021 at 03:16:18PM +0800, changhuaixin wrote:
> > 
> >>> Why do you allow such a large burst? I would expect something like:
> >>> 
> >>>   if (burst > quota)
> >>>   return -EINVAL;
> >>> 
> >>> That limits the variance in the system. Allowing super long bursts seems
> >>> to defeat the entire purpose of bandwidth control.
> >> 
> >> I understand your concern. Surely large burst value might allow super
> >> long bursts thus preventing bandwidth control entirely for a long
> >> time.
> >> 
> >> However, I am afraid it is hard to decide what the maximum burst
> >> should be from the bandwidth control mechanism itself. Allowing some
> >> burst to the maximum of quota is helpful, but not enough. There are
> >> cases where workloads are bursty that they need many times more than
> >> quota in a single period. In such cases, limiting burst to the maximum
> >> of quota fails to meet the needs.
> >> 
> >> Thus, I wonder whether is it acceptable to leave the maximum burst to
> >> users. If the desired behavior is to allow some burst, configure burst
> >> accordingly. If that is causing variance, use share or other fairness
> >> mechanism. And if fairness mechanism still fails to coordinate, do not
> >> use burst maybe.
> > 
> > It's not fairness, bandwidth control is about isolation, and burst
> > introduces interference.
> > 
> >> In this way, cfs_b->buffer can be removed while cfs_b->max_overrun is
> >> still needed maybe.
> > 
> > So what is the typical avg,stdev,max and mode for the workloads where you
> > find you need this?
> > 
> > I would really like to put a limit on the burst. IMO a workload that has
> > a burst many times longer than the quota is plain broken.
> 
> I see. Then the problem comes down to how large the limit on burst shall be.
> 
> I have sampled the CPU usage of a bursty container in 100ms periods. The
> statistics are:
> average : 42.2%
> stddev  : 81.5%
> max     : 844.5%
> P95     : 183.3%
> P99     : 437.0%
> 
> If quota is 10ms, burst buffer needs to be 8 times more in order for this
> workload not to be throttled. I can't say this is typical, but these
> workloads exist. On a machine running Kubernetes containers, where there is
> often room for such burst and the interference is hard to notice, users
> would prefer allowing such bursts over being throttled occasionally.
>

I admit to not having followed all the history of this patch set. That said,
when I see the above I just think your quota is too low for your workload.

The burst (mis?)feature seems to be a way to bypass the quota.  And it sort of
assumes cooperative containers that will only burst when they need it and then
go back to normal.

> In this sense, I suggest limiting the burst buffer to around 16 times the
> quota. That should be enough for users to improve tail latency caused by
> throttling. And users might choose a smaller one or even none, if the
> interference is unacceptable. What do you think?
> 

Having quotas that can regularly be exceeded by 16 times seems to make the
concept of a quota meaningless.  I'd have thought a burst would be some small
percentage.

What if several such containers burst at the same time? Can't that lead to
overcommit that can affect other well-behaved containers?
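
(Rough illustrative numbers, assuming the usual period/quota accounting and
not taken from the thread:

    period = 100ms, quota = 10ms        ->  nominal cap of 10% of one CPU
    burst  = 16 * quota = 160ms         ->  up to ~170ms of runtime could be
                                            spent inside a single 100ms period,
                                            i.e. well over one full CPU for
                                            that window, and several groups
                                            bursting together multiply that.)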


Cheers,
Phil

-- 



Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

2020-11-09 Thread Phil Auld
On Mon, Nov 09, 2020 at 03:38:15PM + Mel Gorman wrote:
> On Mon, Nov 09, 2020 at 10:24:11AM -0500, Phil Auld wrote:
> > Hi,
> > 
> > On Fri, Nov 06, 2020 at 04:00:10PM + Mel Gorman wrote:
> > > On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > > > On Fri, 6 Nov 2020 at 13:03, Mel Gorman  
> > > > wrote:
> > > > >
> > > > > On Wed, Nov 04, 2020 at 09:42:05AM +, Mel Gorman wrote:
> > > > > > While it's possible that some other factor masked the impact of the patch,
> > > > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > > > likely not have been merged. I've queued the tests on the remaining
> > > > > > machines to see if something more conclusive falls out.
> > > > > >
> > > > >
> > > > > It's not as conclusive as I would like. fork_test generally benefits
> > > > > across the board but I do not put much weight in that.
> > > > >
> > > > > Otherwise, it's workload and machine-specific.
> > > > >
> > > > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > > > revert at the low utilisation except one 2-socket haswell machine
> > > > > which showed higher variability when the machine was fully
> > > > > utilised.
> > > > 
> > > > There is a pending patch to should improve this bench:
> > > > https://lore.kernel.org/patchwork/patch/1330614/
> > > > 
> > > 
> > > Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> > > over the weekend. That particular patch was on my radar anyway. It just
> > > got bumped up the schedule a little bit.
> > >
> > 
> > 
> > We've run some of our perf tests against various kernels in this thread.
> > By default RHEL configs run with the performance governor.
> > 
> 
> This aspect is somewhat critical because the patches affect CPU
> selection. If a mostly idle CPU is used due to spreading load wider,
> it can take longer to ramp up to the highest frequency. It can be a
> dominating factor and may account for some of the differences.
> 
> Generally my tests are not based on the performance governor because a)
> it's not a universal win and b) the powersave/ondemand governors should
> be able to function reasonably well. For short-lived workloads it may
> not matter but ultimately schedutil should be good enough that it can
> keep track of task utilisation after migrations and select appropriate
> frequencies based on the tasks historical behaviour.
>

Yes, I suspect that is why we don't see the more general performance hits
you're seeing.

I agree that schedutil would be nice. I don't think it's quite there yet, but
that's anecdotal. Current RHEL configs don't even enable it so it's harder to
test. That's something I'm working on getting changed. I'd like to make it
the default eventually but at least we need to have it available...

Cheers,
Phil


> -- 
> Mel Gorman
> SUSE Labs
> 

-- 



Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

2020-11-09 Thread Phil Auld
Hi,

On Fri, Nov 06, 2020 at 04:00:10PM + Mel Gorman wrote:
> On Fri, Nov 06, 2020 at 02:33:56PM +0100, Vincent Guittot wrote:
> > On Fri, 6 Nov 2020 at 13:03, Mel Gorman  wrote:
> > >
> > > On Wed, Nov 04, 2020 at 09:42:05AM +, Mel Gorman wrote:
> > > > While it's possible that some other factor masked the impact of the 
> > > > patch,
> > > > the fact it's neutral for two workloads in 5.10-rc2 is suspicious as it
> > > > indicates that if the patch was implemented against 5.10-rc2, it would
> > > > likely not have been merged. I've queued the tests on the remaining
> > > > machines to see if something more conclusive falls out.
> > > >
> > >
> > > It's not as conclusive as I would like. fork_test generally benefits
> > > across the board but I do not put much weight in that.
> > >
> > > Otherwise, it's workload and machine-specific.
> > >
> > > schbench: (wakeup latency sensitive), all machines benefitted from the
> > > revert at the low utilisation except one 2-socket haswell machine
> > > which showed higher variability when the machine was fully
> > > utilised.
> > 
> > There is a pending patch to should improve this bench:
> > https://lore.kernel.org/patchwork/patch/1330614/
> > 
> 
> Ok, I've slotted this one in with a bunch of other stuff I wanted to run
> over the weekend. That particular patch was on my radar anyway. It just
> got bumped up the schedule a little bit.
>


We've run some of our perf tests against various kernels in this thread.
By default RHEL configs run with the performance governor.


For 5.8 to 5.9 we can confirm Mel's results. But mostly in microbenchmarks.
We see microbenchmark hits with fork, exec and unmap.  Real workloads showed
no difference between the two except for the EPYC first generation (Naples)
servers.  On those systems NAS and SPECjvm2008 show a drop of about 10% but
with very high variance. 


With the spread llc patch from Vincent on 5.9 we saw no performance change
in our benchmarks. 


On 5.9 with and without Julia's patch showed no real performance change.
The only difference was an increase in hackbench latency on the same EPYC
first gen servers.


As I mentioned earlier in the thread we have all the 5.9 patches in this area
in our development distro kernel (plus a handful from 5.10-rc) and don't see
the same effect we see here between 5.8 and 5.9 caused by this patch.  But
there are other variables there.  We've queued up a comparison between that
kernel and one with just the patch in question reverted.  That may tell us
if there is an effect that is otherwise being masked. 


Jirka - feel free to correct me if I mis-summarized your results :)

Cheers,
Phil

-- 



Re: [PATCH v1] sched/fair: update_pick_idlest() Select group with lowest group_util when idle_cpus are equal

2020-11-02 Thread Phil Auld
Hi,

On Mon, Nov 02, 2020 at 12:06:21PM +0100 Vincent Guittot wrote:
> On Mon, 2 Nov 2020 at 11:50, Mel Gorman  wrote:
> >
> > On Tue, Jul 14, 2020 at 08:59:41AM -0400, peter.pu...@linaro.org wrote:
> > > From: Peter Puhov 
> > >
> > > v0: https://lkml.org/lkml/2020/6/16/1286
> > >
> > > Changes in v1:
> > > - Test results formatted in a table form as suggested by Valentin 
> > > Schneider
> > > - Added explanation by Vincent Guittot why nr_running may not be 
> > > sufficient
> > >
> > > In slow path, when selecting idlest group, if both groups have type
> > > group_has_spare, only idle_cpus count gets compared.
> > > As a result, if multiple tasks are created in a tight loop,
> > > and go back to sleep immediately
> > > (while waiting for all tasks to be created),
> > > they may be scheduled on the same core, because CPU is back to idle
> > > when the new fork happen.
> > >
> >
> > Intuitively, this made some sense but it's a regression magnet. For those
> > that don't know, I run a grid that among other things, operates similar to
> > the Intel 0-day bot but runs much long-lived tests on a less frequent basis
> > -- can be a few weeks, sometimes longer depending on the grid activity.
> > Where it finds regressions, it bisects them and generates a report.
> >
> > While all tests have not completed, I currently have 14 separate
> > regressions across 4 separate tests on 6 machines which are Broadwell,
> > Haswell and EPYC 2 machines (all x86_64 of course but different generations
> > and vendors). The workload configurations in mmtests are
> >
> > pagealloc-performance-aim9
> > workload-shellscripts
> > workload-kerndevel
> > scheduler-unbound-hackbench
> >
> > When reading the reports, the first and second columns are what it was
> > bisecting against. The second 3rd last commit is the "last good commit"
> > and the last column is "first bad commit". The first bad commit is always
> > this patch
> >
> > The main concern is that all of these workloads have very short-lived tasks
> > which is exactly what this patch is meant to address so either sysbench
> > and futex behave very differently on the machine that was tested or their
> > microbenchmark nature found one good corner case but missed bad ones.
> >
> > I have not investigated why because I do not have the bandwidth
> > to do a detailed study (I was off for a few days and my backlog is
> > severe). However, I recommend in before v5.10 this be reverted and retried.
> > If I'm cc'd on v2, I'll run the same tests through the grid and see what
> > falls out.
> 
> I'm going to have a look at the regressions and see if  patches that
> have been queued for v5.10 or even more recent patch can help or if
> the patch should be adjusted
>

Fwiw, we have pulled this in, along with some of the 5.10-rc1 fixes in this
area and in the load balancing code.

We found some load balancing improvements and some minor overall perf
gains in a few places, but generally did not see any difference from before
the commit mentioned here.

I'm wondering, Mel, if you have compared 5.10-rc1? 

We don't have everything though so it's possible something we have
not pulled back is interacting with this patch, or we are missing something
in our testing, or it's better with the later fixes in 5.10 or ...
something else :)

I'll try to see if we can run some direct 5.8 - 5.9 tests like these. 

Cheers,
Phil

> >
> > I'll show one example of each workload from one machine.
> >
> > pagealloc-performance-aim9
> > --
> >
> > While multiple tests are shown, the exec_test and fork_test are
> > regressing and these are very short lived. 67% regression for exec_test
> > and 32% regression for fork_test
> >
> >                          initial        initial        last           penup          last           penup          first
> >                          good-v5.8      bad-v5.9       bad-58934356   bad-e0078e2e   good-46132e3a  good-aa93cd53  bad-3edecfef
> > Min   page_test   522580.00 (   0.00%)   537880.00 (   2.93%)   536842.11 (   2.73%)   542300.00 (   3.77%)   537993.33 (   2.95%)   526660.00 (   0.78%)   532553.33 (   1.91%)
> > Min   brk_test   1987866.67 (   0.00%)  2028666.67 (   2.05%)  2016200.00 (   1.43%)  2014856.76 (   1.36%)  2004663.56 (   0.84%)  1984466.67 (  -0.17%)  2025266.67 (   1.88%)
> > Min   exec_test      877.75 (   0.00%)     284.33 ( -67.61%)     285.14 ( -67.51%)     285.14 ( -67.51%)     852.10 (  -2.92%)     932.05 (   6.19%)     285.62 ( -67.46%)
> > Min   fork_test     3213.33 (   0.00%)    2154.26 ( -32.96%)    2180.85 ( -32.13%)    2214.10 ( -31.10%)    3257.83 (   1.38%)    4154.46 (  29.29%)    2194.15 ( -31.72%)
> > Hmean page_test   544508.39 (   0.00%)   545446.23 (   0.17%)   542617.62 (  -0.35%)   546829.87 (

Re: [PATCH] sched/fair: remove the spin_lock operations

2020-11-02 Thread Phil Auld
On Fri, Oct 30, 2020 at 10:16:29PM + David Laight wrote:
> From: Benjamin Segall
> > Sent: 30 October 2020 18:48
> > 
> > Hui Su  writes:
> > 
> > > Since 'ab93a4bc955b ("sched/fair: Remove
> > > distribute_running fromCFS bandwidth")',there is
> > > nothing to protect between raw_spin_lock_irqsave/store()
> > > in do_sched_cfs_slack_timer().
> > >
> > > So remove it.
> > 
> > Reviewed-by: Ben Segall 
> > 
> > (I might nitpick the subject to be clear that it should be trivial
> > because the lock area is empty, or call them dead or something, but it's
> > not all that important)
> 
> I don't know about this case, but a lock+unlock can be used
> to ensure that nothing else holds the lock when acquiring
> the lock requires another lock be held.
> 
> So if the normal sequence is:
>   lock(table)
>   # lookup item
>   lock(item)
>   unlock(table)
>   
>   unlock(item)
> 
> Then it can make sense to do:
>   lock(table)
>   lock(item)
>   unlock(item)
>   
>   unlock(table)
> 
> although that ought to deserve a comment.
>

Nah, this one used to be like this:

	raw_spin_lock_irqsave(&cfs_b->lock, flags);
	lsub_positive(&cfs_b->runtime, runtime);
	cfs_b->distribute_running = 0;
	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);

It's just a leftover. I agree that if it was there for some other
purpose that it would really need a comment. In this case, it's an
artifact of patch-based development I think.


Cheers,
Phil


>   David
> 
> -
> Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 
> 1PT, UK
> Registration No: 1397386 (Wales)
> 

-- 



Re: [PATCH] sched/fair: remove the spin_lock operations

2020-10-30 Thread Phil Auld
On Fri, Oct 30, 2020 at 10:46:21PM +0800 Hui Su wrote:
> Since 'ab93a4bc955b ("sched/fair: Remove
> distribute_running fromCFS bandwidth")',there is
> nothing to protect between raw_spin_lock_irqsave/store()
> in do_sched_cfs_slack_timer().
> 
> So remove it.
> 
> Signed-off-by: Hui Su 
> ---
>  kernel/sched/fair.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 290f9e38378c..5ecbf5e63198 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5105,9 +5105,6 @@ static void do_sched_cfs_slack_timer(struct 
> cfs_bandwidth *cfs_b)
>   return;
>  
>   distribute_cfs_runtime(cfs_b);
> -
> - raw_spin_lock_irqsave(&cfs_b->lock, flags);
> - raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>  }
>  
>  /*
> -- 
> 2.29.0
> 
> 

Nice :)

Reviewed-by: Phil Auld 
-- 



Re: [PATCH 0/8] Style and small fixes for core-scheduling

2020-10-28 Thread Phil Auld
Hi John,

On Wed, Oct 28, 2020 at 05:19:09AM -0700 John B. Wyatt IV wrote:
> Patchset of style and small fixes for the 8th iteration of the
> Core-Scheduling feature.
> 
> Style fixes include changing spaces to tabs, inserting new lines before
> declarations, removing unused braces, and spelling.
> 
> Two small fixes involving changing a main() to main(void) and removing an
> unused 'else'.
> 
> All issues were reported by checkpatch.
> 
> I am a new Linux kernel developer interning with the Outreachy project.
>

Welcome!

> Please feel free to advise on any corrections or improvements that can be
> made.

Thanks for these. I wonder, though, if it would not make more sense
to post these changes as comments on the original as-yet-unmerged
patches that you are fixing up? 


Cheers,
Phil

> 
> John B. Wyatt IV (8):
>   sched: Correct misspellings in core-scheduling.rst
>   sched: Fix bad function definition
>   sched: Fix some style issues in test_coresched.c
>   sched: Remove unused else
>   sched: Add newline after declaration
>   sched: Remove unneeded braces
>   sched: Replace spaces with tabs
>   sched: Add newlines after declarations
> 
>  Documentation/admin-guide/hw-vuln/core-scheduling.rst | 8 
>  arch/x86/include/asm/thread_info.h| 4 ++--
>  kernel/sched/core.c   | 6 --
>  kernel/sched/coretag.c| 3 ++-
>  tools/testing/selftests/sched/test_coresched.c| 8 
>  5 files changed, 16 insertions(+), 13 deletions(-)
> 
> -- 
> 2.28.0
> 

-- 



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Phil Auld
On Thu, Oct 22, 2020 at 09:32:55PM +0100 Mel Gorman wrote:
> On Thu, Oct 22, 2020 at 07:59:43PM +0200, Rafael J. Wysocki wrote:
> > > > Agreed. I'd like the option to switch back if we make the default 
> > > > change.
> > > > It's on the table and I'd like to be able to go that way.
> > > >
> > >
> > > Yep. It sounds chicken, but it's a useful safety net and a reasonable
> > > way to deprecate a feature. It's also useful for bug creation -- User X
> > > running whatever found that schedutil is worse than the old governor and
> > > had to temporarily switch back. Repeat until complaining stops and then
> > > tear out the old stuff.
> > >
> > > When/if there is a patch setting schedutil as the default, cc suitable
> > > distro people (Giovanni and myself for openSUSE).
> > 
> > So for the record, Giovanni was on the CC list of the "cpufreq:
> > intel_pstate: Use passive mode by default without HWP" patch that this
> > discussion resulted from (and which kind of belongs to the above
> > category).
> > 
> 
> Oh I know, I did not mean to suggest that you did not. He made people
> aware that this was going to be coming down the line and has been looking
> into the "what if schedutil was the default" question.  AFAIK, it's still
> a work-in-progress and I don't know all the specifics but he knows more
> than I do on the topic. I only know enough that if we flipped the switch
> tomorrow that we could be plagued with google searches suggesting it be
> turned off again just like there is still broken advice out there about
> disabling intel_pstate for usually the wrong reasons.
> 
> The passive patch was a clear flag that the intent is that schedutil will
> be the default at some unknown point in the future. That point is now a
> bit closer and this thread could have encouraged a premature change of
> the default resulting in unfair finger pointing at one company's test
> team. If at least two distos check it out and it still goes wrong, at
> least there will be shared blame :/
> 
> > > Other distros assuming they're watching can nominate their own victim.
> > 
> > But no other victims had been nominated at that time.
> 
> We have one, possibly two if Phil agrees. That's better than zero or
> unfairly placing the full responsibility on the Intel guys that have been
> testing it out.
>

Yes. I agree and we (RHEL) are planning to test this soon. I'll try to get
to it.  You can certainly CC me, please, although I also try to watch for this
sort of thing on list. 


Cheers,
Phil

> -- 
> Mel Gorman
> SUSE Labs
> 

-- 



Re: default cpufreq gov, was: [PATCH] sched/fair: check for idle core

2020-10-22 Thread Phil Auld
On Thu, Oct 22, 2020 at 03:58:13PM +0100 Colin Ian King wrote:
> On 22/10/2020 15:52, Mel Gorman wrote:
> > On Thu, Oct 22, 2020 at 02:29:49PM +0200, Peter Zijlstra wrote:
> >> On Thu, Oct 22, 2020 at 02:19:29PM +0200, Rafael J. Wysocki wrote:
>  However I do want to retire ondemand, conservative and also very much
>  intel_pstate/active mode.
> >>>
> >>> I agree in general, but IMO it would not be prudent to do that without
> >>> making schedutil provide the same level of performance in all of the
> >>> relevant use cases.
> >>
> >> Agreed; I though to have understood we were there already.
> > 
> > AFAIK, not quite (added Giovanni as he has been paying more attention).
> > Schedutil has improved since it was merged but not to the extent where
> > it is a drop-in replacement. The standard it needs to meet is that
> > it is at least equivalent to powersave (in intel_pstate language)
> > or ondemand (acpi_cpufreq) and within a reasonable percentage of the
> > performance governor. Defaulting to performance is a) giving up and b)
> > the performance governor is not a universal win. There are some questions
> > currently on whether schedutil is good enough when HWP is not available.
> > There was some evidence (I don't have the data, Giovanni was looking into
> > it) that HWP was a requirement to make schedutil work well. That is a
> > hazard in itself because someone could test on the latest gen Intel CPU
> > and conclude everything is fine and miss that Intel-specific technology
> > is needed to make it work well while throwing everyone else under a bus.
> > Giovanni knows a lot more than I do about this, I could be wrong or
> > forgetting things.
> > 
> > For distros, switching to schedutil by default would be nice because
> > frequency selection state would follow the task instead of being per-cpu
> > and we could stop worrying about different HWP implementations but it's
> > not at the point where the switch is advisable. I would expect hard data
> > before switching the default and still would strongly advise having a
> > period of time where we can fall back when someone inevitably finds a
> > new corner case or exception.
> 
> ..and it would be really useful for distros to know when the hard data
> is available so that they can make an informed decision when to move to
> schedutil.
>

I think distros are on the hook to generate that hard data themselves
with which to make such a decision.  I don't expect it to be done by
someone else. 

> > 
> > For reference, SLUB had the same problem for years. It was switched
> > on by default in the kernel config but it was a long time before
> > SLUB was generally equivalent to SLAB in terms of performance. Block
> > multiqueue also had vaguely similar issues before the default changes
> > and a period of time before it was removed removed (example whinging mail
> > https://lore.kernel.org/lkml/20170803085115.r2jfz2lofy5sp...@techsingularity.net/)
> > It's schedutil's turn :P
> > 
> 

Agreed. I'd like the option to switch back if we make the default change.
It's on the table and I'd like to be able to go that way. 

Cheers,
Phil

-- 



Re: [PATCH] sched/fair: Remove the force parameter of update_tg_load_avg()

2020-09-25 Thread Phil Auld
On Thu, Sep 24, 2020 at 09:47:55AM +0800 Xianting Tian wrote:
> In the file fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used,
> sometimes update_tg_load_avg(cfs_rq, false) is used.
> update_tg_load_avg() has the parameter force, but in the current code
> it is never set to 1 or true, so remove the force parameter.
> 
> Signed-off-by: Xianting Tian 
> ---
>  kernel/sched/fair.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1a68a0536..7056fa97f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -831,7 +831,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>  void post_init_entity_util_avg(struct task_struct *p)
>  {
>  }
> -static void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> +static void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>  }
>  #endif /* CONFIG_SMP */
> @@ -3288,7 +3288,6 @@ static inline void cfs_rq_util_change(struct cfs_rq 
> *cfs_rq, int flags)
>  /**
>   * update_tg_load_avg - update the tg's load avg
>   * @cfs_rq: the cfs_rq whose avg changed
> - * @force: update regardless of how small the difference
>   *
>   * This function 'ensures': tg->load_avg := \Sum tg->cfs_rq[]->avg.load.
>   * However, because tg->load_avg is a global value there are performance
> @@ -3300,7 +3299,7 @@ static inline void cfs_rq_util_change(struct cfs_rq 
> *cfs_rq, int flags)
>   *
>   * Updating tg's load_avg is necessary before update_cfs_share().
>   */
> -static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force)
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
>  {
>   long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
>  
> @@ -3310,7 +3309,7 @@ static inline void update_tg_load_avg(struct cfs_rq 
> *cfs_rq, int force)
>   if (cfs_rq->tg == &root_task_group)
>   return;
>  
> - if (force || abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> + if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
>   atomic_long_add(delta, &cfs_rq->tg->load_avg);
>   cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
>   }
> @@ -3612,7 +3611,7 @@ static inline bool skip_blocked_update(struct 
> sched_entity *se)
>  
>  #else /* CONFIG_FAIR_GROUP_SCHED */
>  
> -static inline void update_tg_load_avg(struct cfs_rq *cfs_rq, int force) {}
> +static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
>  
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> @@ -3800,13 +3799,13 @@ static inline void update_load_avg(struct cfs_rq 
> *cfs_rq, struct sched_entity *s
>* IOW we're enqueueing a task on a new CPU.
>*/
>   attach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq, 0);
> + update_tg_load_avg(cfs_rq);
>  
>   } else if (decayed) {
>   cfs_rq_util_change(cfs_rq, 0);
>  
>   if (flags & UPDATE_TG)
> - update_tg_load_avg(cfs_rq, 0);
> + update_tg_load_avg(cfs_rq);
>   }
>  }
>  
> @@ -7887,7 +7886,7 @@ static bool __update_blocked_fair(struct rq *rq, bool 
> *done)
>   struct sched_entity *se;
>  
>   if (update_cfs_rq_load_avg(cfs_rq_clock_pelt(cfs_rq), cfs_rq)) {
> - update_tg_load_avg(cfs_rq, 0);
> + update_tg_load_avg(cfs_rq);
>  
>   if (cfs_rq == &rq->cfs)
>   decayed = true;
> @@ -10786,7 +10785,7 @@ static void detach_entity_cfs_rq(struct sched_entity 
> *se)
>   /* Catch up with the cfs_rq and remove our load when we leave */
>   update_load_avg(cfs_rq, se, 0);
>   detach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq, false);
> + update_tg_load_avg(cfs_rq);
>   propagate_entity_cfs_rq(se);
>  }
>  
> @@ -10805,7 +10804,7 @@ static void attach_entity_cfs_rq(struct sched_entity 
> *se)
>   /* Synchronize entity with its cfs_rq */
>   update_load_avg(cfs_rq, se, sched_feat(ATTACH_AGE_LOAD) ? 0 : 
> SKIP_AGE_LOAD);
>   attach_entity_load_avg(cfs_rq, se);
> - update_tg_load_avg(cfs_rq, false);
> + update_tg_load_avg(cfs_rq);
>   propagate_entity_cfs_rq(se);
>  }
>  
> -- 
> 2.17.1
> 

LGTM,

Reviewed-by: Phil Auld 
-- 



Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Phil Auld
On Thu, Sep 24, 2020 at 10:43:12AM -0700 Tim Chen wrote:
> 
> 
> On 9/24/20 10:13 AM, Phil Auld wrote:
> > On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
> >>
> >>
> >> On 9/22/20 12:14 AM, Vincent Guittot wrote:
> >>
> >>>>
> >>>>>>
> >>>>>> And a quick test with hackbench on my octo cores arm64 gives for 12
> >>
> >> Vincent,
> >>
> >> Is it octo (=10) or octa (=8) cores on a single socket for your system?
> > 
> > In what Romance language does octo mean 10?  :)
> > 
> 
> Got confused by october, the tenth month. :)

It used to be the eighth month ;)

> 
> Tim
> 
> > 
> >> The L2 is per core or there are multiple L2s shared among groups of cores?
> >>
> >> Wonder if placing the threads within a L2 or not within
> >> an L2 could cause differences seen with Aubrey's test.
> >>
> >> Tim
> >>
> > 
> 

-- 



Re: [RFC PATCH v2] sched/fair: select idle cpu from idle cpumask in sched domain

2020-09-24 Thread Phil Auld
On Thu, Sep 24, 2020 at 09:37:33AM -0700 Tim Chen wrote:
> 
> 
> On 9/22/20 12:14 AM, Vincent Guittot wrote:
> 
> >>
> 
>  And a quick test with hackbench on my octo cores arm64 gives for 12
> 
> Vincent,
> 
> Is it octo (=10) or octa (=8) cores on a single socket for your system?

In what Romance language does octo mean 10?  :)


> The L2 is per core or there are multiple L2s shared among groups of cores?
> 
> Wonder if placing the threads within a L2 or not within
> an L2 could cause differences seen with Aubrey's test.
> 
> Tim
> 

-- 



Re: [RFC -V2] autonuma: Migrate on fault among multiple bound nodes

2020-09-22 Thread Phil Auld
Hi,

On Tue, Sep 22, 2020 at 02:54:01PM +0800 Huang Ying wrote:
> Now, AutoNUMA can only optimize the page placement among the NUMA nodes if the
> default memory policy is used.  Because the memory policy specified explicitly
> should take precedence.  But this seems too strict in some situations.  For
> example, on a system with 4 NUMA nodes, if the memory of an application is
> bound to the node 0 and 1, AutoNUMA can potentially migrate the pages between
> the node 0 and 1 to reduce cross-node accessing without breaking the explicit
> memory binding policy.
> 
> So in this patch, if mbind(.mode=MPOL_BIND, .flags=MPOL_MF_LAZY) is used to
> bind the memory of the application to multiple nodes, and in the hint page
> fault handler both the faulting page node and the accessing node are in the
> policy nodemask, the page will be tried to be migrated to the accessing node
> to reduce the cross-node accessing.
>

Do you have any performance numbers that show the effects of this on
a workload?


> [Peter Zijlstra: provided the simplified implementation method.]
> 
> Questions:
> 
> Sysctl knob kernel.numa_balancing can enable/disable AutoNUMA optimizing
> globally.  But for the memory areas that are bound to multiple NUMA nodes,
> even if the AutoNUMA is enabled globally via the sysctl knob, we still need
> to enable AutoNUMA again with a special flag.  Why not just optimize the page
> placement if possible as long as AutoNUMA is enabled globally?  The interface
> would look simpler with that.


I agree. I think it should try to do this if globally enabled.


> 
> Signed-off-by: "Huang, Ying" 
> Cc: Andrew Morton 
> Cc: Ingo Molnar 
> Cc: Mel Gorman 
> Cc: Rik van Riel 
> Cc: Johannes Weiner 
> Cc: "Matthew Wilcox (Oracle)" 
> Cc: Dave Hansen 
> Cc: Andi Kleen 
> Cc: Michal Hocko 
> Cc: David Rientjes 
> ---
>  mm/mempolicy.c | 17 +++--
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index eddbe4e56c73..273969204732 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2494,15 +2494,19 @@ int mpol_misplaced(struct page *page, struct 
> vm_area_struct *vma, unsigned long
>   break;
>  
>   case MPOL_BIND:
> -
>   /*
> -  * allows binding to multiple nodes.
> -  * use current page if in policy nodemask,
> -  * else select nearest allowed node, if any.
> -  * If no allowed nodes, use current [!misplaced].
> +  * Allows binding to multiple nodes.  If both current and
> +  * accessing nodes are in policy nodemask, migrate to
> +  * accessing node to optimize page placement. Otherwise,
> +  * use current page if in policy nodemask, else select
> +  * nearest allowed node, if any.  If no allowed nodes, use
> +  * current [!misplaced].
>*/
> - if (node_isset(curnid, pol->v.nodes))
> + if (node_isset(curnid, pol->v.nodes)) {
> + if (node_isset(thisnid, pol->v.nodes))
> + goto moron;

Nice label :)

>   goto out;
> + }
>   z = first_zones_zonelist(
>   node_zonelist(numa_node_id(), GFP_HIGHUSER),
>   gfp_zone(GFP_HIGHUSER),
> @@ -2516,6 +2520,7 @@ int mpol_misplaced(struct page *page, struct 
> vm_area_struct *vma, unsigned long
>  
>   /* Migrate the page towards the node whose CPU is referencing it */
>   if (pol->flags & MPOL_F_MORON) {
> +moron:
>   polnid = thisnid;
>  
>   if (!should_numa_migrate_memory(current, page, curnid, thiscpu))
> -- 
> 2.28.0
> 


Cheers,
Phil

-- 



Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks

2020-09-18 Thread Phil Auld
On Fri, Sep 18, 2020 at 12:39:28PM -0400 Phil Auld wrote:
> Hi Peter,
> 
> On Mon, Sep 14, 2020 at 01:42:02PM +0200 pet...@infradead.org wrote:
> > On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > > Vincent Guittot (4):
> > >   sched/fair: relax constraint on task's load during load balance
> > >   sched/fair: reduce minimal imbalance threshold
> > >   sched/fair: minimize concurrent LBs between domain level
> > >   sched/fair: reduce busy load balance interval
> > 
> > I see nothing objectionable there, a little more testing can't hurt, but
> > I'm tempted to apply them.
> > 
> > Phil, Mel, any chance you can run them through your respective setups?
> > 
> 
> Sorry for the delay. Things have been backing up...
> 
> We tested with this series and found there was no performance change in
> our test suites. (We don't have a good way to share the actual numbers
> outside right now, but since they aren't really different it probably
> doesn't matter much here.)
> 
> The difference we did see was a slight decrease in the number of tasks
> moved around at higher loads.  That seems to be a good thing even though
> it didn't directly show time-based performance benefits (and was pretty
> minor).
> 
> So if this helps other use cases we've got no problems with it.
>

Feel free to add a

Reviewed-by: Phil Auld 

Jirka did the actual testing so he can speak up with a Tested-by if he
wants to.


> Thanks,
> Phil
> 
> -- 
> 

-- 



Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks

2020-09-18 Thread Phil Auld
Hi Peter,

On Mon, Sep 14, 2020 at 01:42:02PM +0200 pet...@infradead.org wrote:
> On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (4):
> >   sched/fair: relax constraint on task's load during load balance
> >   sched/fair: reduce minimal imbalance threshold
> >   sched/fair: minimize concurrent LBs between domain level
> >   sched/fair: reduce busy load balance interval
> 
> I see nothing objectionable there, a little more testing can't hurt, but
> I'm tempted to apply them.
> 
> Phil, Mel, any chance you can run them through your respective setups?
> 

Sorry for the delay. Things have been backing up...

We tested with this series and found there was no performance change in
our test suites. (We don't have a good way to share the actual numbers
outside right now, but since they aren't really different it probably
doesn't matter much here.)

The difference we did see was a slight decrease in the number of tasks
moved around at higher loads.  That seems to be a good thing even though
it didn't directly show time-based performance benefits (and was pretty
minor).

So if this helps other use cases we've got no problems with it.

Thanks,
Phil

-- 



Re: [PATCH 0/4] sched/fair: Improve fairness between cfs tasks

2020-09-14 Thread Phil Auld
On Mon, Sep 14, 2020 at 01:42:02PM +0200 pet...@infradead.org wrote:
> On Mon, Sep 14, 2020 at 12:03:36PM +0200, Vincent Guittot wrote:
> > Vincent Guittot (4):
> >   sched/fair: relax constraint on task's load during load balance
> >   sched/fair: reduce minimal imbalance threshold
> >   sched/fair: minimize concurrent LBs between domain level
> >   sched/fair: reduce busy load balance interval
> 
> I see nothing objectionable there, a little more testing can't hurt, but
> I'm tempted to apply them.
> 
> Phil, Mel, any chance you can run them through your respective setups?
> 

Yep. I'll try to get something started today, results in a few days.

These look pretty innocuous. It'll be interesting to see what the effect is.


Cheers,
Phil
-- 



Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2020-09-08 Thread Phil Auld
Hi Qais,

On Mon, Sep 07, 2020 at 12:02:24PM +0100 Qais Yousef wrote:
> On 09/02/20 09:54, Phil Auld wrote:
> > > 
> > > I think this decoupling is not necessary. The natural place for those
> > > scheduler trace_event based on trace_points extension files is
> > > kernel/sched/ and here the internal sched.h can just be included.
> > > 
> > > If someone really wants to build this as an out-of-tree module there is
> > > an easy way to make kernel/sched/sched.h visible.
> > >
> > 
> > It's not so much that we really _want_ to do this in an external module.
> > But we aren't adding more trace events and my (limited) knowledge of
> > BPF led me to the conclusion that its raw tracepoint functionality
> > requires full events.  I didn't see any other way to do it.
> 
> I did have a patch that allowed that. It might be worth trying to upstream it.
> It just required a new macro which could be problematic.
> 
> https://github.com/qais-yousef/linux/commit/fb9fea29edb8af327e6b2bf3bc41469a8e66df8b
> 
> With the above I could attach using bpf::RAW_TRACEPOINT mechanism.
>

Yeah, that could work. I meant there was no way to do it with what was there :)

In our initial attempts at using BPF to get at nr_running (which I was not
involved in and don't have all the details...) there were issues being able to
keep up and losing events.  That may have been an implementation issue, but
using the module and trace-cmd doesn't have that problem. Hopefully you don't
see that using RAW_TRACEPOINTs.


Fwiw, I don't think these little helper routines are all that hard to maintain.
If something changes in those fields, which seems moderately unlikely at least
for many of them, the compiler will complain.

And I agree with you about preferring to use the public headers for the module.
I think we can work around it though, if needed.


Cheers,
Phil

> Cheers
> 
> --
> Qais Yousef
> 

-- 



Re: Requirements to control kernel isolation/nohz_full at runtime

2020-09-03 Thread Phil Auld
On Thu, Sep 03, 2020 at 03:30:15PM -0300 Marcelo Tosatti wrote:
> On Thu, Sep 03, 2020 at 03:23:59PM -0300, Marcelo Tosatti wrote:
> > On Tue, Sep 01, 2020 at 12:46:41PM +0200, Frederic Weisbecker wrote:
> > > Hi,
> > 
> > Hi Frederic,
> > 
> > Thanks for the summary! Looking forward to your comments...
> > 
> > > I'm currently working on making nohz_full/nohz_idle runtime toggable
> > > and some other people seem to be interested as well. So I've dumped
> > > a few thoughts about some pre-requirements to achieve that for those
> > > interested.
> > > 
> > > As you can see, there is a bit of hard work in the way. I'm iterating
> > > that in https://pad.kernel.org/p/isolation, feel free to edit:
> > > 
> > > 
> > > == RCU nocb ==
> > > 
> > > Currently controllable with "rcu_nocbs=" boot parameter and/or through 
> > > nohz_full=/isolcpus=nohz
> > > We need to make it toggeable at runtime. Currently handling that:
> > > v1: https://lwn.net/Articles/820544/
> > > v2: coming soon
> > 
> > Nice.
> > 
> > > == TIF_NOHZ ==
> > > 
> > > Need to get rid of that in order not to trigger syscall slowpath on CPUs 
> > > that don't want nohz_full.
> > > Also we don't want to iterate all threads and clear the flag when the 
> > > last nohz_full CPU exits nohz_full
> > > mode. Prefer static keys to call context tracking on archs. x86 does that 
> > > well.
> > > 
> > > == Proper entry code ==
> > > 
> > > We must make sure that a given arch never calls exception_enter() / 
> > > exception_exit().
> > > This saves the previous state of context tracking and switch to kernel 
> > > mode (from context tracking POV)
> > > temporarily. Since this state is saved on the stack, this prevents us 
> > > from turning off context tracking
> > > entirely on a CPU: The tracking must be done on all CPUs and that takes 
> > > some cycles.
> > > 
> > > This means that, considering early entry code (before the call to context 
> > > tracking upon kernel entry,
> > > and after the call to context tracking upon kernel exit), we must take 
> > > care of few things:
> > > 
> > > 1) Make sure early entry code can't trigger exceptions. Or if it does, 
> > > the given exception can't schedule
> > > or use RCU (unless it calls rcu_nmi_enter()). Otherwise the exception 
> > > must call exception_enter()/exception_exit()
> > > which we don't want.
> > > 
> > > 2) No call to schedule_user().
> > > 
> > > 3) Make sure early entry code is not interruptible or 
> > > preempt_schedule_irq() would rely on
> > > exception_entry()/exception_exit()
> > > 
> > > 4) Make sure early entry code can't be traced (no call to 
> > > preempt_schedule_notrace()), or if it does it
> > > can't schedule
> > > 
> > > I believe x86 does most of that well. In the end we should remove 
> > > exception_enter()/exit implementations
> > > in x86 and replace it with a check that makes sure context_tracking state 
> > > is not in USER. An arch meeting
> > > all the above conditions would earn a 
> > > CONFIG_ARCH_HAS_SANE_CONTEXT_TRACKING. Being able to toggle nohz_full
> > > at runtime would depend on that.
> > > 
> > > 
> > > == Cputime accounting ==
> > > 
> > > Both write and read side must switch to tick based accounting and drop 
> > > the use of seqlock in task_cputime(),
> > > task_gtime(), kcpustat_field(), kcpustat_cpu_fetch(). Special 
> > > ordering/state machine is required to make that without races.
> > > 
> > > == Nohz ==
> > > 
> > > Switch from nohz_full to nohz_idle. Mind a few details:
> > > 
> > > 1) Turn off 1Hz offlined tick handled in housekeeping
> > > 2) Handle tick dependencies, take care of racing CPUs 
> > > setting/clearing tick dependency. It's much trickier when
> > > we switch from nohz_idle to nohz_full
> > > 
> > > == Unbound affinity ==
> > > 
> > > Restore kernel threads, workqueue, timers, etc... wide affinity. But take 
> > > care of cpumasks that have been set through other
> > > interfaces: sysfs, procfs, etc...
> > 
> > We were looking at a userspace interface: what would be a proper
> > (unified, similar to isolcpus= interface) and its implementation:
> > 
> > The simplest idea for interface seemed to be exposing the integer list of
> > CPUs and isolation flags to userspace (probably via sysfs).
> > 
> > The scheme would allow flags to be separately enabled/disabled, 
> > with not all flags being necessary toggable (could for example
> > disallow nohz_full= toggling until it is implemented, but allow for
> > other isolation features to be toggable).
> > 
> > This would require per flag housekeeping_masks (instead of a single).
> > 
> > Back to the userspace interface, you mentioned earlier that cpusets
> > was a possibility for it. However:
> > 
> > "Cpusets provide a Linux kernel mechanism to constrain which CPUs and
> > Memory Nodes are used by a process or set of processes.
> > 
> > The Linux kernel already has a pair of mechanisms to specify on which
> > CPUs a task may be scheduled (sched_setaffinity) and on 

Re: [PATCH v2] sched/debug: Add new tracepoint to track cpu_capacity

2020-09-02 Thread Phil Auld
On Wed, Sep 02, 2020 at 12:44:42PM +0200 Dietmar Eggemann wrote:
> + Phil Auld 
>

Thanks Dietmar.


> On 28/08/2020 19:26, Qais Yousef wrote:
> > On 08/28/20 19:10, Dietmar Eggemann wrote:
> >> On 28/08/2020 12:27, Qais Yousef wrote:
> >>> On 08/28/20 10:00, vincent.donnef...@arm.com wrote:
> >>>> From: Vincent Donnefort 
> 
> [...]
> 
> >> Can you remind me why we have all these helper functions like
> >> sched_trace_rq_cpu_capacity?
> > 
> > struct rq is defined in kernel/sched/sched.h. It's not exported. Exporting
> > these helper functions was the agreement to help modules trace internal 
> > info.
> > By passing generic info you decouple the tracepoint from giving specific 
> > info
> > and allow the modules to extract all the info they need from the same
> > tracepoint. IE: if you need more than just cpu_capacity from this 
> > tracepoint,
> > you can get that without having to continuously add extra arguments 
> > everytime
> > you need an extra piece of info. Unless this info is not in the rq of 
> > course.
> 
> I think this decoupling is not necessary. The natural place for those
> scheduler trace_event based on trace_points extension files is
> kernel/sched/ and here the internal sched.h can just be included.
> 
> If someone really wants to build this as an out-of-tree module there is
> an easy way to make kernel/sched/sched.h visible.
>

It's not so much that we really _want_ to do this in an external module.
But we aren't adding more trace events and my (limited) knowledge of
BPF led me to the conclusion that its raw tracepoint functionality
requires full events.  I didn't see any other way to do it.

We could put sched_tp in the tree under a debug CONFIG :)
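
FWIW the glue on the module side boils down to very little. A minimal,
completely untested sketch (it just prints via trace_printk rather than
re-emitting proper trace events the way sched_tp does), using the exported
helpers:

#include <linux/module.h>
#include <linux/sched.h>
#include <linux/tracepoint.h>
#include <trace/events/sched.h>

/* Probe for the bare tracepoint: (void *data, <tracepoint args>). */
static void probe_nr_running(void *data, struct rq *rq, int change)
{
	trace_printk("cpu=%d change=%d nr_running=%d\n",
		     sched_trace_rq_cpu(rq), change,
		     sched_trace_rq_nr_running(rq));
}

static int __init nr_running_probe_init(void)
{
	return register_trace_sched_update_nr_running_tp(probe_nr_running, NULL);
}

static void __exit nr_running_probe_exit(void)
{
	unregister_trace_sched_update_nr_running_tp(probe_nr_running, NULL);
	tracepoint_synchronize_unregister();
}

module_init(nr_running_probe_init);
module_exit(nr_running_probe_exit);
MODULE_LICENSE("GPL");

The rest of what sched_tp does is essentially the TRACE_EVENT boilerplate
to turn that callback into something trace-cmd and friends can consume.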

> CFLAGS_sched_tp.o := -I$KERNEL_SRC/kernel/sched
> 
> all:
> make -C $KERNEL_SRC M=$(PWD) modules
> 
> This allowed me to build our trace_event extension module (sched_tp.c,
> sched_events.h) out-of-tree and I was able to get rid of all the
> sched_trace_foo() functions (in fair.c, include/linux/sched.h) and code
> there content directly in foo.c
> 
> There are two things we would need exported from the kernel:
> 
> (1) cfs_rq_tg_path() to print the path of a taskgroup cfs_rq or se.
> 
> (2) sched_uclamp_used so uclamp_rq_util_with() can be used in
> sched_events.h.
> 
> I put Phil Auld on cc because of his trace_point
> sched_update_nr_running_tp. I think Phil was using sched_tp as a base so
> I can't see an issue why we can't also remove sched_trace_rq_nr_running().
>

Our Perf team is now actively using this downstream, via sched_tp, and
finding it very useful.

But I have no problem if this is all simpler in the kernel tree.

> >> In case we would let the extra code (which transforms trace points into
> >> trace events) know the internals of struct rq we could handle those
> >> things in the TRACE_EVENT and/or the register_trace_##name(void
> >> (*probe)(data_proto), void *data) thing.
> >> We always said when the internal things will change this extra code will
> >> break. So that's not an issue.
> > 
> > The problem is that you need to export struct rq in a public header. Which 
> > we
> > don't want to do. I have been trying to find out how to use BTF so we can
> > remove these functions. Haven't gotten far away yet - but it should be 
> > doable
> > and it's a question of me finding enough time to understand what was 
> > currently
> > done and if I can re-use something or need to come up with extra 
> > infrastructure
> > first.
> 
> Let's keep the footprint of these trace points as small as possible in
> the scheduler code.
> 
> I'm putting the changes I described above in our monthly EAS integration
> right now and when this worked out nicely I will share the patches on lkml.
> 

Sounds good, thanks!


Cheers,
Phil

-- 



[tip: sched/urgent] sched: Fix use of count for nr_running tracepoint

2020-08-06 Thread tip-bot2 for Phil Auld
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: a1bd06853ee478d37fae9435c5521e301de94c67
Gitweb:
https://git.kernel.org/tip/a1bd06853ee478d37fae9435c5521e301de94c67
Author:Phil Auld 
AuthorDate:Wed, 05 Aug 2020 16:31:38 -04:00
Committer: Ingo Molnar 
CommitterDate: Thu, 06 Aug 2020 09:36:59 +02:00

sched: Fix use of count for nr_running tracepoint

The count field is meant to tell if an update to nr_running
is an add or a subtract. Make it do so by adding the missing
minus sign.

Fixes: 9d246053a691 ("sched: Add a tracepoint to track rq->nr_running")
Signed-off-by: Phil Auld 
Signed-off-by: Ingo Molnar 
Link: https://lore.kernel.org/r/20200805203138.1411-1-pa...@redhat.com
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd2838..28709f6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1999,7 +1999,7 @@ static inline void sub_nr_running(struct rq *rq, unsigned 
count)
 {
rq->nr_running -= count;
if (trace_sched_update_nr_running_tp_enabled()) {
-   call_trace_sched_update_nr_running(rq, count);
+   call_trace_sched_update_nr_running(rq, -count);
}
 
/* Check if we still need preemption */


[PATCH] sched: Fix use of count for nr_running tracepoint

2020-08-05 Thread Phil Auld
The count field is meant to tell if an update to nr_running
is an add or a subtract. Make it do so by adding the missing
minus sign.

Fixes: 9d246053a691 ("sched: Add a tracepoint to track rq->nr_running")
Signed-off-by: Phil Auld 
---
 kernel/sched/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3fd283892761..28709f6b0975 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1999,7 +1999,7 @@ static inline void sub_nr_running(struct rq *rq, unsigned 
count)
 {
rq->nr_running -= count;
if (trace_sched_update_nr_running_tp_enabled()) {
-   call_trace_sched_update_nr_running(rq, count);
+   call_trace_sched_update_nr_running(rq, -count);
}
 
/* Check if we still need preemption */
-- 
2.26.2



[tip: sched/core] sched: Add a tracepoint to track rq->nr_running

2020-07-09 Thread tip-bot2 for Phil Auld
The following commit has been merged into the sched/core branch of tip:

Commit-ID: 9d246053a69196c7c27068870e9b4b66ac536f68
Gitweb:
https://git.kernel.org/tip/9d246053a69196c7c27068870e9b4b66ac536f68
Author:Phil Auld 
AuthorDate:Mon, 29 Jun 2020 15:23:03 -04:00
Committer: Peter Zijlstra 
CommitterDate: Wed, 08 Jul 2020 11:39:02 +02:00

sched: Add a tracepoint to track rq->nr_running

Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map.  The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:

  https://github.com/auldp/tracepoints-helpers.git

A script to turn the trace-cmd report output into a heatmap plot can be
found at:

  https://github.com/jirvoz/plot-nr-running

The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in
the header a wrapper call is used and the trace/events/sched.h include
is moved before sched.h in kernel/sched/core.

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Link: 
https://lkml.kernel.org/r/20200629192303.gc120...@lorien.usersys.redhat.com
---
 include/linux/sched.h|  1 +
 include/trace/events/sched.h |  4 
 kernel/sched/core.c  | 13 +
 kernel/sched/fair.c  |  8 ++--
 kernel/sched/pelt.c  |  2 --
 kernel/sched/sched.h | 10 ++
 6 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6833729..12b10ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2044,6 +2044,7 @@ const struct sched_avg *sched_trace_rq_avg_dl(struct rq 
*rq);
 const struct sched_avg *sched_trace_rq_avg_irq(struct rq *rq);
 
 int sched_trace_rq_cpu(struct rq *rq);
+int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index 04f9a4c..0d5ff09 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -642,6 +642,10 @@ DECLARE_TRACE(sched_util_est_se_tp,
TP_PROTO(struct sched_entity *se),
TP_ARGS(se));
 
+DECLARE_TRACE(sched_update_nr_running_tp,
+   TP_PROTO(struct rq *rq, int change),
+   TP_ARGS(rq, change));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4cf30e4..ff05195 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,6 +6,10 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
+#include <trace/events/sched.h>
+#undef CREATE_TRACE_POINTS
+
 #include "sched.h"
 
 #include 
@@ -23,9 +27,6 @@
 #include "pelt.h"
 #include "smp.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -38,6 +39,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_cfs_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_util_est_se_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -8195,4 +8197,7 @@ const u32 sched_prio_to_wmult[40] = {
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
 
-#undef CREATE_TRACE_POINTS
+void call_trace_sched_update_nr_running(struct rq *rq, int count)
+{
+trace_sched_update_nr_running_tp(rq, count);
+}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6fab1d1..3213cb2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
@@ -11296,3 +11294,9 @@ const struct cpumask *sched_trace_rd_span(struct 
root_domain *rd)
 #endif
 }
 EXPORT_SYMBOL_GPL(sched_trace_rd_span);
+
+int sched_trace_rq_nr_running(struct rq *rq)
+{
+return rq ? rq->nr_running : -1;
+}
+EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 11bea3b..2c613e1 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -28,8 +28,6 @@
 #include "sched.h"
 #include "pelt.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Approximate:
  *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b1432f6..65b72e0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -76,6 +76,8 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#include <trace/events/sched.h>
+
 #ifdef 

Re: [RFC][PATCH] sched: Better document ttwu()

2020-07-02 Thread Phil Auld


Hi Peter,

On Thu, Jul 02, 2020 at 02:52:11PM +0200 Peter Zijlstra wrote:
> 
> Dave hit the problem fixed by commit:
> 
>   b6e13e85829f ("sched/core: Fix ttwu() race")
> 
> and failed to understand much of the code involved. Per his request a
> few comments to (hopefully) clarify things.
> 
> Requested-by: Dave Chinner 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  include/linux/sched.h |  12 ++--
>  kernel/sched/core.c   | 195 
> +++---
>  kernel/sched/sched.h  |  11 +++
>  3 files changed, 187 insertions(+), 31 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 9bd073a10224..ad36f70bef24 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -158,24 +158,24 @@ struct task_group;
>   *
>   *   for (;;) {
>   *   set_current_state(TASK_UNINTERRUPTIBLE);
> - *   if (!need_sleep)
> - *   break;
> + *   if (CONDITION)
> + *  break;
>   *
>   *   schedule();
>   *   }
>   *   __set_current_state(TASK_RUNNING);
>   *
>   * If the caller does not need such serialisation (because, for instance, the
> - * condition test and condition change and wakeup are under the same lock) 
> then
> + * CONDITION test and condition change and wakeup are under the same lock) 
> then
>   * use __set_current_state().
>   *
>   * The above is typically ordered against the wakeup, which does:
>   *
> - *   need_sleep = false;
> + *   CONDITION = 1;
>   *   wake_up_state(p, TASK_UNINTERRUPTIBLE);
>   *
> - * where wake_up_state() executes a full memory barrier before accessing the
> - * task state.
> + * where wake_up_state()/try_to_wake_up() executes a full memory barrier 
> before
> + * accessing p->state.
>   *
>   * Wakeup will do: if (@state & p->state) p->state = TASK_RUNNING, that is,
>   * once it observes the TASK_UNINTERRUPTIBLE store the waking CPU can issue a
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 1d3d2d67f398..0cd6c336029f 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -77,6 +77,97 @@ __read_mostly int scheduler_running;
>   */
>  int sysctl_sched_rt_runtime = 95;
>  
> +
> +/*
> + * Serialization rules:
> + *
> + * Lock order:
> + *
> + *   p->pi_lock
> + * rq->lock
> + *   hrtimer_cpu_base->lock (hrtimer_start() for bandwidth controls)
> + *
> + *  rq1->lock
> + *rq2->lock  where: rq1 < rq2
> + *
> + * Regular state:
> + *
> + * Normal scheduling state is serialized by rq->lock. __schedule() takes the
> + * local CPU's rq->lock, it optionally removes the task from the runqueue and
> + * always looks at the local rq data structures to find the most elegible 
> task
> + * to run next.
> + *
> + * Task enqueue is also under rq->lock, possibly taken from another CPU.
> + * Wakeups from another LLC domain might use an IPI to transfer the enqueue 
> to
> + * the local CPU to avoid bouncing the runqueue state around [ see
> + * ttwu_queue_wakelist() ]
> + *
> + * Task wakeup, specifically wakeups that involve migration, are horribly
> + * complicated to avoid having to take two rq->locks.
> + *
> + * Special state:
> + *
> + * System-calls and anything external will use task_rq_lock() which acquires
> + * both p->lock and rq->lock. As a consequence the state they change is 
> stable
> + * while holding either lock:
> + *
> + *  - sched_setaffinity():   p->cpus_ptr
> + *  - set_user_nice():   p->se.load, p->static_prio
> + *  - __sched_setscheduler():p->sched_class, p->policy, p->*prio, 
> p->se.load,
> + *   p->dl.dl_{runtime, deadline, period, flags, bw, 
> density}
> + *  - sched_setnuma():   p->numa_preferred_nid
> + *  - sched_move_task()/
> + *cpu_cgroup_fork(): p->sched_task_group
> + *
> + * p->state <- TASK_*:
> + *
> + *   is changed locklessly using set_current_state(), __set_current_state() 
> or
> + *   set_special_state(), see their respective comments, or by
> + *   try_to_wake_up(). This latter uses p->pi_lock to serialize against
> + *   concurrent self.
> + *
> + * p->on_rq <- { 0, 1 = TASK_ON_RQ_QUEUED, 2 = TASK_ON_RQ_MIGRATING }:
> + *
> + *   is set by activate_task() and cleared by deactivate_task(), under
> + *   rq->lock. Non-zero indicates the task is runnable, the special
> + *   ON_RQ_MIGRATING state is used for migration without holding both
> + *   rq->locks. It indicates task_cpu() is not stable, see task_rq_lock().
> + *
> + * p->on_cpu <- { 0, 1 }:
> + *
> + *   is set by prepare_task() and cleared by finish_task() such that it will 
> be
> + *   set before p is scheduled-in and cleared after p is scheduled-out, both
> + *   under rq->lock. Non-zero indicates the task is running on it's CPU.

s/it's/its/

> + *
> + *   [ The astute reader will observe that it is possible for two tasks on 
> one
> + * CPU to have ->on_cpu = 1 at the same time. ]
> + *
> + * task_cpu(p): is changed by set_task_cpu(), the rules are:
> + *
> + *  - Don't call set_task_cpu() 

Re: [RFC PATCH 00/13] Core scheduling v5

2020-06-30 Thread Phil Auld
On Fri, Jun 26, 2020 at 11:10:28AM -0400 Joel Fernandes wrote:
> On Fri, Jun 26, 2020 at 10:36:01AM -0400, Vineeth Remanan Pillai wrote:
> > On Thu, Jun 25, 2020 at 9:47 PM Joel Fernandes  
> > wrote:
> > >
> > > On Thu, Jun 25, 2020 at 4:12 PM Vineeth Remanan Pillai
> > >  wrote:
> > > [...]
> > > > TODO lists:
> > > >
> > > >  - Interface discussions could not come to a conclusion in v5 and hence 
> > > > would
> > > >like to restart the discussion and reach a consensus on it.
> > > >- 
> > > > https://lwn.net/ml/linux-kernel/20200520222642.70679-1-j...@joelfernandes.org
> > >
> > > Thanks Vineeth, just want to add: I have a revised implementation of
> > > prctl(2) where you only pass a TID of a task you'd like to share a
> > > core with (credit to Peter for the idea [1]) so we can make use of
> > > ptrace_may_access() checks. I am currently finishing writing of
> > > kselftests for this and post it all once it is ready.
> > >
> > Thinking more about it, using TID/PID for prctl(2) and internally
> > using a task identifier to identify coresched group may have
> > limitations. A coresched group can exist longer than the lifetime
> > of a task and then there is a chance for that identifier to be
> > reused by a newer task which may or maynot be a part of the same
> > coresched group.
> 
> True, for the prctl(2) tagging (a task wanting to share core with
> another) we will need some way of internally identifying groups which does
> not depend on any value that can be reused for another purpose.
>

That was my concern as well. That's why I was thinking it should be
an arbitrary, user/admin/orchestrator defined value and not be the
responsibility of the kernel at all.  However...


> [..]
> > What do you think about having a separate cgroup for coresched?
> > Both coresched cgroup and prctl() could co-exist where prctl could
> > be used to isolate individual process or task and coresched cgroup
> > to group trusted processes.
> 
> This sounds like a fine idea to me. I wonder how Tejun and Peter feel about
> having a new attribute-less CGroup controller for core-scheduling and just
> use that for tagging. (No need to even have a tag file, just adding/removing
> to/from CGroup will tag).
>

... this could be an interesting approach. Then the cookie could still
be the cgroup address as is and there would be no need for the prctl. At
least so it seems. 



Cheers,
Phil

> > > However a question: If using the prctl(2) on a CGroup tagged task, we
> > > discussed in previous threads [2] to override the CGroup cookie such
> > > that the task may not share a core with any of the tasks in its CGroup
> > > anymore and I think Peter and Phil are Ok with.  My question though is
> > > - would that not be confusing for anyone looking at the CGroup
> > > filesystem's "tag" and "tasks" files?
> > >
> > Having a dedicated cgroup for coresched could solve this problem
> > as well. "coresched.tasks" inside the cgroup hierarchy would list all
> > the taskx in the group and prctl can override this and take it out
> > of the group.
> 
> We don't even need coresched.tasks, just the existing 'tasks' of CGroups can
> be used.
> 
> > > To resolve this, I am proposing to add a new CGroup file
> > > 'tasks.coresched' to the CGroup, and this will only contain tasks that
> > > were assigned cookies due to their CGroup residency. As soon as one
> > > prctl(2)'s the task, it will stop showing up in the CGroup's
> > > "tasks.coresched" file (unless of course it was requesting to
> > > prctl-share a core with someone in its CGroup itself). Are folks Ok
> > > with this solution?
> > >
> > As I mentioned above, IMHO cpu cgroups should not be used to account
> > for core scheduling as well. Cpu cgroups serve a different purpose
> > and overloading it with core scheduling would not be flexible and
> > scalable. But if there is a consensus to move forward with cpu cgroups,
> > adding this new file seems to be okay with me.
> 
> Yes, this is the problem. Many people use CPU controller CGroups already for
> other purposes. In that case, tagging a CGroup would make all the entities in
> the group be able to share a core, which may not always make sense. May be a
> new CGroup controller is the answer (?).
> 
> thanks,
> 
>  - Joel
> 

-- 



Re: [PATCH v2] Sched: Add a tracepoint to track rq->nr_running

2020-06-29 Thread Phil Auld
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map.  The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:

  https://github.com/auldp/tracepoints-helpers.git

A script to turn the trace-cmd report output into a heatmap plot can be
found at:

  https://github.com/jirvoz/plot-nr-running

The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. In order to avoid CREATE_TRACE_POINTS in
the header a wrapper call is used and the trace/events/sched.h include
is moved before sched.h in kernel/sched/core.

Signed-off-by: Phil Auld 
CC: Qais Yousef 
CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Vincent Guittot 
CC: Steven Rostedt 
CC: linux-kernel@vger.kernel.org
---

V2: Fix use of tracepoint in header from Steven. Pass rq* and use
helper to get nr_running field, from Qais. 


 include/linux/sched.h|  1 +
 include/trace/events/sched.h |  4 
 kernel/sched/core.c  | 13 +
 kernel/sched/fair.c  |  8 ++--
 kernel/sched/pelt.c  |  2 --
 kernel/sched/sched.h | 10 ++
 6 files changed, 30 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4418f5cb8324..5f114faf2247 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -2015,6 +2015,7 @@ const struct sched_avg *sched_trace_rq_avg_dl(struct rq 
*rq);
 const struct sched_avg *sched_trace_rq_avg_irq(struct rq *rq);
 
 int sched_trace_rq_cpu(struct rq *rq);
+int sched_trace_rq_nr_running(struct rq *rq);
 
 const struct cpumask *sched_trace_rd_span(struct root_domain *rd);
 
diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ed168b0e2c53..8c72f9113694 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -634,6 +634,10 @@ DECLARE_TRACE(sched_overutilized_tp,
TP_PROTO(struct root_domain *rd, bool overutilized),
TP_ARGS(rd, overutilized));
 
+DECLARE_TRACE(sched_update_nr_running_tp,
+   TP_PROTO(struct rq *rq, int change),
+   TP_ARGS(rq, change));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..0d35d7c4c330 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,6 +6,10 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+#define CREATE_TRACE_POINTS
+#include <trace/events/sched.h>
+#undef CREATE_TRACE_POINTS
+
 #include "sched.h"
 
 #include 
@@ -21,9 +25,6 @@
 
 #include "pelt.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -34,6 +35,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_dl_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -7970,4 +7972,7 @@ const u32 sched_prio_to_wmult[40] = {
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
 
-#undef CREATE_TRACE_POINTS
+void call_trace_sched_update_nr_running(struct rq *rq, int count)
+{
+trace_sched_update_nr_running_tp(rq, count);
+}
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3e5b54715b..2e2f3f68e318 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
@@ -11293,3 +11291,9 @@ const struct cpumask *sched_trace_rd_span(struct 
root_domain *rd)
 #endif
 }
 EXPORT_SYMBOL_GPL(sched_trace_rd_span);
+
+int sched_trace_rq_nr_running(struct rq *rq)
+{
+return rq ? rq->nr_running : -1;
+}
+EXPORT_SYMBOL_GPL(sched_trace_rq_nr_running);
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index b647d04d9c8b..bb69a0ae8d6c 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -28,8 +28,6 @@
 #include "sched.h"
 #include "pelt.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Approximate:
  *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..e621eaa44474 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,8 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#include <trace/events/sched.h>
+
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)
 #else
@@ -96,6 +98,7 @@ extern atomic_long_t calc_load_tasks;
 extern void calc_global_load_tick(struct rq *this_rq);
 extern long calc_load_fold_active(struct rq *this_rq, long adjust)

Re: [PATCH] Sched: Add a tracepoint to track rq->nr_running

2020-06-23 Thread Phil Auld
Hi Qais,

On Mon, Jun 22, 2020 at 01:17:47PM +0100 Qais Yousef wrote:
> On 06/19/20 10:11, Phil Auld wrote:
> > Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
> > ->nr_running CPU's rq. This is used to accurately trace this data and
> > provide a visualization of scheduler imbalances in, for example, the
> > form of a heat map.  The tracepoint is accessed by loading an external
> > kernel module. An example module (forked from Qais' module and including
> > the pelt related tracepoints) can be found at:
> > 
> >   https://github.com/auldp/tracepoints-helpers.git
> > 
> > A script to turn the trace-cmd report output into a heatmap plot can be
> > found at:
> > 
> >   https://github.com/jirvoz/plot-nr-running
> > 
> > The tracepoints are added to add_nr_running() and sub_nr_running() which
> > are in kernel/sched/sched.h. Since sched.h includes trace/events/tlb.h
> > via mmu_context.h we had to limit when CREATE_TRACE_POINTS is defined.
> > 
> > Signed-off-by: Phil Auld 
> > CC: Qais Yousef 
> > CC: Ingo Molnar 
> > CC: Peter Zijlstra 
> > CC: Vincent Guittot 
> > CC: linux-kernel@vger.kernel.org
> > ---
> >  include/trace/events/sched.h |  4 
> >  kernel/sched/core.c  |  9 -
> >  kernel/sched/fair.c  |  2 --
> >  kernel/sched/pelt.c  |  2 --
> >  kernel/sched/sched.h | 12 
> >  5 files changed, 20 insertions(+), 9 deletions(-)
> > 
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index ed168b0e2c53..a6d9fe5a68cf 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -634,6 +634,10 @@ DECLARE_TRACE(sched_overutilized_tp,
> > TP_PROTO(struct root_domain *rd, bool overutilized),
> > TP_ARGS(rd, overutilized));
> >  
> > +DECLARE_TRACE(sched_update_nr_running_tp,
> > +   TP_PROTO(int cpu, int change, unsigned int nr_running),
> > +   TP_ARGS(cpu, change, nr_running));
> > +
> >  #endif /* _TRACE_SCHED_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..6f28fdff1d48 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6,7 +6,10 @@
> >   *
> >   *  Copyright (C) 1991-2002  Linus Torvalds
> >   */
> > +
> > +#define SCHED_CREATE_TRACE_POINTS
> >  #include "sched.h"
> > +#undef SCHED_CREATE_TRACE_POINTS
> >  
> >  #include 
> >  
> > @@ -21,9 +24,6 @@
> >  
> >  #include "pelt.h"
> >  
> > -#define CREATE_TRACE_POINTS
> > -#include <trace/events/sched.h>
> > -
> >  /*
> >   * Export tracepoints that act as a bare tracehook (ie: have no trace event
> >   * associated with them) to allow external modules to probe them.
> > @@ -34,6 +34,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_dl_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
> >  EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
> > +EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
> >  
> >  DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
> >  
> > @@ -7969,5 +7970,3 @@ const u32 sched_prio_to_wmult[40] = {
> >   /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
> >   /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
> >  };
> > -
> > -#undef CREATE_TRACE_POINTS
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index da3e5b54715b..fe5d9b6db8f7 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -22,8 +22,6 @@
> >   */
> >  #include "sched.h"
> >  
> > -#include <trace/events/sched.h>
> > -
> >  /*
> >   * Targeted preemption latency for CPU-bound tasks:
> >   *
> > diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> > index b647d04d9c8b..bb69a0ae8d6c 100644
> > --- a/kernel/sched/pelt.c
> > +++ b/kernel/sched/pelt.c
> > @@ -28,8 +28,6 @@
> >  #include "sched.h"
> >  #include "pelt.h"
> >  
> > -#include <trace/events/sched.h>
> > -
> >  /*
> >   * Approximate:
> >   *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index db3a57675ccf..6ae96679c169 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -75,6 +75,15 @@
> >  #include "cpupri.h"
> >  #include "cpudeadl

Re: [PATCH] Sched: Add a tracepoint to track rq->nr_running

2020-06-19 Thread Phil Auld
On Fri, Jun 19, 2020 at 12:46:41PM -0400 Steven Rostedt wrote:
> On Fri, 19 Jun 2020 10:11:20 -0400
> Phil Auld  wrote:
> 
> > 
> > diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
> > index ed168b0e2c53..a6d9fe5a68cf 100644
> > --- a/include/trace/events/sched.h
> > +++ b/include/trace/events/sched.h
> > @@ -634,6 +634,10 @@ DECLARE_TRACE(sched_overutilized_tp,
> > TP_PROTO(struct root_domain *rd, bool overutilized),
> > TP_ARGS(rd, overutilized));
> >  
> > +DECLARE_TRACE(sched_update_nr_running_tp,
> > +   TP_PROTO(int cpu, int change, unsigned int nr_running),
> > +   TP_ARGS(cpu, change, nr_running));
> > +
> >  #endif /* _TRACE_SCHED_H */
> >  
> >  /* This part must be outside protection */
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 9a2fbf98fd6f..6f28fdff1d48 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6,7 +6,10 @@
> >   *
> >   *  Copyright (C) 1991-2002  Linus Torvalds
> >   */
> > +
> > +#define SCHED_CREATE_TRACE_POINTS
> >  #include "sched.h"
> > +#undef SCHED_CREATE_TRACE_POINTS
> 
> Because of the macro magic, and really try not to have trace events
> defined in any headers. Otherwise, we have weird defines like you are
> doing, and it doesn't fully protect it if a C file adds this header and
> defines CREATE_TRACE_POINTS first.
> 
> 
> >  
> >  #include 
> >  
> > @@ -21,9 +24,6 @@
> >  
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -75,6 +75,15 @@
> >  #include "cpupri.h"
> >  #include "cpudeadline.h"
> >  
> > +#ifdef SCHED_CREATE_TRACE_POINTS
> > +#define CREATE_TRACE_POINTS
> > +#endif
> > +#include <trace/events/sched.h>
> > +
> > +#ifdef SCHED_CREATE_TRACE_POINTS
> > +#undef CREATE_TRACE_POINTS
> > +#endif
> > +
> >  #ifdef CONFIG_SCHED_DEBUG
> >  # define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)
> >  #else
> > @@ -1959,6 +1968,7 @@ static inline void add_nr_running(struct rq *rq, 
> > unsigned count)
> > unsigned prev_nr = rq->nr_running;
> >  
> > rq->nr_running = prev_nr + count;
> > +   trace_sched_update_nr_running_tp(cpu_of(rq), count, rq->nr_running);
> 
> Instead of having sched.h define CREATE_TRACE_POINTS, I would have the
> following:
> 
>   if (trace_sched_update_nr_running_tp_enabled()) {
>   call_trace_sched_update_nr_runnig(rq, count);
>   }
> 
> Then in sched/core.c:
> 
> void trace_sched_update_nr_running(struct rq *rq, int count)
> {
>   trace_sched_update_nr_running_tp(cpu_of(rq), count, rq->nr_running);
> }
> 
> The trace_*_enabled() above uses static branches, where the if turns to
> a nop (pass through) when disabled and a jmp when enabled (same logic
> that trace points use themselves).
> 
> Then you don't need this macro dance, and risk having another C file
> define CREATE_TRACE_POINTS and spend hours debugging why it suddenly
> broke.
>

Awesome, thanks Steve. I was really hoping there was a better way to do
that. I'll try it this way.


Cheers,
Phil

> -- Steve
> 

-- 



[PATCH] Sched: Add a tracepoint to track rq->nr_running

2020-06-19 Thread Phil Auld
Add a bare tracepoint trace_sched_update_nr_running_tp which tracks
->nr_running CPU's rq. This is used to accurately trace this data and
provide a visualization of scheduler imbalances in, for example, the
form of a heat map.  The tracepoint is accessed by loading an external
kernel module. An example module (forked from Qais' module and including
the pelt related tracepoints) can be found at:

  https://github.com/auldp/tracepoints-helpers.git

A script to turn the trace-cmd report output into a heatmap plot can be
found at:

  https://github.com/jirvoz/plot-nr-running

The tracepoints are added to add_nr_running() and sub_nr_running() which
are in kernel/sched/sched.h. Since sched.h includes trace/events/tlb.h
via mmu_context.h we had to limit when CREATE_TRACE_POINTS is defined.

Signed-off-by: Phil Auld 
CC: Qais Yousef 
CC: Ingo Molnar 
CC: Peter Zijlstra 
CC: Vincent Guittot 
CC: linux-kernel@vger.kernel.org
---
 include/trace/events/sched.h |  4 
 kernel/sched/core.c  |  9 -
 kernel/sched/fair.c  |  2 --
 kernel/sched/pelt.c  |  2 --
 kernel/sched/sched.h | 12 
 5 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/include/trace/events/sched.h b/include/trace/events/sched.h
index ed168b0e2c53..a6d9fe5a68cf 100644
--- a/include/trace/events/sched.h
+++ b/include/trace/events/sched.h
@@ -634,6 +634,10 @@ DECLARE_TRACE(sched_overutilized_tp,
TP_PROTO(struct root_domain *rd, bool overutilized),
TP_ARGS(rd, overutilized));
 
+DECLARE_TRACE(sched_update_nr_running_tp,
+   TP_PROTO(int cpu, int change, unsigned int nr_running),
+   TP_ARGS(cpu, change, nr_running));
+
 #endif /* _TRACE_SCHED_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 9a2fbf98fd6f..6f28fdff1d48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6,7 +6,10 @@
  *
  *  Copyright (C) 1991-2002  Linus Torvalds
  */
+
+#define SCHED_CREATE_TRACE_POINTS
 #include "sched.h"
+#undef SCHED_CREATE_TRACE_POINTS
 
 #include 
 
@@ -21,9 +24,6 @@
 
 #include "pelt.h"
 
-#define CREATE_TRACE_POINTS
-#include <trace/events/sched.h>
-
 /*
  * Export tracepoints that act as a bare tracehook (ie: have no trace event
  * associated with them) to allow external modules to probe them.
@@ -34,6 +34,7 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_dl_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_irq_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(pelt_se_tp);
 EXPORT_TRACEPOINT_SYMBOL_GPL(sched_overutilized_tp);
+EXPORT_TRACEPOINT_SYMBOL_GPL(sched_update_nr_running_tp);
 
 DEFINE_PER_CPU_SHARED_ALIGNED(struct rq, runqueues);
 
@@ -7969,5 +7970,3 @@ const u32 sched_prio_to_wmult[40] = {
  /*  10 */  39045157,  49367440,  61356676,  76695844,  95443717,
  /*  15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
 };
-
-#undef CREATE_TRACE_POINTS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index da3e5b54715b..fe5d9b6db8f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -22,8 +22,6 @@
  */
 #include "sched.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Targeted preemption latency for CPU-bound tasks:
  *
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index b647d04d9c8b..bb69a0ae8d6c 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -28,8 +28,6 @@
 #include "sched.h"
 #include "pelt.h"
 
-#include <trace/events/sched.h>
-
 /*
  * Approximate:
  *   val * y^n,where y^32 ~= 0.5 (~1 scheduling period)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index db3a57675ccf..6ae96679c169 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -75,6 +75,15 @@
 #include "cpupri.h"
 #include "cpudeadline.h"
 
+#ifdef SCHED_CREATE_TRACE_POINTS
+#define CREATE_TRACE_POINTS
+#endif
+#include <trace/events/sched.h>
+
+#ifdef SCHED_CREATE_TRACE_POINTS
+#undef CREATE_TRACE_POINTS
+#endif
+
 #ifdef CONFIG_SCHED_DEBUG
 # define SCHED_WARN_ON(x)  WARN_ONCE(x, #x)
 #else
@@ -1959,6 +1968,7 @@ static inline void add_nr_running(struct rq *rq, unsigned 
count)
unsigned prev_nr = rq->nr_running;
 
rq->nr_running = prev_nr + count;
+   trace_sched_update_nr_running_tp(cpu_of(rq), count, rq->nr_running);
 
 #ifdef CONFIG_SMP
if (prev_nr < 2 && rq->nr_running >= 2) {
@@ -1973,6 +1983,8 @@ static inline void add_nr_running(struct rq *rq, unsigned 
count)
 static inline void sub_nr_running(struct rq *rq, unsigned count)
 {
rq->nr_running -= count;
+   trace_sched_update_nr_running_tp(cpu_of(rq), -count, rq->nr_running);
+
/* Check if we still need preemption */
sched_update_tick_dependency(rq);
 }
-- 
2.18.0



Re: [tip: sched/core] sched/fair: Remove distribute_running from CFS bandwidth

2020-06-08 Thread Phil Auld


On Tue, Jun 09, 2020 at 07:05:38AM +0800 Tao Zhou wrote:
> Hi Phil,
> 
> On Mon, Jun 08, 2020 at 10:53:04AM -0400, Phil Auld wrote:
> > On Sun, Jun 07, 2020 at 09:25:58AM +0800 Tao Zhou wrote:
> > > Hi,
> > > 
> > > On Fri, May 01, 2020 at 06:22:12PM -, tip-bot2 for Josh Don wrote:
> > > > The following commit has been merged into the sched/core branch of tip:
> > > > 
> > > > Commit-ID: ab93a4bc955b3980c699430bc0b633f0d8b607be
> > > > Gitweb:
> > > > https://git.kernel.org/tip/ab93a4bc955b3980c699430bc0b633f0d8b607be
> > > > Author:Josh Don 
> > > > AuthorDate:Fri, 10 Apr 2020 15:52:08 -07:00
> > > > Committer: Peter Zijlstra 
> > > > CommitterDate: Thu, 30 Apr 2020 20:14:38 +02:00
> > > > 
> > > > sched/fair: Remove distribute_running from CFS bandwidth
> > > > 
> > > > This is mostly a revert of commit:
> > > > 
> > > >   baa9be4ffb55 ("sched/fair: Fix throttle_list starvation with low CFS 
> > > > quota")
> > > > 
> > > > The primary use of distribute_running was to determine whether to add
> > > > throttled entities to the head or the tail of the throttled list. Now
> > > > that we always add to the tail, we can remove this field.
> > > > 
> > > > The other use of distribute_running is in the slack_timer, so that we
> > > > don't start a distribution while one is already running. However, even
> > > > in the event that this race occurs, it is fine to have two distributions
> > > > running (especially now that distribute grabs the cfs_b->lock to
> > > > determine remaining quota before assigning).
> > > > 
> > > > Signed-off-by: Josh Don 
> > > > Signed-off-by: Peter Zijlstra (Intel) 
> > > > Reviewed-by: Phil Auld 
> > > > Tested-by: Phil Auld 
> > > > Link: 
> > > > https://lkml.kernel.org/r/20200410225208.109717-3-josh...@google.com
> > > > ---
> > > >  kernel/sched/fair.c  | 13 +
> > > >  kernel/sched/sched.h |  1 -
> > > >  2 files changed, 1 insertion(+), 13 deletions(-)
> > > > 
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index 0c13a41..3d6ce75 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -4931,14 +4931,12 @@ static int do_sched_cfs_period_timer(struct 
> > > > cfs_bandwidth *cfs_b, int overrun, u
> > > > /*
> > > >  * This check is repeated as we release cfs_b->lock while we 
> > > > unthrottle.
> > > >  */
> > > > -   while (throttled && cfs_b->runtime > 0 && 
> > > > !cfs_b->distribute_running) {
> > > > -   cfs_b->distribute_running = 1;
> > > > +   while (throttled && cfs_b->runtime > 0) {
> > > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > > > /* we can't nest cfs_b->lock while distributing 
> > > > bandwidth */
> > > > distribute_cfs_runtime(cfs_b);
> > > > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > > >  
> > > > -   cfs_b->distribute_running = 0;
> > > > throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> > > > }
> > > >  
> > > > @@ -5052,10 +5050,6 @@ static void do_sched_cfs_slack_timer(struct 
> > > > cfs_bandwidth *cfs_b)
> > > > /* confirm we're still not at a refresh boundary */
> > > > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > > > cfs_b->slack_started = false;
> > > > -   if (cfs_b->distribute_running) {
> > > > -   raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > > > -   return;
> > > > -   }
> > > >  
> > > > if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
> > > > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > > > @@ -5065,9 +5059,6 @@ static void do_sched_cfs_slack_timer(struct 
> > > > cfs_bandwidth *cfs_b)
> > > > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> > > > runtime = cfs_b->runtime;
> > > >  
>

Re: [tip: sched/core] sched/fair: Remove distribute_running from CFS bandwidth

2020-06-08 Thread Phil Auld
On Sun, Jun 07, 2020 at 09:25:58AM +0800 Tao Zhou wrote:
> Hi,
> 
> On Fri, May 01, 2020 at 06:22:12PM -, tip-bot2 for Josh Don wrote:
> > The following commit has been merged into the sched/core branch of tip:
> > 
> > Commit-ID: ab93a4bc955b3980c699430bc0b633f0d8b607be
> > Gitweb:
> > https://git.kernel.org/tip/ab93a4bc955b3980c699430bc0b633f0d8b607be
> > Author:Josh Don 
> > AuthorDate:Fri, 10 Apr 2020 15:52:08 -07:00
> > Committer: Peter Zijlstra 
> > CommitterDate: Thu, 30 Apr 2020 20:14:38 +02:00
> > 
> > sched/fair: Remove distribute_running from CFS bandwidth
> > 
> > This is mostly a revert of commit:
> > 
> >   baa9be4ffb55 ("sched/fair: Fix throttle_list starvation with low CFS 
> > quota")
> > 
> > The primary use of distribute_running was to determine whether to add
> > throttled entities to the head or the tail of the throttled list. Now
> > that we always add to the tail, we can remove this field.
> > 
> > The other use of distribute_running is in the slack_timer, so that we
> > don't start a distribution while one is already running. However, even
> > in the event that this race occurs, it is fine to have two distributions
> > running (especially now that distribute grabs the cfs_b->lock to
> > determine remaining quota before assigning).
> > 
> > Signed-off-by: Josh Don 
> > Signed-off-by: Peter Zijlstra (Intel) 
> > Reviewed-by: Phil Auld 
> > Tested-by: Phil Auld 
> > Link: https://lkml.kernel.org/r/20200410225208.109717-3-josh...@google.com
> > ---
> >  kernel/sched/fair.c  | 13 +
> >  kernel/sched/sched.h |  1 -
> >  2 files changed, 1 insertion(+), 13 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0c13a41..3d6ce75 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4931,14 +4931,12 @@ static int do_sched_cfs_period_timer(struct 
> > cfs_bandwidth *cfs_b, int overrun, u
> > /*
> >  * This check is repeated as we release cfs_b->lock while we unthrottle.
> >  */
> > -   while (throttled && cfs_b->runtime > 0 && !cfs_b->distribute_running) {
> > -   cfs_b->distribute_running = 1;
> > +   while (throttled && cfs_b->runtime > 0) {
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > /* we can't nest cfs_b->lock while distributing bandwidth */
> > distribute_cfs_runtime(cfs_b);
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> >  
> > -   cfs_b->distribute_running = 0;
> > throttled = !list_empty(&cfs_b->throttled_cfs_rq);
> > }
> >  
> > @@ -5052,10 +5050,6 @@ static void do_sched_cfs_slack_timer(struct 
> > cfs_bandwidth *cfs_b)
> > /* confirm we're still not at a refresh boundary */
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > cfs_b->slack_started = false;
> > -   if (cfs_b->distribute_running) {
> > -   raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > -   return;
> > -   }
> >  
> > if (runtime_refresh_within(cfs_b, min_bandwidth_expiration)) {
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> > @@ -5065,9 +5059,6 @@ static void do_sched_cfs_slack_timer(struct 
> > cfs_bandwidth *cfs_b)
> > if (cfs_b->quota != RUNTIME_INF && cfs_b->runtime > slice)
> > runtime = cfs_b->runtime;
> >  
> > -   if (runtime)
> > -   cfs_b->distribute_running = 1;
> > -
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >  
> > if (!runtime)
> > @@ -5076,7 +5067,6 @@ static void do_sched_cfs_slack_timer(struct 
> > cfs_bandwidth *cfs_b)
> > distribute_cfs_runtime(cfs_b);
> >  
> > raw_spin_lock_irqsave(&cfs_b->lock, flags);
> > -   cfs_b->distribute_running = 0;
> > raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
> >  }
> 
> When I read the tip code, I found nothing between above lock/unlock.
> This commit removed distribute_running. Is there any reason to remain
> that lock/unlock there ? I feel that it is not necessary now, no ?
>

Yeah, that looks pretty useless :)

Do you want to spin up a patch?
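
Something like the below is what I'd expect (untested sketch, hunk numbers
trimmed). "flags" is still used earlier in the function, so nothing else
should need to change:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ ... @@ static void do_sched_cfs_slack_timer(struct cfs_bandwidth *cfs_b)
 	distribute_cfs_runtime(cfs_b);
-
-	raw_spin_lock_irqsave(&cfs_b->lock, flags);
-	raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
 }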


Cheers,
Phil


> Thanks
> 
> > @@ -5218,7 +5208,6 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> > cfs_b->period_timer.function = sched_cfs_period_timer;
> > hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
> > cfs_b->slack_timer.function = sched_cfs_slack_timer;
> > -   cfs_b->distribute_running = 0;
> > cfs_b->slack_started = false;
> >  }
> >  
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index db3a576..7198683 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -349,7 +349,6 @@ struct cfs_bandwidth {
> >  
> > u8  idle;
> > u8  period_active;
> > -   u8  distribute_running;
> > u8  slack_started;
> > struct hrtimer  period_timer;
> > struct hrtimer  slack_timer;
> 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Phil Auld
On Thu, May 28, 2020 at 02:17:19PM -0400 Phil Auld wrote:
> On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> > On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > > [..]
> > > > > > > It doens't allow tasks for form their own groups (by for example 
> > > > > > > setting
> > > > > > > the key to that of another task).
> > > > > > 
> > > > > > So for this, I was thinking of making the prctl pass in an integer. 
> > > > > > And 0
> > > > > > would mean untagged. Does that sound good to you?
> > > > > 
> > > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > > not-sharing. If you tag yourself with another tasks's TID, you can do
> > > > > ptrace tests to see if you're allowed to observe their junk.
> > > > 
> > > > But that would require a bunch of tasks agreeing on which TID to tag 
> > > > with.
> > > > For example, if 2 tasks tag with each other's TID, then they would have
> > > > different tags and not share.
> > 
> > Well, don't do that then ;-)
> >
> 
> That was a poorly worded example :)
>

Heh, sorry, I thought that was my statement. I do not mean to belittle Joel's
example...  That's a fine example of a totally different problem than I
was thinking of :)


Cheers,
Phil

> The point I was trying to make was more that one TID of a group (not cgroup!)
> of tasks is just an arbitrary value.
> 
> At a single process (or pair rather) level, sure, you can see it as an
> identifier of whom you want to share with, but even then you have to tag
> both processes with this. And it has less meaning when those you want to
> share with are multiple tasks.
> 
> > > > What's wrong with passing in an integer instead? In any case, we would 
> > > > do the
> > > > CAP_SYS_ADMIN check to limit who can do it.
> > 
> > So the actual permission model can be different depending on how broken
> > the hardware is.
> > 
> > > > Also, one thing CGroup interface allows is an external process to set 
> > > > the
> > > > cookie, so I am wondering if we should use sched_setattr(2) instead of, 
> > > > or in
> > > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > > completely. How do you feel about that?
> > > >
> > > 
> > > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > > any potential reuse security issues.
> > > 
> > > I think the cgroup interface could be extended not to be a boolean but 
> > > take
> > > the value. With 0 being untagged as now.
> > 
> > How do you avoid reuse in such a huge space? That just creates yet
> > another problem for the kernel to keep track of who is who.
> >
> 
> The kernel doesn't care or have to track anything.  The admin does.
> At the kernel level it's just matching cookies. 
> 
> Tasks A, B and C can all share a core, so you give them each A's TID as a cookie.
> Task A then exits. Now B and C are using essentially a random value.
> Task D comes along and wants to share with B and C. You have to tag it
> with A's old TID, which has no meaning at this point.
> 
> And if A's TID ever gets reused, the new A' gets to share too. At some
> level aren't those still 32 bits?
> 
> > With random u64 numbers, it even becomes hard to determine if you're
> > sharing at all or not.
> > 
> > Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> > allows leaking kernel privates. But under a less severe thread scenario,
> > say where only user data would be at risk, the ptrace() tests make
> > sense, but those become really hard with random u64 numbers too.
> > 
> > What would the purpose of random u64 values be for cgroups? That only
> > replicates the problem of determining uniqueness there. Then you can get
> > two cgroups unintentionally sharing because you got lucky.
> >
> 
> Seems that would be more flexible for the admin. 
> 
> What if you had two cgroups you wanted to allow to run together?  Or a
> cgroup and a few processes from a different one (say with different
> quotas or something).
> 
> I don't have such use cases so I don't feel that strongly but it seemed
> more flexible and followed the mechanism-in-kernel/policy-in-userspace
> dictum rather than basing the functionality on the implementation details.
> 
> 
> Cheers,
> Phil
> 
> 
> > Also, fundamentally, we cannot have more threads than TID space, it's a
> > natural identifier.
> > 
> 
> -- 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-28 Thread Phil Auld
On Thu, May 28, 2020 at 07:01:28PM +0200 Peter Zijlstra wrote:
> On Sun, May 24, 2020 at 10:00:46AM -0400, Phil Auld wrote:
> > On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> > > On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> > > [..]
> > > > > > It doens't allow tasks for form their own groups (by for example 
> > > > > > setting
> > > > > > the key to that of another task).
> > > > > 
> > > > > So for this, I was thinking of making the prctl pass in an integer. 
> > > > > And 0
> > > > > would mean untagged. Does that sound good to you?
> > > > 
> > > > A TID, I think. If you pass your own TID, you tag yourself as
> > > > not-sharing. If you tag yourself with another tasks's TID, you can do
> > > > ptrace tests to see if you're allowed to observe their junk.
> > > 
> > > But that would require a bunch of tasks agreeing on which TID to tag with.
> > > For example, if 2 tasks tag with each other's TID, then they would have
> > > different tags and not share.
> 
> Well, don't do that then ;-)
>

That was a poorly worded example :)

The point I was trying to make was more that one TID of a group (not cgroup!)
of tasks is just an arbitrary value.

At a single process (or pair rather) level, sure, you can see it as an
identifier of whom you want to share with, but even then you have to tag
both processes with this. And it has less meaning when those you want to
share with are multiple tasks.

> > > What's wrong with passing in an integer instead? In any case, we would do 
> > > the
> > > CAP_SYS_ADMIN check to limit who can do it.
> 
> So the actual permission model can be different depending on how broken
> the hardware is.
> 
> > > Also, one thing CGroup interface allows is an external process to set the
> > > cookie, so I am wondering if we should use sched_setattr(2) instead of, 
> > > or in
> > > addition to, the prctl(2). That way, we can drop the CGroup interface
> > > completely. How do you feel about that?
> > >
> > 
> > I think it should be an arbitrary 64bit value, in both interfaces to avoid
> > any potential reuse security issues.
> > 
> > I think the cgroup interface could be extended not to be a boolean but take
> > the value. With 0 being untagged as now.
> 
> How do you avoid reuse in such a huge space? That just creates yet
> another problem for the kernel to keep track of who is who.
>

The kernel doesn't care or have to track anything.  The admin does.
At the kernel level it's just matching cookies. 

Tasks A, B and C can all share a core, so you give them each A's TID as a cookie.
Task A then exits. Now B and C are using essentially a random value.
Task D comes along and wants to share with B and C. You have to tag it
with A's old TID, which has no meaning at this point.

And if A's TID ever gets reused, the new A' gets to share too. At some
level aren't those still 32 bits?
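
To be clear, by "matching cookies" I just mean an equality test on an
opaque per-task value. Purely illustrative (the helper name is made up):

	#include <linux/types.h>

	/* Illustrative only: two tasks may share a core iff their cookies
	 * compare equal; 0 means untagged. */
	static inline bool may_share_core(unsigned long a, unsigned long b)
	{
		return a == b;
	}

The kernel never needs to know where the values came from or track who
owns them.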

> With random u64 numbers, it even becomes hard to determine if you're
> sharing at all or not.
> 
> Now, with the current SMT+MDS trainwreck, any sharing is bad because it
> allows leaking kernel privates. But under a less severe thread scenario,
> say where only user data would be at risk, the ptrace() tests make
> sense, but those become really hard with random u64 numbers too.
> 
> What would the purpose of random u64 values be for cgroups? That only
> replicates the problem of determining uniqueness there. Then you can get
> two cgroups unintentionally sharing because you got lucky.
>

Seems that would be more flexible for the admin. 

What if you had two cgroups you wanted to allow to run together?  Or a
cgroup and a few processes from a different one (say with different
quotas or something).

I don't have such use cases so I don't feel that strongly but it seemed
more flexible and followed the mechanism-in-kernel/policy-in-userspace
dictum rather than basing the functionality on the implementation details.


Cheers,
Phil


> Also, fundamentally, we cannot have more threads than TID space, it's a
> natural identifier.
> 

-- 



Re: [PATCH RFC] sched: Add a per-thread core scheduling interface

2020-05-24 Thread Phil Auld
On Fri, May 22, 2020 at 05:35:24PM -0400 Joel Fernandes wrote:
> On Fri, May 22, 2020 at 02:59:05PM +0200, Peter Zijlstra wrote:
> [..]
> > > > It doens't allow tasks for form their own groups (by for example setting
> > > > the key to that of another task).
> > > 
> > > So for this, I was thinking of making the prctl pass in an integer. And 0
> > > would mean untagged. Does that sound good to you?
> > 
> > A TID, I think. If you pass your own TID, you tag yourself as
> > not-sharing. If you tag yourself with another tasks's TID, you can do
> > ptrace tests to see if you're allowed to observe their junk.
> 
> But that would require a bunch of tasks agreeing on which TID to tag with.
> For example, if 2 tasks tag with each other's TID, then they would have
> different tags and not share.
> 
> What's wrong with passing in an integer instead? In any case, we would do the
> CAP_SYS_ADMIN check to limit who can do it.
> 
> Also, one thing CGroup interface allows is an external process to set the
> cookie, so I am wondering if we should use sched_setattr(2) instead of, or in
> addition to, the prctl(2). That way, we can drop the CGroup interface
> completely. How do you feel about that?
>

I think it should be an arbitrary 64bit value, in both interfaces to avoid
any potential reuse security issues. 

I think the cgroup interface could be extended not to be a boolean but take
the value. With 0 being untagged as now.

And sched_setattr could be used to set it on a per task basis.


> > > > It is also horribly ill defined what it means to 'enable', with whoem
> > > > is it allows to share a core.
> > > 
> > > I couldn't parse this. Do you mean "enabling coresched does not make 
> > > sense if
> > > we don't specify whom to share the core with?"
> > 
> > As a corrolary yes. I mostly meant that a blanket 'enable' doesn't
> > specify a 'who' you're sharing your bits with.
> 
> Yes, ok. I can reword the commit log a bit to make it more clear that we are
> specifying who we can share a core with.
> 
> > > I was just trying to respect the functionality of the CGroup patch in the
> > > coresched series, after all a gentleman named Peter Zijlstra wrote that
> > > patch ;-) ;-).
> > 
> > Yeah, but I think that same guy said that that was a shit interface and
> > only hacked up because it was easy :-)
> 
> Fair enough :-)
> 
> > > More seriously, the reason I did it this way is the prctl-tagging is a bit
> > > incompatible with CGroup tagging:
> > > 
> > > 1. What happens if 2 tasks are in a tagged CGroup and one of them changes
> > > their cookie through prctl? Do they still remain in the tagged CGroup but 
> > > are
> > > now going to not trust each other? Do they get removed from the CGroup? 
> > > This
> > > is why I made the prctl fail with -EBUSY in such cases.
> > > 
> > > 2. What happens if 2 tagged tasks with different cookies are added to a
> > > tagged CGroup? Do we fail the addition of the tasks to the group, or do we
> > > override their cookie (like I'm doing)?
> > 
> > For #2 I think I prefer failure.
> > 
> > But having the rationale spelled out in documentation (man-pages for
> > example) is important.
> 
> If we drop the CGroup interface, this would avoid both #1 and #2.
>

I believe both are useful.  Personally, I think the per-task setting should
win over the cgroup tagging. In that case #1 just falls out, and #2 pretty
much does as well. Nothing would happen to the tagged tasks as they were added
to the cgroup: they'd keep their explicitly assigned tags and everything
should "just work". There are other reasons to be in a cpu cgroup together
than just the core scheduling tag.
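To illustrate the precedence I have in mind, the effective cookie lookup could
look something like the sketch below. This is only an illustration: the field
and helper names (p->core_cookie, task_group_core_cookie()) are stand-ins I
made up for this example, not the actual names used in the coresched series.

/* Per-task tag, if set, wins; otherwise fall back to the cgroup tag. */
static unsigned long effective_core_cookie(struct task_struct *p)
{
	if (p->core_cookie)			/* explicit per-task tag */
		return p->core_cookie;

	return task_group_core_cookie(p);	/* cgroup-assigned tag, 0 if none */
}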

There are a few other edge cases, like if you are in a cgroup but have
been tagged explicitly with sched_setattr and then get untagged (presumably
by setting 0): do you get the cgroup tag or just stay untagged? I think, based
on per-task winning, you'd stay untagged. I suppose you could move out of and
back into the cgroup to get the tag reapplied (or maybe the cgroup interface
could just be reused with the same value to re-tag everyone who's untagged).



Cheers,
Phil


> thanks,
> 
>  - Joel
> 

-- 



[tip: sched/urgent] sched/fair: Fix enqueue_task_fair() warning some more

2020-05-19 Thread tip-bot2 for Phil Auld
The following commit has been merged into the sched/urgent branch of tip:

Commit-ID: b34cb07dde7c2346dec73d053ce926aeaa087303
Gitweb:
https://git.kernel.org/tip/b34cb07dde7c2346dec73d053ce926aeaa087303
Author:        Phil Auld 
AuthorDate:    Tue, 12 May 2020 09:52:22 -04:00
Committer: Peter Zijlstra 
CommitterDate: Tue, 19 May 2020 20:34:10 +02:00

sched/fair: Fix enqueue_task_fair() warning some more

sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.

Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.

Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning")
Suggested-by: Vincent Guittot 
Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Dietmar Eggemann 
Reviewed-by: Vincent Guittot 
Link: https://lkml.kernel.org/r/20200512135222.gc2...@lorien.usersys.redhat.com
---
 kernel/sched/fair.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b..c6d57c3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5479,6 +5479,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
+
+   /*
+* One parent has been throttled and cfs_rq removed from the
+* list. Add it back to not break the leaf list.
+*/
+   if (throttled_hierarchy(cfs_rq))
+   list_add_leaf_cfs_rq(cfs_rq);
}
 
 enqueue_throttle:


Re: [PATCH v2] sched/fair: enqueue_task_fair optimization

2020-05-13 Thread Phil Auld
On Wed, May 13, 2020 at 03:25:29PM +0200 Vincent Guittot wrote:
> On Wed, 13 May 2020 at 15:18, Phil Auld  wrote:
> >
> > On Wed, May 13, 2020 at 03:15:53PM +0200 Vincent Guittot wrote:
> > > On Wed, 13 May 2020 at 15:13, Phil Auld  wrote:
> > > >
> > > > On Wed, May 13, 2020 at 03:10:28PM +0200 Vincent Guittot wrote:
> > > > > On Wed, 13 May 2020 at 14:45, Phil Auld  wrote:
> > > > > >
> > > > > > Hi Vincent,
> > > > > >
> > > > > > On Wed, May 13, 2020 at 02:33:35PM +0200 Vincent Guittot wrote:
> > > > > > > enqueue_task_fair jumps to enqueue_throttle label when 
> > > > > > > cfs_rq_of(se) is
> > > > > > > throttled which means that se can't be NULL and we can skip the 
> > > > > > > test.
> > > > > > >
> > > > > >
> > > > > > s/be NULL/be non-NULL/
> > > > > >
> > > > > > I think.
> > > > >
> > > > > This sentence refers to the move of enqueue_throttle and the fact that
> > > > > se can't be null when goto enqueue_throttle and we can jump directly
> > > > > after the if statement, which is now removed in v2 because se is
> > > > > always NULL if we don't use goto enqueue_throttle.
> > > > >
> > > > > I haven't change the commit message for the remove of if statement
> > > > >
> > > >
> > > > Fair enough, it just seems backwards from the intent of the patch now.
> > > >
> > > > There is also an extra }  after the update_overutilized_status.
> > >
> > > don't know what I did but it's crap.  sorry about that
> > >
> >
> > No worries. I didn't see it when I read it either. The compiler told me :)
> 
> Yeah, but i thought that i compiled it which is obviously not true
>

It's that "obviously" correct stuff that bites you every time ;)



> >
> >
> > > Let me prepare a v3
> > >
> > > >
> > > >
> > > > Cheers,
> > > > Phil
> > > >
> > > >
> > > >
> > > > > >
> > > > > > It's more like if it doesn't jump to the label then se must be NULL 
> > > > > > for
> > > > > > the loop to terminate.  The final loop is a NOP if se is NULL. The 
> > > > > > check
> > > > > > wasn't protecting that.
> > > > > >
> > > > > > Otherwise still
> > > > > >
> > > > > > > Reviewed-by: Phil Auld 
> > > > > >
> > > > > > Cheers,
> > > > > > Phil
> > > > > >
> > > > > >
> > > > > > > Signed-off-by: Vincent Guittot 
> > > > > > > ---
> > > > > > >
> > > > > > > v2 changes:
> > > > > > > - Remove useless if statement
> > > > > > >
> > > > > > >  kernel/sched/fair.c | 39 ---
> > > > > > >  1 file changed, 20 insertions(+), 19 deletions(-)
> > > > > > >
> > > > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > > > index a0c690d57430..b51b12d63c39 100644
> > > > > > > --- a/kernel/sched/fair.c
> > > > > > > +++ b/kernel/sched/fair.c
> > > > > > > @@ -5513,28 +5513,29 @@ enqueue_task_fair(struct rq *rq, struct 
> > > > > > > task_struct *p, int flags)
> > > > > > > list_add_leaf_cfs_rq(cfs_rq);
> > > > > > >   }
> > > > > > >
> > > > > > > -enqueue_throttle:
> > > > > > > - if (!se) {
> > > > > > > - add_nr_running(rq, 1);
> > > > > > > - /*
> > > > > > > -  * Since new tasks are assigned an initial util_avg 
> > > > > > > equal to
> > > > > > > -  * half of the spare capacity of their CPU, tiny 
> > > > > > > tasks have the
> > > > > > > -  * ability to cross the overutilized threshold, 
> > > > > > > which will
> > > > > > > -  * result in the load balancer

Re: [PATCH v2] sched/fair: enqueue_task_fair optimization

2020-05-13 Thread Phil Auld
On Wed, May 13, 2020 at 03:15:53PM +0200 Vincent Guittot wrote:
> On Wed, 13 May 2020 at 15:13, Phil Auld  wrote:
> >
> > On Wed, May 13, 2020 at 03:10:28PM +0200 Vincent Guittot wrote:
> > > On Wed, 13 May 2020 at 14:45, Phil Auld  wrote:
> > > >
> > > > Hi Vincent,
> > > >
> > > > On Wed, May 13, 2020 at 02:33:35PM +0200 Vincent Guittot wrote:
> > > > > enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) 
> > > > > is
> > > > > throttled which means that se can't be NULL and we can skip the test.
> > > > >
> > > >
> > > > s/be NULL/be non-NULL/
> > > >
> > > > I think.
> > >
> > > This sentence refers to the move of enqueue_throttle and the fact that
> > > se can't be null when goto enqueue_throttle and we can jump directly
> > > after the if statement, which is now removed in v2 because se is
> > > always NULL if we don't use goto enqueue_throttle.
> > >
> > > I haven't change the commit message for the remove of if statement
> > >
> >
> > Fair enough, it just seems backwards from the intent of the patch now.
> >
> > There is also an extra }  after the update_overutilized_status.
> 
> don't know what I did but it's crap.  sorry about that
>

No worries. I didn't see it when I read it either. The compiler told me :)


> Let me prepare a v3
> 
> >
> >
> > Cheers,
> > Phil
> >
> >
> >
> > > >
> > > > It's more like if it doesn't jump to the label then se must be NULL for
> > > > the loop to terminate.  The final loop is a NOP if se is NULL. The check
> > > > wasn't protecting that.
> > > >
> > > > Otherwise still
> > > >
> > > > > Reviewed-by: Phil Auld 
> > > >
> > > > Cheers,
> > > > Phil
> > > >
> > > >
> > > > > Signed-off-by: Vincent Guittot 
> > > > > ---
> > > > >
> > > > > v2 changes:
> > > > > - Remove useless if statement
> > > > >
> > > > >  kernel/sched/fair.c | 39 ---
> > > > >  1 file changed, 20 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index a0c690d57430..b51b12d63c39 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -5513,28 +5513,29 @@ enqueue_task_fair(struct rq *rq, struct 
> > > > > task_struct *p, int flags)
> > > > > list_add_leaf_cfs_rq(cfs_rq);
> > > > >   }
> > > > >
> > > > > -enqueue_throttle:
> > > > > - if (!se) {
> > > > > - add_nr_running(rq, 1);
> > > > > - /*
> > > > > -  * Since new tasks are assigned an initial util_avg 
> > > > > equal to
> > > > > -  * half of the spare capacity of their CPU, tiny tasks 
> > > > > have the
> > > > > -  * ability to cross the overutilized threshold, which 
> > > > > will
> > > > > -  * result in the load balancer ruining all the task 
> > > > > placement
> > > > > -  * done by EAS. As a way to mitigate that effect, do 
> > > > > not account
> > > > > -  * for the first enqueue operation of new tasks during 
> > > > > the
> > > > > -  * overutilized flag detection.
> > > > > -  *
> > > > > -  * A better way of solving this problem would be to 
> > > > > wait for
> > > > > -  * the PELT signals of tasks to converge before taking 
> > > > > them
> > > > > -  * into account, but that is not straightforward to 
> > > > > implement,
> > > > > -  * and the following generally works well enough in 
> > > > > practice.
> > > > > -  */
> > > > > - if (flags & ENQUEUE_WAKEUP)
> > > > > - update_overutilized_status(rq);
> > > > > + /* At this point se is NULL and we are at root level*/
> > > > > + add_nr_running(rq

Re: [PATCH v2] sched/fair: enqueue_task_fair optimization

2020-05-13 Thread Phil Auld
On Wed, May 13, 2020 at 03:10:28PM +0200 Vincent Guittot wrote:
> On Wed, 13 May 2020 at 14:45, Phil Auld  wrote:
> >
> > Hi Vincent,
> >
> > On Wed, May 13, 2020 at 02:33:35PM +0200 Vincent Guittot wrote:
> > > enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is
> > > throttled which means that se can't be NULL and we can skip the test.
> > >
> >
> > s/be NULL/be non-NULL/
> >
> > I think.
> 
> This sentence refers to the move of enqueue_throttle and the fact that
> se can't be null when goto enqueue_throttle and we can jump directly
> after the if statement, which is now removed in v2 because se is
> always NULL if we don't use goto enqueue_throttle.
> 
> I haven't change the commit message for the remove of if statement
>

Fair enough, it just seems backwards from the intent of the patch now.

There is also an extra }  after the update_overutilized_status.


Cheers,
Phil



> >
> > It's more like if it doesn't jump to the label then se must be NULL for
> > the loop to terminate.  The final loop is a NOP if se is NULL. The check
> > wasn't protecting that.
> >
> > Otherwise still
> >
> > > Reviewed-by: Phil Auld 
> >
> > Cheers,
> > Phil
> >
> >
> > > Signed-off-by: Vincent Guittot 
> > > ---
> > >
> > > v2 changes:
> > > - Remove useless if statement
> > >
> > >  kernel/sched/fair.c | 39 ---
> > >  1 file changed, 20 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index a0c690d57430..b51b12d63c39 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -5513,28 +5513,29 @@ enqueue_task_fair(struct rq *rq, struct 
> > > task_struct *p, int flags)
> > > list_add_leaf_cfs_rq(cfs_rq);
> > >   }
> > >
> > > -enqueue_throttle:
> > > - if (!se) {
> > > - add_nr_running(rq, 1);
> > > - /*
> > > -  * Since new tasks are assigned an initial util_avg equal to
> > > -  * half of the spare capacity of their CPU, tiny tasks have 
> > > the
> > > -  * ability to cross the overutilized threshold, which will
> > > -  * result in the load balancer ruining all the task 
> > > placement
> > > -  * done by EAS. As a way to mitigate that effect, do not 
> > > account
> > > -  * for the first enqueue operation of new tasks during the
> > > -  * overutilized flag detection.
> > > -  *
> > > -  * A better way of solving this problem would be to wait for
> > > -  * the PELT signals of tasks to converge before taking them
> > > -  * into account, but that is not straightforward to 
> > > implement,
> > > -  * and the following generally works well enough in 
> > > practice.
> > > -  */
> > > - if (flags & ENQUEUE_WAKEUP)
> > > - update_overutilized_status(rq);
> > > + /* At this point se is NULL and we are at root level*/
> > > + add_nr_running(rq, 1);
> > > +
> > > + /*
> > > +  * Since new tasks are assigned an initial util_avg equal to
> > > +  * half of the spare capacity of their CPU, tiny tasks have the
> > > +  * ability to cross the overutilized threshold, which will
> > > +  * result in the load balancer ruining all the task placement
> > > +  * done by EAS. As a way to mitigate that effect, do not account
> > > +  * for the first enqueue operation of new tasks during the
> > > +  * overutilized flag detection.
> > > +  *
> > > +  * A better way of solving this problem would be to wait for
> > > +  * the PELT signals of tasks to converge before taking them
> > > +  * into account, but that is not straightforward to implement,
> > > +  * and the following generally works well enough in practice.
> > > +  */
> > > + if (flags & ENQUEUE_WAKEUP)
> > > + update_overutilized_status(rq);
> > >
> > >   }
> > >
> > > +enqueue_throttle:
> > >   if (cfs_bandwidth_used()) {
> > >   /*
> > >* When bandwidth control is enabled; the cfs_rq_throttled()
> > > --
> > > 2.17.1
> > >
> >
> > --
> >
> 

-- 



Re: [PATCH v2] sched/fair: fix unthrottle_cfs_rq for leaf_cfs_rq list

2020-05-13 Thread Phil Auld
Hi Vincent,

On Wed, May 13, 2020 at 02:34:22PM +0200 Vincent Guittot wrote:
> Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
> are quite close and follow the same sequence for enqueuing an entity in the
> cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
> enqueue_task_fair(). This fixes a problem already faced with the latter and
> add an optimization in the last for_each_sched_entity loop.
> 
> Reported-by Tao Zhou 
> Reviewed-by: Phil Auld 
> Signed-off-by: Vincent Guittot 
> ---
> 
> v2 changes:
> - Remove useless if statement
> 
>  kernel/sched/fair.c | 41 ++---
>  1 file changed, 30 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4e12ba882663..a0c690d57430 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4816,26 +4816,44 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>   idle_task_delta = cfs_rq->idle_h_nr_running;
>   for_each_sched_entity(se) {
>   if (se->on_rq)
> - enqueue = 0;

Can probably drop the now-unused enqueue variable too.


Cheers,
Phil



> + break;
> + cfs_rq = cfs_rq_of(se);
> + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>  
> + cfs_rq->h_nr_running += task_delta;
> + cfs_rq->idle_h_nr_running += idle_task_delta;
> +
> + /* end evaluation on encountering a throttled cfs_rq */
> + if (cfs_rq_throttled(cfs_rq))
> + goto unthrottle_throttle;
> + }
> +
> + for_each_sched_entity(se) {
>   cfs_rq = cfs_rq_of(se);
> - if (enqueue) {
> - enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> - } else {
> - update_load_avg(cfs_rq, se, 0);
> - se_update_runnable(se);
> - }
> +
> + update_load_avg(cfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
>  
>   cfs_rq->h_nr_running += task_delta;
>   cfs_rq->idle_h_nr_running += idle_task_delta;
>  
> +
> + /* end evaluation on encountering a throttled cfs_rq */
>   if (cfs_rq_throttled(cfs_rq))
> - break;
> + goto unthrottle_throttle;
> +
> + /*
> +  * One parent has been throttled and cfs_rq removed from the
> +  * list. Add it back to not break the leaf list.
> +  */
> + if (throttled_hierarchy(cfs_rq))
> + list_add_leaf_cfs_rq(cfs_rq);
>   }
>  
> - if (!se)
> - add_nr_running(rq, task_delta);
> + /* At this point se is NULL and we are at root level*/
> + add_nr_running(rq, task_delta);
>  
> +unthrottle_throttle:
>   /*
>* The cfs_rq_throttled() breaks in the above iteration can result in
>* incomplete leaf list maintenance, resulting in triggering the
> @@ -4844,7 +4862,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>   for_each_sched_entity(se) {
>   cfs_rq = cfs_rq_of(se);
>  
> - list_add_leaf_cfs_rq(cfs_rq);
> + if (list_add_leaf_cfs_rq(cfs_rq))
> + break;
>   }
>  
>   assert_list_leaf_cfs_rq(rq);
> -- 
> 2.17.1
> 

-- 



Re: [PATCH v2] sched/fair: enqueue_task_fair optimization

2020-05-13 Thread Phil Auld
Hi Vincent,

On Wed, May 13, 2020 at 02:33:35PM +0200 Vincent Guittot wrote:
> enqueue_task_fair jumps to enqueue_throttle label when cfs_rq_of(se) is
> throttled which means that se can't be NULL and we can skip the test.
>

s/be NULL/be non-NULL/

I think.

It's more like if it doesn't jump to the label then se must be NULL for
the loop to terminate.  The final loop is a NOP if se is NULL. The check
wasn't protecting that.

Otherwise still

> Reviewed-by: Phil Auld 

Cheers,
Phil


> Signed-off-by: Vincent Guittot 
> ---
> 
> v2 changes:
> - Remove useless if statement
> 
>  kernel/sched/fair.c | 39 ---
>  1 file changed, 20 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index a0c690d57430..b51b12d63c39 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5513,28 +5513,29 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> *p, int flags)
> list_add_leaf_cfs_rq(cfs_rq);
>   }
>  
> -enqueue_throttle:
> - if (!se) {
> - add_nr_running(rq, 1);
> - /*
> -  * Since new tasks are assigned an initial util_avg equal to
> -  * half of the spare capacity of their CPU, tiny tasks have the
> -  * ability to cross the overutilized threshold, which will
> -  * result in the load balancer ruining all the task placement
> -  * done by EAS. As a way to mitigate that effect, do not account
> -  * for the first enqueue operation of new tasks during the
> -  * overutilized flag detection.
> -  *
> -  * A better way of solving this problem would be to wait for
> -  * the PELT signals of tasks to converge before taking them
> -  * into account, but that is not straightforward to implement,
> -  * and the following generally works well enough in practice.
> -  */
> - if (flags & ENQUEUE_WAKEUP)
> - update_overutilized_status(rq);
> + /* At this point se is NULL and we are at root level*/
> + add_nr_running(rq, 1);
> +
> + /*
> +  * Since new tasks are assigned an initial util_avg equal to
> +  * half of the spare capacity of their CPU, tiny tasks have the
> +  * ability to cross the overutilized threshold, which will
> +  * result in the load balancer ruining all the task placement
> +  * done by EAS. As a way to mitigate that effect, do not account
> +  * for the first enqueue operation of new tasks during the
> +  * overutilized flag detection.
> +  *
> +  * A better way of solving this problem would be to wait for
> +  * the PELT signals of tasks to converge before taking them
> +  * into account, but that is not straightforward to implement,
> +  * and the following generally works well enough in practice.
> +  */
> + if (flags & ENQUEUE_WAKEUP)
> + update_overutilized_status(rq);
>  
>   }
>  
> +enqueue_throttle:
>   if (cfs_bandwidth_used()) {
>   /*
>* When bandwidth control is enabled; the cfs_rq_throttled()
> -- 
> 2.17.1
> 

-- 



Re: [PATCH] sched/fair: fix unthrottle_cfs_rq for leaf_cfs_rq list

2020-05-12 Thread Phil Auld
On Mon, May 11, 2020 at 09:13:20PM +0200 Vincent Guittot wrote:
> Although not exactly identical, unthrottle_cfs_rq() and enqueue_task_fair()
> are quite close and follow the same sequence for enqueuing an entity in the
> cfs hierarchy. Modify unthrottle_cfs_rq() to use the same pattern as
> enqueue_task_fair(). This fixes a problem already faced with the latter and
> add an optimization in the last for_each_sched_entity loop.
> 
> Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> Reported-by Tao Zhou 
> Signed-off-by: Vincent Guittot 
> ---
> 
> This path applies on top of 20200507203612.gf19...@lorien.usersys.redhat.com
> and fixes similar problem for unthrottle_cfs_rq()
> 
>  kernel/sched/fair.c | 37 -
>  1 file changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e2450c2e0747..4b73518aa25c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4803,26 +4803,44 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>   idle_task_delta = cfs_rq->idle_h_nr_running;
>   for_each_sched_entity(se) {
>   if (se->on_rq)
> - enqueue = 0;
> + break;
> + cfs_rq = cfs_rq_of(se);
> + enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
>  
> + cfs_rq->h_nr_running += task_delta;
> + cfs_rq->idle_h_nr_running += idle_task_delta;
> +
> + /* end evaluation on encountering a throttled cfs_rq */
> + if (cfs_rq_throttled(cfs_rq))
> + goto unthrottle_throttle;
> + }
> +
> + for_each_sched_entity(se) {
>   cfs_rq = cfs_rq_of(se);
> - if (enqueue) {
> - enqueue_entity(cfs_rq, se, ENQUEUE_WAKEUP);
> - } else {
> - update_load_avg(cfs_rq, se, 0);
> - se_update_runnable(se);
> - }
> +
> + update_load_avg(cfs_rq, se, UPDATE_TG);
> + se_update_runnable(se);
>  
>   cfs_rq->h_nr_running += task_delta;
>   cfs_rq->idle_h_nr_running += idle_task_delta;
>  
> +
> + /* end evaluation on encountering a throttled cfs_rq */
>   if (cfs_rq_throttled(cfs_rq))
> - break;
> + goto unthrottle_throttle;
> +
> + /*
> +  * One parent has been throttled and cfs_rq removed from the
> +  * list. Add it back to not break the leaf list.
> +  */
> + if (throttled_hierarchy(cfs_rq))
> + list_add_leaf_cfs_rq(cfs_rq);
>   }
>  
>   if (!se)
>   add_nr_running(rq, task_delta);
>  
> +unthrottle_throttle:
>   /*
>* The cfs_rq_throttled() breaks in the above iteration can result in
>* incomplete leaf list maintenance, resulting in triggering the
> @@ -4831,7 +4849,8 @@ void unthrottle_cfs_rq(struct cfs_rq *cfs_rq)
>   for_each_sched_entity(se) {
>   cfs_rq = cfs_rq_of(se);
>  
> - list_add_leaf_cfs_rq(cfs_rq);
> + if (list_add_leaf_cfs_rq(cfs_rq))
> + break;
>   }
>  
>   assert_list_leaf_cfs_rq(rq);
> -- 
> 2.17.1
> 

I ran my reproducer test with this one as well. As expected, since
the first patch fixed the issue I was seeing and I wasn't hitting
the assert here anyway, I didn't hit the assert.

But I also didn't hit any other issues, new or old. 

It makes sense to use the same logic flow here as enqueue_task_fair.

Reviewed-by: Phil Auld 


Cheers,
Phil
-- 



Re: [PATCH] sched/fair: enqueue_task_fair optimization

2020-05-12 Thread Phil Auld
On Mon, May 11, 2020 at 09:23:01PM +0200 Vincent Guittot wrote:
> enqueue_task_fair() jumps to enqueue_throttle when cfs_rq_of(se) is
> throttled, which means that se can't be NULL and we can skip the test.
> 
> Signed-off-by: Vincent Guittot 
> ---
>  kernel/sched/fair.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4b73518aa25c..910bbbe50365 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5512,7 +5512,6 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
> int flags)
> list_add_leaf_cfs_rq(cfs_rq);
>   }
>  
> -enqueue_throttle:
>   if (!se) {
>   add_nr_running(rq, 1);
>   /*
> @@ -5534,6 +5533,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
> int flags)
>  
>   }
>  
> +enqueue_throttle:
>   if (cfs_bandwidth_used()) {
>   /*
>* When bandwidth control is enabled; the cfs_rq_throttled()
> -- 
> 2.17.1
> 


Reviewed-by: Phil Auld 

-- 



Re: [PATCH v3] sched/fair: Fix enqueue_task_fair warning some more

2020-05-12 Thread Phil Auld
On Tue, May 12, 2020 at 04:10:48PM +0200 Peter Zijlstra wrote:
> On Tue, May 12, 2020 at 09:52:22AM -0400, Phil Auld wrote:
> > sched/fair: Fix enqueue_task_fair warning some more
> > 
> > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> > did not fully resolve the issues with the rq->tmp_alone_branch !=
> > &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
> > the first for_each_sched_entity loop exits due to on_rq, having incompletely
> > updated the list.  In this case the second for_each_sched_entity loop can
> > further modify se. The later code to fix up the list management fails to do
> > what is needed because se does not point to the sched_entity which broke out
> > of the first loop. The list is not fixed up because the throttled parent was
> > already added back to the list by a task enqueue in a parallel child 
> > hierarchy.
> > 
> > Address this by calling list_add_leaf_cfs_rq if there are throttled parents
> > while doing the second for_each_sched_entity loop.
> > 
> > v3: clean up commit message and add fixes and review tags.
> 
> Excellent, ignore what I just sent, I now have this one.
> 

Thank you!


Cheers,
Phil
-- 



Re: [PATCH v3] sched/fair: Fix enqueue_task_fair warning some more

2020-05-12 Thread Phil Auld
sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.

Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.

v3: clean up commit message and add fixes and review tags.

Suggested-by: Vincent Guittot 
Signed-off-by: Phil Auld 
Cc: Peter Zijlstra (Intel) 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Cc: Juri Lelli 
Reviewed-by: Vincent Guittot 
Reviewed-by: Dietmar Eggemann 
Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
---
 kernel/sched/fair.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..c6d57c334d51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5479,6 +5479,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
+
+   /*
+* One parent has been throttled and cfs_rq removed from the
+* list. Add it back to not break the leaf list.
+*/
+   if (throttled_hierarchy(cfs_rq))
+   list_add_leaf_cfs_rq(cfs_rq);
}
 
 enqueue_throttle:
-- 
2.18.0


-- 



Re: [PATCH v2] sched/fair: Fix enqueue_task_fair warning some more

2020-05-12 Thread Phil Auld
Hi Dietmar,

On Tue, May 12, 2020 at 11:00:16AM +0200 Dietmar Eggemann wrote:
> On 11/05/2020 22:44, Phil Auld wrote:
> > On Mon, May 11, 2020 at 09:25:43PM +0200 Vincent Guittot wrote:
> >> On Thu, 7 May 2020 at 22:36, Phil Auld  wrote:
> >>>
> >>> sched/fair: Fix enqueue_task_fair warning some more
> >>>
> >>> The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> >>> did not fully resolve the issues with the rq->tmp_alone_branch !=
> >>> &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
> >>> the first for_each_sched_entity loop exits due to on_rq, having 
> >>> incompletely
> >>> updated the list.  In this case the second for_each_sched_entity loop can
> >>> further modify se. The later code to fix up the list management fails to 
> >>> do
> >>> what is needed because se no longer points to the sched_entity which broke
> >>> out of the first loop.
> >>>
> >>> Address this by calling leaf_add_rq_list if there are throttled parents 
> >>> while
> >>> doing the second for_each_sched_entity loop.
> >>>
> >>
> >> Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> >>
> >>> Suggested-by: Vincent Guittot 
> >>> Signed-off-by: Phil Auld 
> >>> Cc: Peter Zijlstra (Intel) 
> >>> Cc: Vincent Guittot 
> >>> Cc: Ingo Molnar 
> >>> Cc: Juri Lelli 
> >>
> >> With the Fixes tag and the typo mentioned by Tao
> >>
> > 
> > Right, that last line of the commit message should read 
> > "list_add_leaf_cfs_rq"
> > 
> > 
> >> Reviewed-by: Vincent Guittot 
> > 
> > Thanks Vincent.
> > 
> > Peter/Ingo, do you want me to resend or can you fix when applying?
> 
> 
> Maybe you could add that 'the throttled parent was already added back to
> the list by a task enqueue in a parallel child hierarchy'.
> 
> IMHO, this is part of the description because otherwise the throttled
> parent would have connected the branch.
> 
> And the not-adding of the intermediate child cfs_rq would have gone
> unnoticed.

Okay, I'll add that statement. For those curious, here are the lines from about
70ms earlier in the trace where the throttled parent (0xa085e48ce000) is added
to the list.

bz1738415-test-6264  [005]  1271.315046: sched_waking: 
comm=bz1738415-test pid=6269 prio=120 target_cpu=005
bz1738415-test-6264  [005]  1271.315048: sched_migrate_task:   
comm=bz1738415-test pid=6269 prio=120 orig_cpu=5 dest_cpu=17
bz1738415-test-6264  [005]  1271.315050: bprint:   
enqueue_task_fair: se 0xa081e6d7de80 on_rq 0 cfs_rq = 0xa085e48ce000
bz1738415-test-6264  [005]  1271.315051: bprint:   enqueue_entity: 
Add_leaf_rq: cpu 17: nr_r 2; cfs 0xa085e48ce000 onlist 0 tmp_a_b = 
0xa085ef92c868 &rq->l_c_r_l = 0xa085ef92c868
bz1738415-test-6264  [005]  1271.315053: bprint:   enqueue_entity: 
Add_leaf_rq: cpu 17: nr_r 2: parent onlist  Set tmp_alone_branch to 
0xa085ef92c868
bz1738415-test-6264  [005]  1271.315053: bprint:   
enqueue_task_fair: current se = 0xa081e6d7de80, orig_se = 0xa081e6d7de80
bz1738415-test-6264  [005]  1271.315055: bprint:   
enqueue_task_fair: Add_leaf_rq: cpu 17: nr_r 2; cfs 0xa085e48ce000 onlist 1 
tmp_a_b = 0xa085ef92c868 &rq->l_c_r_l = 0xa085ef92c868
bz1738415-test-6264  [005]  1271.315056: sched_wake_idle_without_ipi: cpu=17

> 
> Reviewed-by: Dietmar Eggemann 

Thanks,

Phil


> 
> [...]
> 

-- 



Re: [PATCH v2] sched/fair: Fix enqueue_task_fair warning some more

2020-05-11 Thread Phil Auld
On Mon, May 11, 2020 at 09:25:43PM +0200 Vincent Guittot wrote:
> On Thu, 7 May 2020 at 22:36, Phil Auld  wrote:
> >
> > sched/fair: Fix enqueue_task_fair warning some more
> >
> > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> > did not fully resolve the issues with the rq->tmp_alone_branch !=
> > &rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
> > the first for_each_sched_entity loop exits due to on_rq, having incompletely
> > updated the list.  In this case the second for_each_sched_entity loop can
> > further modify se. The later code to fix up the list management fails to do
> > what is needed because se no longer points to the sched_entity which broke
> > out of the first loop.
> >
> > Address this by calling leaf_add_rq_list if there are throttled parents 
> > while
> > doing the second for_each_sched_entity loop.
> >
> 
> Fixes: fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> 
> > Suggested-by: Vincent Guittot 
> > Signed-off-by: Phil Auld 
> > Cc: Peter Zijlstra (Intel) 
> > Cc: Vincent Guittot 
> > Cc: Ingo Molnar 
> > Cc: Juri Lelli 
> 
> With the Fixes tag and the typo mentioned by Tao
>

Right, that last line of the commit message should read "list_add_leaf_cfs_rq"


> Reviewed-by: Vincent Guittot 

Thanks Vincent.

Peter/Ingo, do you want me to resend or can you fix when applying?


Thanks,
Phil

> 
> > ---
> >  kernel/sched/fair.c | 7 +++
> >  1 file changed, 7 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..c6d57c334d51 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5479,6 +5479,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct 
> > *p, int flags)
> > /* end evaluation on encountering a throttled cfs_rq */
> > if (cfs_rq_throttled(cfs_rq))
> > goto enqueue_throttle;
> > +
> > +   /*
> > +* One parent has been throttled and cfs_rq removed from the
> > +* list. Add it back to not break the leaf list.
> > +*/
> > +   if (throttled_hierarchy(cfs_rq))
> > +   list_add_leaf_cfs_rq(cfs_rq);
> > }
> >
> >  enqueue_throttle:
> > --
> > 2.18.0
> >
> > V2 rework the fix based on Vincent's suggestion. Thanks Vincent.
> >
> >
> > Cheers,
> > Phil
> >
> > --
> >
> 

-- 



Re: [PATCH v2] sched/fair: Fix enqueue_task_fair warning some more

2020-05-07 Thread Phil Auld
sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se no longer points to the sched_entity which broke
out of the first loop.

Address this by calling leaf_add_rq_list if there are throttled parents while
doing the second for_each_sched_entity loop.

Suggested-by: Vincent Guittot 
Signed-off-by: Phil Auld 
Cc: Peter Zijlstra (Intel) 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Cc: Juri Lelli 
---
 kernel/sched/fair.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..c6d57c334d51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5479,6 +5479,13 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
/* end evaluation on encountering a throttled cfs_rq */
if (cfs_rq_throttled(cfs_rq))
goto enqueue_throttle;
+
+   /*
+* One parent has been throttled and cfs_rq removed from the
+* list. Add it back to not break the leaf list.
+*/
+   if (throttled_hierarchy(cfs_rq))
+   list_add_leaf_cfs_rq(cfs_rq);
}
 
 enqueue_throttle:
-- 
2.18.0

V2 rework the fix based on Vincent's suggestion. Thanks Vincent.


Cheers,
Phil

-- 



Re: [PATCH] sched/fair: Fix enqueue_task_fair warning some more

2020-05-07 Thread Phil Auld
Hi Vincent,

On Thu, May 07, 2020 at 05:06:29PM +0200 Vincent Guittot wrote:
> Hi Phil,
> 
> On Wed, 6 May 2020 at 20:05, Phil Auld  wrote:
> >
> > Hi Vincent,
> >
> > Thanks for taking a look. More below...
> >
> > On Wed, May 06, 2020 at 06:36:45PM +0200 Vincent Guittot wrote:
> > > Hi Phil,
> > >
> > > - reply to all this time
> > >
> > > On Wed, 6 May 2020 at 16:18, Phil Auld  wrote:
> > > >
> > > > sched/fair: Fix enqueue_task_fair warning some more
> > > >
> > > > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair 
> > > > warning)
> > > > did not fully resolve the issues with the (rq->tmp_alone_branch !=
> > > > &rq->leaf_cfs_rq_list) warning in enqueue_task_fair. There is a case 
> > > > where
> > > > the first for_each_sched_entity loop exits due to on_rq, having 
> > > > incompletely
> > > > updated the list.  In this case the second for_each_sched_entity loop 
> > > > can
> > > > further modify se. The later code to fix up the list management fails 
> > > > to do
> > >
> > > But for the 2nd  for_each_sched_entity, the cfs_rq should already be
> > > in the list, isn't it ?
> >
> > No. In this case we hit the parent not on list case in list_add_leaf_cfs_rq
> > which sets rq->tmp_alone_branch to cfs_rq->leaf_cfs_rq_list which is not
> > the same. It returns false expecting the parent to be added later.
> >
> > But then the parent doesn't get there because it's on_rq.
> >
> > >
> > > The third for_each_entity loop is there for the throttled case but is
> > > useless for other case
> > >
> >
> > There actually is a throttling involved usually. The second loop breaks out
> > early because one of the parents is throttled. But not before it advances
> > se at least once.
> 
> Ok, that's even because of the throttling that the problem occurs
> 
> >
> > Then the 3rd loop doesn't fix the tmp_alone_branch because it doesn't start
> > with the right se.
> >
> > > Could you provide us some details about the use case that creates such
> > > a situation ?
> > >
> >
> > I admit I had to add trace_printks to get here. Here's what it showed (sorry
> > for the long lines...)
> >
> > 1)  sh-6271  [044]  1271.322317: bprint: enqueue_task_fair: se 
> > 0xa085e7e30080 on_rq 0 cfs_rq = 0xa085e93da200
> > 2)  sh-6271  [044]  1271.322320: bprint: enqueue_entity: Add_leaf_rq: cpu 
> > 17: nr_r 2; cfs 0xa085e93da200 onlist 0 tmp_a_b = 0xa085ef92c868 
> > &rq->l_c_r_l = 0xa085ef92c868
> > 3)  sh-6271  [044]  1271.322322: bprint: enqueue_entity: Add_leaf_rq: cpu 
> > 17: nr_r 2: parent not onlist  Set t_a_branch to 0xa085e93da340 
> > rq->l_c_r_l = 0xa085ef92c868
> > 4)  sh-6271  [044]  1271.322323: bprint: enqueue_task_fair: se 
> > 0xa085e93d8800 on_rq 1 cfs_rq = 0xa085dbfaea00
> > 5)  sh-6271  [044]  1271.322324: bprint: enqueue_task_fair: Done enqueues, 
> > se=0xa085e93d8800, pid=3642
> > 6)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: update: cfs 
> > 0xa085e48ce000 throttled, se = 0xa085dbfafc00
> > 7)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: current se = 
> > 0xa085dbfafc00, orig_se = 0xa085e7e30080
> > 8)  sh-6271  [044]  1271.322327: bprint: enqueue_task_fair: Add_leaf_rq: 
> > cpu 17: nr_r 2; cfs 0xa085e48ce000 onlist 1 tmp_a_b = 
> > 0xa085e93da340 &rq->l_c_r_l = 0xa085ef92c868
> > 9)  sh-6271  [044]  1271.322328: bprint: enqueue_task_fair: Add_leaf_rq: 
> > cpu 17: nr_r 0; cfs 0xa085ef92bf80 onlist 1 tmp_a_b = 
> > 0xa085e93da340 &rq->l_c_r_l = 0xa085ef92c868
> > 10) sh-6271  [044]  1271.672599: bprint: enqueue_task_fair: cpu 17: 
> > rq->tmp_alone_branch = 0xa085e93da340 != &rq->leaf_cfs_rq_list = 
> > 0xa085ef92c868
> >
> >
> > lines 1 and 4 are from the first loop in enqueue_task_fair. Line 2 and 3 
> > are from the
> > first call to list_add_leaf_rq with line 2 being at the start and line 3 
> > showing which
> > of the 3 cases we hit.
> >
> > Line 5 is right after the first loop.
> >
> > Line 6 is the second trip through the 2nd loop and is in the if(throttled) 
> > condition.
> > Line 7 is right below the enqueue_throttle label.
> >
> > Lines 8 and 9 are from the fixup loop and since onlist is set for both of 
> > these it doesn't
> > do anything. But we've le

Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6

2020-05-07 Thread Phil Auld
On Thu, May 07, 2020 at 06:29:44PM +0200 Jirka Hladky wrote:
> Hi Mel,
> 
> we are not targeting just OMP applications. We see the performance
> degradation also for other workloads, like SPECjbb2005 and
> SPECjvm2008. Even worse, it also affects a higher number of threads.
> For example, comparing 5.7.0-0.rc2 against 5.6 kernel, on 4 NUMA
> server with 2x AMD 7351 CPU, we see performance degradation 22% for 32
> threads (the system has 64 CPUs in total). We observe this degradation
> only when we run a single SPECjbb binary. When running 4 SPECjbb
> binaries in parallel, there is no change in performance between 5.6
> and 5.7.
> 
> That's why we are asking for the kernel tunable, which we would add to
> the tuned profile. We don't expect users to change this frequently but
> rather to set the performance profile once based on the purpose of the
> server.
> 
> If you could prepare a patch for us, we would be more than happy to
> test it extensively. Based on the results, we can then evaluate if
> it's the way to go. Thoughts?
>

I'm happy to spin up a patch once I'm sure exactly what the tuning would
affect. At an initial glance I'm thinking it would be the imbalance_min
which is currently hardcoded to 2. But there may be something else...
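As a strawman, the sketch below shows roughly what I mean: lift the hardcoded
value into a tunable and use it where the small NUMA imbalance is allowed. The
sysctl name and the exact shape of the helper here are assumptions for
illustration only, not the actual patch.

/* Hypothetical tunable; the default matches today's hardcoded value. */
unsigned int sysctl_sched_numa_imbalance_min __read_mostly = 2;

static inline long adjust_numa_imbalance(int imbalance, int nr_running)
{
	unsigned int imbalance_min = sysctl_sched_numa_imbalance_min;

	/*
	 * Allow a small imbalance for a pair of communicating tasks
	 * when the source domain is mostly idle.
	 */
	if (nr_running <= imbalance_min)
		return 0;

	return imbalance;
}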


Cheers,
Phil


> Thanks a lot!
> Jirka
> 
> On Thu, May 7, 2020 at 5:54 PM Mel Gorman  wrote:
> >
> > On Thu, May 07, 2020 at 05:24:17PM +0200, Jirka Hladky wrote:
> > > Hi Mel,
> > >
> > > > > Yes, it's indeed OMP.  With low threads count, I mean up to 2x number 
> > > > > of
> > > > > NUMA nodes (8 threads on 4 NUMA node servers, 16 threads on 8 NUMA 
> > > > > node
> > > > > servers).
> > > >
> > > > Ok, so we know it's within the imbalance threshold where a NUMA node can
> > > > be left idle.
> > >
> > > we have discussed today with my colleagues the performance drop for
> > > some workloads for low threads counts (roughly up to 2x number of NUMA
> > > nodes). We are worried that it can be a severe issue for some use
> > > cases, which require a full memory bandwidth even when only part of
> > > CPUs is used.
> > >
> > > We understand that scheduler cannot distinguish this type of workload
> > > from others automatically. However, there was an idea for a * new
> > > kernel tunable to control the imbalance threshold *. Based on the
> > > purpose of the server, users could set this tunable. See the tuned
> > > project, which allows creating performance profiles [1].
> > >
> >
> > I'm not completely opposed to it but given that the setting is global,
> > I imagine it could have other consequences if two applications ran
> > at different times have different requirements. Given that it's OMP,
> > I would have imagined that an application that really cared about this
> > would specify what was needed using OMP_PLACES. Why would someone prefer
> > kernel tuning or a tuned profile over OMP_PLACES? After all, it requires
> > specific knowledge of the application even to know that a particular
> > tuned profile is needed.
> >
> > --
> > Mel Gorman
> > SUSE Labs
> >
> 
> 
> -- 
> -Jirka
> 

-- 



Re: [PATCH] sched/fair: Fix enqueue_task_fair warning some more

2020-05-07 Thread Phil Auld
Hi Vincent,

On Thu, May 07, 2020 at 05:06:29PM +0200 Vincent Guittot wrote:
> Hi Phil,
> 
> On Wed, 6 May 2020 at 20:05, Phil Auld  wrote:
> >
> > Hi Vincent,
> >
> > Thanks for taking a look. More below...
> >
> > On Wed, May 06, 2020 at 06:36:45PM +0200 Vincent Guittot wrote:
> > > Hi Phil,
> > >
> > > - reply to all this time
> > >
> > > On Wed, 6 May 2020 at 16:18, Phil Auld  wrote:
> > > >
> > > > sched/fair: Fix enqueue_task_fair warning some more
> > > >
> > > > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair 
> > > > warning)
> > > > did not fully resolve the issues with the (rq->tmp_alone_branch !=
> > > > &rq->leaf_cfs_rq_list) warning in enqueue_task_fair. There is a case 
> > > > where
> > > > the first for_each_sched_entity loop exits due to on_rq, having 
> > > > incompletely
> > > > updated the list.  In this case the second for_each_sched_entity loop 
> > > > can
> > > > further modify se. The later code to fix up the list management fails 
> > > > to do
> > >
> > > But for the 2nd  for_each_sched_entity, the cfs_rq should already be
> > > in the list, isn't it ?
> >
> > No. In this case we hit the parent not on list case in list_add_leaf_cfs_rq
> > which sets rq->tmp_alone_branch to cfs_rq->leaf_cfs_rq_list which is not
> > the same. It returns false expecting the parent to be added later.
> >
> > But then the parent doesn't get there because it's on_rq.
> >
> > >
> > > The third for_each_entity loop is there for the throttled case but is
> > > useless for other case
> > >
> >
> > There actually is a throttling involved usually. The second loop breaks out
> > early because one of the parents is throttled. But not before it advances
> > se at least once.
> 
> Ok, that's even because of the throttling that the problem occurs
> 
> >
> > Then the 3rd loop doesn't fix the tmp_alone_branch because it doesn't start
> > with the right se.
> >
> > > Could you provide us some details about the use case that creates such
> > > a situation ?
> > >
> >
> > I admit I had to add trace_printks to get here. Here's what it showed (sorry
> > for the long lines...)
> >
> > 1)  sh-6271  [044]  1271.322317: bprint: enqueue_task_fair: se 
> > 0xa085e7e30080 on_rq 0 cfs_rq = 0xa085e93da200
> > 2)  sh-6271  [044]  1271.322320: bprint: enqueue_entity: Add_leaf_rq: cpu 
> > 17: nr_r 2; cfs 0xa085e93da200 onlist 0 tmp_a_b = 0xa085ef92c868 
> > &rq->l_c_r_l = 0xa085ef92c868
> > 3)  sh-6271  [044]  1271.322322: bprint: enqueue_entity: Add_leaf_rq: cpu 
> > 17: nr_r 2: parent not onlist  Set t_a_branch to 0xa085e93da340 
> > rq->l_c_r_l = 0xa085ef92c868
> > 4)  sh-6271  [044]  1271.322323: bprint: enqueue_task_fair: se 
> > 0xa085e93d8800 on_rq 1 cfs_rq = 0xa085dbfaea00
> > 5)  sh-6271  [044]  1271.322324: bprint: enqueue_task_fair: Done enqueues, 
> > se=0xa085e93d8800, pid=3642
> > 6)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: update: cfs 
> > 0xa085e48ce000 throttled, se = 0xa085dbfafc00
> > 7)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: current se = 
> > 0xa085dbfafc00, orig_se = 0xa085e7e30080
> > 8)  sh-6271  [044]  1271.322327: bprint: enqueue_task_fair: Add_leaf_rq: 
> > cpu 17: nr_r 2; cfs 0xa085e48ce000 onlist 1 tmp_a_b = 
> > 0xa085e93da340 &rq->l_c_r_l = 0xa085ef92c868
> > 9)  sh-6271  [044]  1271.322328: bprint: enqueue_task_fair: Add_leaf_rq: 
> > cpu 17: nr_r 0; cfs 0xa085ef92bf80 onlist 1 tmp_a_b = 
> > 0xa085e93da340 &rq->l_c_r_l = 0xa085ef92c868
> > 10) sh-6271  [044]  1271.672599: bprint: enqueue_task_fair: cpu 17: 
> > rq->tmp_alone_branch = 0xa085e93da340 != &rq->leaf_cfs_rq_list = 
> > 0xa085ef92c868
> >
> >
> > lines 1 and 4 are from the first loop in enqueue_task_fair. Line 2 and 3 
> > are from the
> > first call to list_add_leaf_rq with line 2 being at the start and line 3 
> > showing which
> > of the 3 cases we hit.
> >
> > Line 5 is right after the first loop.
> >
> > Line 6 is the second trip through the 2nd loop and is in the if(throttled) 
> > condition.
> > Line 7 is right below the enqueue_throttle label.
> >
> > Lines 8 and 9 are from the fixup loop and since onlist is set for both of 
> > these it doesn't
> > do anything. But we've le

Re: [PATCH] sched/fair: Fix enqueue_task_fair warning some more

2020-05-06 Thread Phil Auld
Hi Vincent,

Thanks for taking a look. More below...

On Wed, May 06, 2020 at 06:36:45PM +0200 Vincent Guittot wrote:
> Hi Phil,
> 
> - reply to all this time
> 
> On Wed, 6 May 2020 at 16:18, Phil Auld  wrote:
> >
> > sched/fair: Fix enqueue_task_fair warning some more
> >
> > The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
> > did not fully resolve the issues with the (rq->tmp_alone_branch !=
> > &rq->leaf_cfs_rq_list) warning in enqueue_task_fair. There is a case where
> > the first for_each_sched_entity loop exits due to on_rq, having incompletely
> > updated the list.  In this case the second for_each_sched_entity loop can
> > further modify se. The later code to fix up the list management fails to do
> 
> But for the 2nd  for_each_sched_entity, the cfs_rq should already be
> in the list, isn't it ?

No. In this case we hit the parent not on list case in list_add_leaf_cfs_rq
which sets rq->tmp_alone_branch to cfs_rq->leaf_cfs_rq_list which is not
the same. It returns false expecting the parent to be added later.

But then the parent doesn't get there because it's on_rq. 

> 
> The third for_each_entity loop is there for the throttled case but is
> useless for other case
>

There actually is a throttling involved usually. The second loop breaks out
early because one of the parents is throttled. But not before it advances
se at least once. 

Then the 3rd loop doesn't fix the tmp_alone_branch because it doesn't start
with the right se.

> Could you provide us some details about the use case that creates such
> a situation ?
>

I admit I had to add trace_printks to get here. Here's what it showed (sorry
for the long lines...)

1)  sh-6271  [044]  1271.322317: bprint: enqueue_task_fair: se 
0xa085e7e30080 on_rq 0 cfs_rq = 0xa085e93da200
2)  sh-6271  [044]  1271.322320: bprint: enqueue_entity: Add_leaf_rq: cpu 17: 
nr_r 2; cfs 0xa085e93da200 onlist 0 tmp_a_b = 0xa085ef92c868 
&rq->l_c_r_l = 0xa085ef92c868
3)  sh-6271  [044]  1271.322322: bprint: enqueue_entity: Add_leaf_rq: cpu 17: 
nr_r 2: parent not onlist  Set t_a_branch to 0xa085e93da340 rq->l_c_r_l = 
0xa085ef92c868
4)  sh-6271  [044]  1271.322323: bprint: enqueue_task_fair: se 
0xa085e93d8800 on_rq 1 cfs_rq = 0xa085dbfaea00
5)  sh-6271  [044]  1271.322324: bprint: enqueue_task_fair: Done enqueues, 
se=0xa085e93d8800, pid=3642
6)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: update: cfs 
0xa085e48ce000 throttled, se = 0xa085dbfafc00
7)  sh-6271  [044]  1271.322326: bprint: enqueue_task_fair: current se = 
0xa085dbfafc00, orig_se = 0xa085e7e30080
8)  sh-6271  [044]  1271.322327: bprint: enqueue_task_fair: Add_leaf_rq: cpu 
17: nr_r 2; cfs 0xa085e48ce000 onlist 1 tmp_a_b = 0xa085e93da340 
&rq->l_c_r_l = 0xa085ef92c868
9)  sh-6271  [044]  1271.322328: bprint: enqueue_task_fair: Add_leaf_rq: cpu 
17: nr_r 0; cfs 0xa085ef92bf80 onlist 1 tmp_a_b = 0xa085e93da340 
&rq->l_c_r_l = 0xa085ef92c868
10) sh-6271  [044]  1271.672599: bprint: enqueue_task_fair: cpu 17: 
rq->tmp_alone_branch = 0xa085e93da340 != &rq->leaf_cfs_rq_list = 
0xa085ef92c868


Lines 1 and 4 are from the first loop in enqueue_task_fair. Lines 2 and 3 are
from the first call to list_add_leaf_rq, with line 2 being at the start and
line 3 showing which of the 3 cases we hit.

Line 5 is right after the first loop.

Line 6 is the second trip through the 2nd loop and is in the if(throttled) 
condition.
Line 7 is right below the enqueue_throttle label.

Lines 8 and 9 are from the fixup loop and, since onlist is set for both of
these, it doesn't do anything. But we've left rq->tmp_alone_branch pointing to
the cfs_rq->leaf_cfs_rq_list from the one call to list_add_leaf_rq that did
something, so the cleanup doesn't work.

Based on the comment at the cleanup, it looked like it expected se to be what
it was when the first loop broke, not whatever it was left at after the second
loop.  It could have been NULL there too, I guess, but I didn't hit that case.

This is 100% reproducible. And completely gone with the fix. I have a trace 
showing that.

Does that make more sense?



Cheers,
Phil


> > what is needed because se does not point to the sched_entity which broke out
> > of the first loop.
> >
> > Address this issue by saving the se pointer when the first loop exits and
> > resetting it before doing the fix up, if needed.
> >
> > Signed-off-by: Phil Auld 
> > Cc: Peter Zijlstra (Intel) 
> > Cc: Vincent Guittot 
> > Cc: Ingo Molnar 
> > Cc: Juri Lelli 
> > ---
> >  kernel/sched/fair.c | 4 
> >  1 file changed, 4 insertions(+)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 02f323b85b6d..719c996317e3 100644
> > --- a/kernel/sche

[PATCH] sched/fair: Fix enqueue_task_fair warning some more

2020-05-06 Thread Phil Auld
sched/fair: Fix enqueue_task_fair warning some more

The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the (rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list) warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list.  In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop.

Address this issue by saving the se pointer when the first loop exits and
resetting it before doing the fix up, if needed.

Signed-off-by: Phil Auld 
Cc: Peter Zijlstra (Intel) 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Cc: Juri Lelli 
---
 kernel/sched/fair.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 02f323b85b6d..719c996317e3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5432,6 +5432,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
 {
struct cfs_rq *cfs_rq;
struct sched_entity *se = &p->se;
+   struct sched_entity *saved_se = NULL;
int idle_h_nr_running = task_has_idle_policy(p);
 
/*
@@ -5466,6 +5467,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
flags = ENQUEUE_WAKEUP;
}
 
+   saved_se = se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
 
@@ -5510,6 +5512,8 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, 
int flags)
 * leaf list maintenance, resulting in triggering the assertion
 * below.
 */
+   if (saved_se)
+   se = saved_se;
for_each_sched_entity(se) {
cfs_rq = cfs_rq_of(se);
 
-- 
2.18.0


Cheers,
Phil



Re: [PATCH v4 00/10] sched/fair: rework the CFS load balance

2019-10-21 Thread Phil Auld
On Mon, Oct 21, 2019 at 10:44:20AM +0200 Vincent Guittot wrote:
> On Mon, 21 Oct 2019 at 09:50, Ingo Molnar  wrote:
> >
> >
> > * Vincent Guittot  wrote:
> >
> > > Several wrong task placement have been raised with the current load
> > > balance algorithm but their fixes are not always straight forward and
> > > end up with using biased values to force migrations. A cleanup and rework
> > > of the load balance will help to handle such UCs and enable to fine grain
> > > the behavior of the scheduler for other cases.
> > >
> > > Patch 1 has already been sent separately and only consolidate asym policy
> > > in one place and help the review of the changes in load_balance.
> > >
> > > Patch 2 renames the sum of h_nr_running in stats.
> > >
> > > Patch 3 removes meaningless imbalance computation to make review of
> > > patch 4 easier.
> > >
> > > Patch 4 reworks load_balance algorithm and fixes some wrong task placement
> > > but try to stay conservative.
> > >
> > > Patch 5 add the sum of nr_running to monitor non cfs tasks and take that
> > > into account when pulling tasks.
> > >
> > > Patch 6 replaces runnable_load by load now that the signal is only used
> > > when overloaded.
> > >
> > > Patch 7 improves the spread of tasks at the 1st scheduling level.
> > >
> > > Patch 8 uses utilization instead of load in all steps of misfit task
> > > path.
> > >
> > > Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
> > >
> > > Patch 10 optimizes find_idlest_group() that was using both runnable_load
> > > and load. This has not been squashed with previous patch to ease the
> > > review.
> > >
> > > Patch 11 reworks find_idlest_group() to follow the same steps as
> > > find_busiest_group()
> > >
> > > Some benchmarks results based on 8 iterations of each tests:
> > > - small arm64 dual quad cores system
> > >
> > >tip/sched/corew/ this patchsetimprovement
> > > schedpipe  53125 +/-0.18%53443 +/-0.52%   (+0.60%)
> > >
> > > hackbench -l (2560/#grp) -g #grp
> > >  1 groups  1.579 +/-29.16%   1.410 +/-13.46% (+10.70%)
> > >  4 groups  1.269 +/-9.69%1.205 +/-3.27%   (+5.00%)
> > >  8 groups  1.117 +/-1.51%1.123 +/-1.27%   (+4.57%)
> > > 16 groups  1.176 +/-1.76%1.164 +/-2.42%   (+1.07%)
> > >
> > > Unixbench shell8
> > >   1 test 1963.48 +/-0.36%   1902.88 +/-0.73%(-3.09%)
> > > 224 tests2427.60 +/-0.20%   2469.80 +/-0.42%  (1.74%)
> > >
> > > - large arm64 2 nodes / 224 cores system
> > >
> > >tip/sched/corew/ this patchsetimprovement
> > > schedpipe 124084 +/-1.36%   124445 +/-0.67%   (+0.29%)
> > >
> > > hackbench -l (256000/#grp) -g #grp
> > >   1 groups15.305 +/-1.50%   14.001 +/-1.99%   (+8.52%)
> > >   4 groups 5.959 +/-0.70%5.542 +/-3.76%   (+6.99%)
> > >  16 groups 3.120 +/-1.72%3.253 +/-0.61%   (-4.92%)
> > >  32 groups 2.911 +/-0.88%2.837 +/-1.16%   (+2.54%)
> > >  64 groups 2.805 +/-1.90%2.716 +/-1.18%   (+3.17%)
> > > 128 groups 3.166 +/-7.71%3.891 +/-6.77%   (+5.82%)
> > > 256 groups 3.655 +/-10.09%   3.185 +/-6.65%  (+12.87%)
> > >
> > > dbench
> > >   1 groups   328.176 +/-0.29%  330.217 +/-0.32%   (+0.62%)
> > >   4 groups   930.739 +/-0.50%  957.173 +/-0.66%   (+2.84%)
> > >  16 groups  1928.292 +/-0.36% 1978.234 +/-0.88%   (+0.92%)
> > >  32 groups  2369.348 +/-1.72% 2454.020 +/-0.90%   (+3.57%)
> > >  64 groups  2583.880 +/-3.39% 2618.860 +/-0.84%   (+1.35%)
> > > 128 groups  2256.406 +/-10.67%2392.498 +/-2.13%   (+6.03%)
> > > 256 groups  1257.546 +/-3.81% 1674.684 +/-4.97%  (+33.17%)
> > >
> > > Unixbench shell8
> > >   1 test 6944.16 +/-0.02 6605.82 +/-0.11  (-4.87%)
> > > 224 tests   13499.02 +/-0.14   13637.94 +/-0.47% (+1.03%)
> > > lkp reported a -10% regression on shell8 (1 test) for v3 that
> > > seems to be partially recovered on my platform with v4.
> > >
> > > tip/sched/core sha1:
> > >   commit 563c4f85f9f0 ("Merge branch 'sched/rt' into sched/core, to pick 
> > > up -rt changes")
> > >
> > > Changes since v3:
> > > - small typo and variable ordering fixes
> > > - add some acked/reviewed tag
> > > - set 1 instead of load for migrate_misfit
> > > - use nr_h_running instead of load for asym_packing
> > > - update the optimization of find_idlest_group() and put back some
> > >  conditions when comparing load
> > > - rework find_idlest_group() to match find_busiest_group() behavior
> > >
> > > Changes since v2:
> > > - fix typo and reorder code
> > > - some minor code fixes
> > > - optimize find_idlest_group()
> > >
> > > Not covered in this patchset:
> > > - Better detection of overloaded and fully busy state, especially for 
> > > cases
> > >   when nr_running > nr CPUs.
> > >
> > > Vincent Guittot (11):
> > >   sched/fair: clean up asym packing
> > >   sched/fair: rename sum_nr_running to sum_h_nr_running
> > >   

Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance

2019-10-09 Thread Phil Auld
On Tue, Oct 08, 2019 at 05:53:11PM +0200 Vincent Guittot wrote:
> Hi Phil,
>  

...

> While preparing v4, I have noticed that I have probably oversimplified
> the end of find_idlest_group() in patch "sched/fair: optimize
> find_idlest_group" when it compares local vs the idlest other group.
> In particular, there was a NUMA-specific test that I removed in v3 and
> re-added in v4.
> 
> Then, I'm also preparing a full rework of find_idlest_group() so that it
> behaves more closely to load_balance(); I mean: collect statistics,
> classify the groups, then select the idlest one.
> 

Okay, I'll watch for V4 and retest. It really seems to be limited to 
the 8-node system. None of the other systems are showing this.


> What is the behavior of the lu.C threads? Are they waking up a lot, and
> could they trigger the slow wake path?

Yes, probably a fair bit of waking. It's an iterative equation-solving
code. It's fairly CPU intensive but requires communication for dependent
calculations.  That's part of why having them mis-balanced causes such a 
large slow down. I think at times everyone else waits for the slow guys.
 

> 
> >
> > The initial fixes I made for this issue did not exhibit this behavior. They
> > would have had other issues dealing with overload cases though. In this case
> > however there are only 154 or 158 threads on 160 CPUs so not overloaded.
> >
> > I'll try to get my hands on this system and poke into it. I just wanted to 
> > get
> > your thoughts and let you know where we are.
> 
> Thanks for testing
>

Sure!


Thanks,
Phil



 
> Vincent
> 
> >
> >
> >
> > Thanks,
> > Phil
> >
> >
> > System details:
> >
> > Architecture:x86_64
> > CPU op-mode(s):  32-bit, 64-bit
> > Byte Order:  Little Endian
> > CPU(s):  160
> > On-line CPU(s) list: 0-159
> > Thread(s) per core:  2
> > Core(s) per socket:  10
> > Socket(s):   8
> > NUMA node(s):8
> > Vendor ID:   GenuineIntel
> > CPU family:  6
> > Model:   47
> > Model name:  Intel(R) Xeon(R) CPU E7- 4870  @ 2.40GHz
> > Stepping:2
> > CPU MHz: 1063.934
> > BogoMIPS:4787.73
> > Virtualization:  VT-x
> > L1d cache:   32K
> > L1i cache:   32K
> > L2 cache:256K
> > L3 cache:30720K
> > NUMA node0 CPU(s):   0-9,80-89
> > NUMA node1 CPU(s):   10-19,90-99
> > NUMA node2 CPU(s):   20-29,100-109
> > NUMA node3 CPU(s):   30-39,110-119
> > NUMA node4 CPU(s):   40-49,120-129
> > NUMA node5 CPU(s):   50-59,130-139
> > NUMA node6 CPU(s):   60-69,140-149
> > NUMA node7 CPU(s):   70-79,150-159
> > Flags:   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge 
> > mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall 
> > nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl 
> > xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl 
> > vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 popcnt aes 
> > lahf_lm epb pti tpr_shadow vnmi flexpriority ept vpid dtherm ida arat
> >
> > $ numactl --hardware
> > available: 8 nodes (0-7)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 80 81 82 83 84 85 86 87 88 89
> > node 0 size: 64177 MB
> > node 0 free: 60866 MB
> > node 1 cpus: 10 11 12 13 14 15 16 17 18 19 90 91 92 93 94 95 96 97 98 99
> > node 1 size: 64507 MB
> > node 1 free: 61167 MB
> > node 2 cpus: 20 21 22 23 24 25 26 27 28 29 100 101 102 103 104 105 106 107 
> > 108
> > 109
> > node 2 size: 64507 MB
> > node 2 free: 61250 MB
> > node 3 cpus: 30 31 32 33 34 35 36 37 38 39 110 111 112 113 114 115 116 117 
> > 118
> > 119
> > node 3 size: 64507 MB
> > node 3 free: 61327 MB
> > node 4 cpus: 40 41 42 43 44 45 46 47 48 49 120 121 122 123 124 125 126 127 
> > 128
> > 129
> > node 4 size: 64507 MB
> > node 4 free: 60993 MB
> > node 5 cpus: 50 51 52 53 54 55 56 57 58 59 130 131 132 133 134 135 136 137 
> > 138
> > 139
> > node 5 size: 64507 MB
> > node 5 free: 60892 MB
> > node 6 cpus: 60 61 62 63 64 65 66 67 68 69 140 141 142 143 144 145 146 147 
> > 148
> > 149
> > node 6 size: 64507 MB
> > node 6 free: 61139 MB
> > node 7 cpus: 70 71 72 73 74 75 76 77 78 79 150 151 152 153 154 155 156 157 
> > 158
> > 159
> > node 7 size: 64480 MB
> > node 7 free: 61188 MB
> > node distances:
> > node   0   1   2   3   4   5   6   7
> >  0:  10  12  17  17  19  19  19  19
> >  1:  12  10  17  17  19  19  19  19
> >  2:  17  17  10  12  19  19  19  19
> >  3:  17  17  12  10  19  19  19  19
> >  4:  19  19  19  19  10  12  17  17
> >  5:  19  19  19  19  12  10  17  17
> >  6:  19  19  19  19  17  17  10  12
> >  7:  19  19  19  19  17  17  12  10
> >
> >
> >
> > --

-- 


Re: [PATCH v3 0/8] sched/fair: rework the CFS load balance

2019-10-08 Thread Phil Auld
Hi Vincent,

On Thu, Sep 19, 2019 at 09:33:31AM +0200 Vincent Guittot wrote:
> Several wrong task placements have been raised with the current load
> balance algorithm, but their fixes are not always straightforward and
> end up using biased values to force migrations. A cleanup and rework
> of the load balance will help handle such use cases and allow the
> scheduler's behavior to be fine-tuned for other cases.
> 
> Patch 1 has already been sent separately and only consolidate asym policy
> in one place and help the review of the changes in load_balance.
> 
> Patch 2 renames the sum of h_nr_running in stats.
> 
> Patch 3 removes meaningless imbalance computation to make review of
> patch 4 easier.
> 
> Patch 4 reworks the load_balance algorithm and fixes some wrong task placements
> but tries to stay conservative.
>
> Patch 5 adds the sum of nr_running to monitor non-CFS tasks and takes that
> into account when pulling tasks.
> 
> Patch 6 replaces runnable_load by load now that the signal is only used
> when overloaded.
> 
> Patch 7 improves the spread of tasks at the 1st scheduling level.
> 
> Patch 8 uses utilization instead of load in all steps of misfit task
> path.
> 
> Patch 9 replaces runnable_load_avg by load_avg in the wake up path.
> 
> Patch 10 optimizes find_idlest_group() that was using both runnable_load
> and load. This has not been squashed with previous patch to ease the
> review.
> 
> Some benchmarks results based on 8 iterations of each tests:
> - small arm64 dual quad cores system
> 
>tip/sched/core     w/ this patchset     improvement
> schedpipe  54981 +/-0.36%55459 +/-0.31%   (+0.97%)
> 
> hackbench
> 1 groups   0.906 +/-2.34%0.906 +/-2.88%   (+0.06%)
> 
> - large arm64 2 nodes / 224 cores system
> 
>tip/sched/core     w/ this patchset     improvement
> schedpipe 125323 +/-0.98%   125624 +/-0.71%   (+0.24%)
> 
> hackbench -l (256000/#grp) -g #grp
> 1 groups  15.360 +/-1.76%   14.206 +/-1.40%   (+8.69%)
> 4 groups   5.822 +/-1.02%5.508 +/-6.45%   (+5.38%)
> 16 groups  3.103 +/-0.80%3.244 +/-0.77%   (-4.52%)
> 32 groups  2.892 +/-1.23%2.850 +/-1.81%   (+1.47%)
> 64 groups  2.825 +/-1.51%2.725 +/-1.51%   (+3.54%)
> 128 groups 3.149 +/-8.46%3.053 +/-13.15%  (+3.06%)
> 256 groups 3.511 +/-8.49%3.019 +/-1.71%  (+14.03%)
> 
> dbench
> 1 groups 329.677 +/-0.46%  329.771 +/-0.11%   (+0.03%)
> 4 groups 931.499 +/-0.79%  947.118 +/-0.94%   (+1.68%)
> 16 groups   1924.210 +/-0.89% 1947.849 +/-0.76%   (+1.23%)
> 32 groups   2350.646 +/-5.75% 2351.549 +/-6.33%   (+0.04%)
> 64 groups   2201.524 +/-3.35% 2192.749 +/-5.84%   (-0.40%)
> 128 groups  2206.858 +/-2.50% 2376.265 +/-7.44%   (+7.68%)
> 256 groups  1263.520 +/-3.34% 1633.143 +/-13.02% (+29.25%)
> 
> tip/sched/core sha1:
>   0413d7f33e60 ('sched/uclamp: Always use 'enum uclamp_id' for clamp_id 
> values')
> 
> Changes since v2:
> - fix typo and reorder code
> - some minor code fixes
> - optimize find_idlest_group()
> 
> Not covered in this patchset:
> - update find_idlest_group() to be more aligned with load_balance(). I didn't
>   want to delay this version because of this update which is not ready yet
> - Better detection of overloaded and fully busy state, especially for cases
>   when nr_running > nr CPUs.
> 
> Vincent Guittot (8):
>   sched/fair: clean up asym packing
>   sched/fair: rename sum_nr_running to sum_h_nr_running
>   sched/fair: remove meaningless imbalance calculation
>   sched/fair: rework load_balance
>   sched/fair: use rq->nr_running when balancing load
>   sched/fair: use load instead of runnable load in load_balance
>   sched/fair: evenly spread tasks when not overloaded
>   sched/fair: use utilization to select misfit task
>   sched/fair: use load instead of runnable load in wakeup path
>   sched/fair: optimize find_idlest_group
> 
>  kernel/sched/fair.c | 805 
> +++-
>  1 file changed, 417 insertions(+), 388 deletions(-)
> 
> -- 
> 2.7.4
> 

We've been testing v3 and for the most part everything looks good. The 
group imbalance issues are fixed on all of our test systems except one.

The one is an 8-node intel system with 160 cpus. I'll put the system 
details at the end. 

This shows the average number of benchmark threads running on each node
through the run. That is, not including the 2 stress jobs. The end 
results are a 4x slow down in the cgroup case versus not. The 152 and 
156 are the number of LU threads in the run. In all cases there are 2 
stress CPU threads running either in their own cgroups (GROUP) or 
everything is in one cgroup (NORMAL).  The normal case is pretty well 
balanced with only a few >= 20 and those that are are only a little 
over. In the GROUP cases things are not so good. There are some > 30 
for example, and others < 10.


lu.C.x_152_GROUP_1   17.52  16.86  17.90  

Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision

2019-10-07 Thread Phil Auld
On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> quota/period ratio is used to ensure a child task group won't get more
> bandwidth than the parent task group, and is calculated as:
> normalized_cfs_quota() = [(quota_us << 20) / period_us]
> 
> If the quota/period ratio was changed during this scaling due to
> precision loss, it will cause inconsistency between parent and child
> task groups. See below example:
> A userspace container manager (kubelet) does three operations:
> 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> 2) Create a few children cgroups.
> 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> 
> These operations are expected to succeed. However, if the scaling of
> 147/128 happens before step 3), quota and period of the parent cgroup
> will be changed:
> new_quota: 1148437ns, 1148us
> new_period: 11484375ns, 11484us
> 
> And when step 3) comes in, the ratio of the child cgroup will be 104857,
> which will be larger than the parent cgroup ratio (104821), and will
> fail.
> 
> Scaling them by a factor of 2 will fix the problem.
> 
> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to 
> avoid hard lockup")
> Signed-off-by: Xuewei Zhang 


I managed to get it to trigger the second case. It took 50,000 children (20x my 
initial tests).

[ 1367.850630] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 4340, cfs_quota_us = 25)
[ 1370.390832] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 8680, cfs_quota_us = 50)
[ 1372.914689] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 17360, cfs_quota_us = 100)
[ 1375.447431] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 34720, cfs_quota_us = 200)
[ 1377.982785] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 69440, cfs_quota_us = 400)
[ 1380.481702] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 138880, cfs_quota_us = 800)
[ 1382.894692] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 277760, cfs_quota_us = 1600)
[ 1385.264872] cfs_period_timer[cpu11]: period too short, scaling up (new 
cfs_period_us = 555520, cfs_quota_us = 3200)
[ 1393.965140] cfs_period_timer[cpu11]: period too short, but cannot scale up 
without losing precision (cfs_period_us = 555520, cfs_quota_us = 3200)

I suspect going higher could cause the original lockup, but that'd be the case 
with the old code as well. 
And this also gets us out of it faster.
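
For what it's worth, here is a small userspace sketch, not kernel code, that
just counts how many rescalings each factor needs before the period reaches
the 1s cap; the 4340us starting period is taken from the log above and the
rest is assumed:

#include <stdio.h>
#include <stdint.h>

/* Count how many multiplications by num/den it takes to grow a period
 * up to the 1s max_cfs_quota_period, mimicking the rescaling loop. */
static int rescalings(uint64_t period_ns, uint64_t num, uint64_t den)
{
        int n = 0;

        while (period_ns < 1000000000ULL) {
                period_ns = period_ns * num / den;
                n++;
        }
        return n;
}

int main(void)
{
        uint64_t start = 4340ULL * 1000;        /* 4340us, as in the log above */

        printf("147/128 scaling: %d steps\n", rescalings(start, 147, 128));
        printf("2x scaling:      %d steps\n", rescalings(start, 2, 1));  /* 8 */
        return 0;
}

The old ~115% factor needs roughly five times as many rescalings (and
ratelimited warnings) to get out of the loop.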


Tested-by: Phil Auld 


Cheers,
Phil


> ---
>  kernel/sched/fair.c | 36 ++--
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e2374f..b3d3d0a231cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4926,20 +4926,28 @@ static enum hrtimer_restart 
> sched_cfs_period_timer(struct hrtimer *timer)
>   if (++count > 3) {
>   u64 new, old = ktime_to_ns(cfs_b->period);
>  
> - new = (old * 147) / 128; /* ~115% */
> - new = min(new, max_cfs_quota_period);
> -
> - cfs_b->period = ns_to_ktime(new);
> -
> - /* since max is 1s, this is limited to 1e9^2, which 
> fits in u64 */
> - cfs_b->quota *= new;
> - cfs_b->quota = div64_u64(cfs_b->quota, old);
> -
> - pr_warn_ratelimited(
> - "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> cfs_period_us %lld, cfs_quota_us = %lld)\n",
> - smp_processor_id(),
> - div_u64(new, NSEC_PER_USEC),
> - div_u64(cfs_b->quota, NSEC_PER_USEC));
> + /*
> +  * Grow period by a factor of 2 to avoid lossing 
> precision.
> +  * Precision loss in the quota/period ratio can cause 
> __cfs_schedulable
> +  * to fail.
> +  */
> + new = old * 2;
> + if (new < max_cfs_quota_period) {
> + cfs_b->period = ns_to_ktime(new);
> + cfs_b->quota *= 2;
> +
> + pr_warn_ratelimited(
> + "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> + smp_processor_id(),
> + div_u64(new, NSEC_PER_USEC),
&

Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision

2019-10-07 Thread Phil Auld
Hi Xuewei,

On Fri, Oct 04, 2019 at 05:28:15PM -0700 Xuewei Zhang wrote:
> On Fri, Oct 4, 2019 at 6:14 AM Phil Auld  wrote:
> >
> > On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> > > +cc neeln...@google.com and hao...@google.com, they helped a lot
> > > for this issue. Sorry I forgot to include them when sending out the patch.
> > >
> > > On Thu, Oct 3, 2019 at 5:55 PM Phil Auld  wrote:
> > > >
> > > > Hi,
> > > >
> > > > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > > > quota/period ratio is used to ensure a child task group won't get more
> > > > > bandwidth than the parent task group, and is calculated as:
> > > > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > > > >
> > > > > If the quota/period ratio was changed during this scaling due to
> > > > > precision loss, it will cause inconsistency between parent and child
> > > > > task groups. See below example:
> > > > > A userspace container manager (kubelet) does three operations:
> > > > > 1) Create a parent cgroup, set quota to 1,000us and period to 
> > > > > 10,000us.
> > > > > 2) Create a few children cgroups.
> > > > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > > > >
> > > > > These operations are expected to succeed. However, if the scaling of
> > > > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > > > will be changed:
> > > > > new_quota: 1148437ns, 1148us
> > > > > new_period: 11484375ns, 11484us
> > > > >
> > > > > And when step 3) comes in, the ratio of the child cgroup will be 
> > > > > 104857,
> > > > > which will be larger than the parent cgroup ratio (104821), and will
> > > > > fail.
> > > > >
> > > > > Scaling them by a factor of 2 will fix the problem.
> > > >
> > > > I have no issues with the concept. We went around a bit about the actual
> > > > numbers and made it an approximation.
> > > >
> > > > >
> > > > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop 
> > > > > to avoid hard lockup")
> > > > > Signed-off-by: Xuewei Zhang 
> > > > > ---
> > > > >  kernel/sched/fair.c | 36 ++--
> > > > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > > > >
> > > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > > > --- a/kernel/sched/fair.c
> > > > > +++ b/kernel/sched/fair.c
> > > > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart 
> > > > > sched_cfs_period_timer(struct hrtimer *timer)
> > > > >   if (++count > 3) {
> > > > >   u64 new, old = ktime_to_ns(cfs_b->period);
> > > > >
> > > > > - new = (old * 147) / 128; /* ~115% */
> > > > > - new = min(new, max_cfs_quota_period);
> > > > > -
> > > > > - cfs_b->period = ns_to_ktime(new);
> > > > > -
> > > > > - /* since max is 1s, this is limited to 1e9^2, 
> > > > > which fits in u64 */
> > > > > - cfs_b->quota *= new;
> > > > > - cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > > > -
> > > > > - pr_warn_ratelimited(
> > > > > - "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> > > > > cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > > > - smp_processor_id(),
> > > > > - div_u64(new, NSEC_PER_USEC),
> > > > > - div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > > > + /*
> > > > > +  * Grow period by a factor of 2 to avoid 
> > > > > lossing precision.
> > > > > +  * Precision loss in the quota/period ratio can 
> > > > > cause __cfs_schedulable
> > > > > +  * to fail.
> &

Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision

2019-10-04 Thread Phil Auld
On Thu, Oct 03, 2019 at 07:05:56PM -0700 Xuewei Zhang wrote:
> +cc neeln...@google.com and hao...@google.com, they helped a lot
> for this issue. Sorry I forgot to include them when sending out the patch.
> 
> On Thu, Oct 3, 2019 at 5:55 PM Phil Auld  wrote:
> >
> > Hi,
> >
> > On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> > > quota/period ratio is used to ensure a child task group won't get more
> > > bandwidth than the parent task group, and is calculated as:
> > > normalized_cfs_quota() = [(quota_us << 20) / period_us]
> > >
> > > If the quota/period ratio was changed during this scaling due to
> > > precision loss, it will cause inconsistency between parent and child
> > > task groups. See below example:
> > > A userspace container manager (kubelet) does three operations:
> > > 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> > > 2) Create a few children cgroups.
> > > 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> > >
> > > These operations are expected to succeed. However, if the scaling of
> > > 147/128 happens before step 3), quota and period of the parent cgroup
> > > will be changed:
> > > new_quota: 1148437ns, 1148us
> > > new_period: 11484375ns, 11484us
> > >
> > > And when step 3) comes in, the ratio of the child cgroup will be 104857,
> > > which will be larger than the parent cgroup ratio (104821), and will
> > > fail.
> > >
> > > Scaling them by a factor of 2 will fix the problem.
> >
> > I have no issues with the concept. We went around a bit about the actual
> > numbers and made it an approximation.
> >
> > >
> > > Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to 
> > > avoid hard lockup")
> > > Signed-off-by: Xuewei Zhang 
> > > ---
> > >  kernel/sched/fair.c | 36 ++--
> > >  1 file changed, 22 insertions(+), 14 deletions(-)
> > >
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 83ab35e2374f..b3d3d0a231cd 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -4926,20 +4926,28 @@ static enum hrtimer_restart 
> > > sched_cfs_period_timer(struct hrtimer *timer)
> > >   if (++count > 3) {
> > >   u64 new, old = ktime_to_ns(cfs_b->period);
> > >
> > > - new = (old * 147) / 128; /* ~115% */
> > > - new = min(new, max_cfs_quota_period);
> > > -
> > > - cfs_b->period = ns_to_ktime(new);
> > > -
> > > - /* since max is 1s, this is limited to 1e9^2, which 
> > > fits in u64 */
> > > - cfs_b->quota *= new;
> > > - cfs_b->quota = div64_u64(cfs_b->quota, old);
> > > -
> > > - pr_warn_ratelimited(
> > > - "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> > > cfs_period_us %lld, cfs_quota_us = %lld)\n",
> > > - smp_processor_id(),
> > > - div_u64(new, NSEC_PER_USEC),
> > > - div_u64(cfs_b->quota, NSEC_PER_USEC));
> > > + /*
> > > +  * Grow period by a factor of 2 to avoid lossing 
> > > precision.
> > > +  * Precision loss in the quota/period ratio can 
> > > cause __cfs_schedulable
> > > +  * to fail.
> > > +  */
> > > + new = old * 2;
> > > + if (new < max_cfs_quota_period) {
> >
> > I don't like this part as much. There may be a value between
> > max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
> > the loop. Possibly in practice it won't matter but here you trigger the
> > warning and take no action to keep it from continuing.
> >
> > Also, if you are actually hitting this then you might want to just start at
> > a higher but proportional quota and period.
> 
> I'd like to do what you suggested. A quick idea would be to scale period to
> max_cfs_quota_period, and scale quota proportionally. However the naive
> implementation won't work under this edge case:
> original:
> quota: 500,000us  period: 570,000us
> after scaling:
> quota: 877,192us  per

Re: [PATCH] sched/fair: scale quota and period without losing quota/period ratio precision

2019-10-03 Thread Phil Auld
Hi,

On Thu, Oct 03, 2019 at 05:12:43PM -0700 Xuewei Zhang wrote:
> quota/period ratio is used to ensure a child task group won't get more
> bandwidth than the parent task group, and is calculated as:
> normalized_cfs_quota() = [(quota_us << 20) / period_us]
> 
> If the quota/period ratio was changed during this scaling due to
> precision loss, it will cause inconsistency between parent and child
> task groups. See below example:
> A userspace container manager (kubelet) does three operations:
> 1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
> 2) Create a few children cgroups.
> 3) Set quota to 1,000us and period to 10,000us on a child cgroup.
> 
> These operations are expected to succeed. However, if the scaling of
> 147/128 happens before step 3), quota and period of the parent cgroup
> will be changed:
> new_quota: 1148437ns, 1148us
> new_period: 11484375ns, 11484us
> 
> And when step 3) comes in, the ratio of the child cgroup will be 104857,
> which will be larger than the parent cgroup ratio (104821), and will
> fail.
> 
> Scaling them by a factor of 2 will fix the problem.

I have no issues with the concept. We went around a bit about the actual
numbers and made it an approximation. 
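
For anyone who wants to see the arithmetic, here is a small userspace sketch,
purely illustrative and not the kernel implementation, that mirrors the
changelog's normalized_cfs_quota() = (quota_us << 20) / period_us with the
numbers from the example above:

#include <stdio.h>
#include <stdint.h>

static uint64_t normalized(uint64_t quota_us, uint64_t period_us)
{
        return (quota_us << 20) / period_us;
}

int main(void)
{
        /* parent as configured: 1,000us / 10,000us */
        printf("parent, original:       %llu\n",
               (unsigned long long)normalized(1000, 10000));    /* 104857 */
        /* parent after one 147/128 rescale: 1,148us / 11,484us */
        printf("parent, 147/128 scaled: %llu\n",
               (unsigned long long)normalized(1148, 11484));    /* 104821 */
        /* parent after a 2x rescale: 2,000us / 20,000us, ratio unchanged */
        printf("parent, 2x scaled:      %llu\n",
               (unsigned long long)normalized(2000, 20000));    /* 104857 */
        return 0;
}

A child later configured with 1,000us/10,000us computes 104857, which is
larger than the rescaled parent's 104821, so that write fails; with the 2x
scaling the parent stays at 104857 and it succeeds.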

> 
> Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to 
> avoid hard lockup")
> Signed-off-by: Xuewei Zhang 
> ---
>  kernel/sched/fair.c | 36 ++--
>  1 file changed, 22 insertions(+), 14 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 83ab35e2374f..b3d3d0a231cd 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4926,20 +4926,28 @@ static enum hrtimer_restart 
> sched_cfs_period_timer(struct hrtimer *timer)
>   if (++count > 3) {
>   u64 new, old = ktime_to_ns(cfs_b->period);
>  
> - new = (old * 147) / 128; /* ~115% */
> - new = min(new, max_cfs_quota_period);
> -
> - cfs_b->period = ns_to_ktime(new);
> -
> - /* since max is 1s, this is limited to 1e9^2, which 
> fits in u64 */
> - cfs_b->quota *= new;
> - cfs_b->quota = div64_u64(cfs_b->quota, old);
> -
> - pr_warn_ratelimited(
> - "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> cfs_period_us %lld, cfs_quota_us = %lld)\n",
> - smp_processor_id(),
> - div_u64(new, NSEC_PER_USEC),
> - div_u64(cfs_b->quota, NSEC_PER_USEC));
> + /*
> +  * Grow period by a factor of 2 to avoid lossing 
> precision.
> +  * Precision loss in the quota/period ratio can cause 
> __cfs_schedulable
> +  * to fail.
> +  */
> + new = old * 2;
> + if (new < max_cfs_quota_period) {

I don't like this part as much. There may be a value between
max_cfs_quota_period/2 and max_cfs_quota_period that would get us out of
the loop. Possibly in practice it won't matter but here you trigger the
warning and take no action to keep it from continuing.

Also, if you are actually hitting this then you might want to just start at
a higher but proportional quota and period.


Cheers,
Phil

> + cfs_b->period = ns_to_ktime(new);
> + cfs_b->quota *= 2;
> +
> + pr_warn_ratelimited(
> + "cfs_period_timer[cpu%d]: period too short, scaling up (new 
> cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> + smp_processor_id(),
> + div_u64(new, NSEC_PER_USEC),
> + div_u64(cfs_b->quota, NSEC_PER_USEC));
> + } else {
> + pr_warn_ratelimited(
> + "cfs_period_timer[cpu%d]: period too short, but cannot scale up without 
> losing precision (cfs_period_us = %lld, cfs_quota_us = %lld)\n",
> + smp_processor_id(),
> + div_u64(old, NSEC_PER_USEC),
> + div_u64(cfs_b->quota, NSEC_PER_USEC));
> + }
>  
>   /* reset count so we don't come right back in here */
>   count = 0;
> -- 
> 2.23.0.581.g78d2f28ef7-goog
> 

-- 


Re: [PATCH v2 0/8] sched/fair: rework the CFS load balance

2019-08-29 Thread Phil Auld
5  0.23  0.00  0.62
lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average1.67  0.00  0.00  0.33

lu.C.x_76_GROUP_1.ps.numa.hist   Average30.45  6.95  4.52  34.08
lu.C.x_76_GROUP_2.ps.numa.hist   Average32.33  8.94  9.21  25.52
lu.C.x_76_GROUP_3.ps.numa.hist   Average30.45  8.91  12.09  24.55
lu.C.x_76_NORMAL_1.ps.numa.hist  Average18.54  19.23  19.69  18.54
lu.C.x_76_NORMAL_2.ps.numa.hist  Average17.25  19.83  20.00  18.92

76_GROUPMop/s===
min q1  median  q3  max
2119.92 2418.1  2716.28 3147.82 3579.36
76_GROUPtime
min q1  median  q3  max
569.65  660.155 750.66  856.245 961.83
76_NORMALMop/s===
min q1  median  q3  max
30424.5 31486.4 31486.4 31486.4 32548.4
76_NORMALtime
min q1  median  q3  max
62.65   64.835  64.835  64.835  67.02


After (linux-5.3-rc1+  @  a1dc0446d649 + this v2 series pulled from 
Vincent's git on ~8/15)

lu.C.x_76_GROUP_1.stress.ps.numa.hist   Average0.36  1.00  0.64
lu.C.x_76_GROUP_2.stress.ps.numa.hist   Average1.00  1.00
lu.C.x_76_GROUP_3.stress.ps.numa.hist   Average1.00  1.00
lu.C.x_76_NORMAL_1.stress.ps.numa.hist  Average0.23  0.15  0.31  1.31
lu.C.x_76_NORMAL_2.stress.ps.numa.hist  Average1.00  0.00  0.00  1.00

lu.C.x_76_GROUP_1.ps.numa.hist   Average18.91  18.36  18.91  19.82
lu.C.x_76_GROUP_2.ps.numa.hist   Average18.36  18.00  19.91  19.73
lu.C.x_76_GROUP_3.ps.numa.hist   Average18.17  18.42  19.25  20.17
lu.C.x_76_NORMAL_1.ps.numa.hist  Average19.08  20.00  18.62  18.31
lu.C.x_76_NORMAL_2.ps.numa.hist  Average18.09  19.91  19.18  18.82

76_GROUPMop/s===
min q1  median  q3  max
32304.1 33176   34047.9 34166.8 34285.7
76_GROUPtime
min q1  median  q3  max
59.47   59.68   59.89   61.505  63.12
76_NORMALMop/s===
min q1  median  q3  max
29825.5 32454   32454   32454   35082.5
76_NORMALtime
min q1  median  q3  max
58.12   63.24   63.24   63.24   68.36


I had initially tracked this down to two issues. The first was picking the wrong
group in find_busiest_group due to using the average load. The second was in 
fix_small_imbalance(). The "load" of the lu.C tasks was so low it often failed 
to move anything even when it did find a group that was overloaded
(nr_running > width). I have two small patches which fix this but since
Vincent was embarking on a re-work which also addressed this I dropped them.

We've also run a series of performance tests we use to check for regressions
and did not find any bad results on our workloads and systems.

So...

Tested-by: Phil Auld 


Cheers,
Phil
-- 


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-29 Thread Phil Auld
On Wed, Aug 28, 2019 at 06:01:14PM +0200 Peter Zijlstra wrote:
> On Wed, Aug 28, 2019 at 11:30:34AM -0400, Phil Auld wrote:
> > On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> 
> > > And given MDS, I'm still not entirely convinced it all makes sense. If
> > > it were just L1TF, then yes, but now...
> > 
> > I was thinking MDS is really the reason for this. L1TF has mitigations but
> > the only current mitigation for MDS for smt is ... nosmt. 
> 
> L1TF has no known mitigation that is SMT safe. The moment you have
> something in your L1, the other sibling can read it using L1TF.
> 
> The nice thing about L1TF is that only (malicious) guests can exploit
> it, and therefore the synchronizatin context is VMM. And it so happens
> that VMEXITs are 'rare' (and already expensive and thus lots of effort
> has already gone into avoiding them).
> 
> If you don't use VMs, you're good and SMT is not a problem.
> 
> If you do use VMs (and do/can not trust them), _then_ you need
> core-scheduling; and in that case, the implementation under discussion
> misses things like synchronization on VMEXITs due to interrupts and
> things like that.
> 
> But under the assumption that VMs don't generate high scheduling rates,
> it can work.
> 
> > The current core scheduler implementation, I believe, still has 
> > (theoretical?) 
> > holes involving interrupts, once/if those are closed it may be even less 
> > attractive.
> 
> No; so MDS leaks anything the other sibling (currently) does, this makes
> _any_ privilidge boundary a synchronization context.
> 
> Worse still, the exploit doesn't require a VM at all, any other task can
> get to it.
> 
> That means you get to sync the siblings on lovely things like system
> call entry and exit, along with VMM and anything else that one would
> consider a privilidge boundary. Now, system calls are not rare, they
> are really quite common in fact. Trying to sync up siblings at the rate
> of system calls is utter madness.
> 
> So under MDS, SMT is completely hosed. If you use VMs exclusively, then
> it _might_ work because a 'pure' host doesn't schedule that often
> (maybe, same assumption as for L1TF).
> 
> Now, there have been proposals of moving the privilidge boundary further
> into the kernel. Just like PTI exposes the entry stack and code to
> Meltdown, the thinking is, lets expose more. By moving the priv boundary
> the hope is that we can do lots of common system calls without having to
> sync up -- lots of details are 'pending'.


Thanks for clarifying. My understanding is (somewhat) less fuzzy now. :)

I think, though, that you were basically agreeing with me that the current 
core scheduler does not close the holes, or am I reading that wrong?


Cheers,
Phil

-- 


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-28 Thread Phil Auld
On Tue, Aug 27, 2019 at 11:50:35PM +0200 Peter Zijlstra wrote:
> On Tue, Aug 27, 2019 at 10:14:17PM +0100, Matthew Garrett wrote:
> > Apple have provided a sysctl that allows applications to indicate that 
> > specific threads should make use of core isolation while allowing 
> > the rest of the system to make use of SMT, and browsers (Safari, Firefox 
> > and Chrome, at least) are now making use of this. Trying to do something 
> > similar using cgroups seems a bit awkward. Would something like this be 
> > reasonable? 
> 
> Sure; like I wrote earlier; I only did the cgroup thing because I was
> lazy and it was the easiest interface to hack on in a hurry.
> 
> The rest of the ABI nonsense can 'trivially' be done later; if when we
> decide to actually do this.

I think something that allows the tag to be set may be needed. One of 
the use cases for this is virtualization stacks, where you really want
to be able to keep the higher CPU count and to set up the isolation 
from management processes on the host. 

The current cgroup interface doesn't work for that because it doesn't 
apply the tag to children. We've been unable to fully test it in a virt
setup because our VMs are made of a child cgroup per vcpu. 

> 
> And given MDS, I'm still not entirely convinced it all makes sense. If
> it were just L1TF, then yes, but now...

I was thinking MDS is really the reason for this. L1TF has mitigations but
the only current mitigation for MDS for smt is ... nosmt. 

The current core scheduler implementation, I believe, still has (theoretical?) 
holes involving interrupts, once/if those are closed it may be even less 
attractive.

> 
> > Having spoken to the Chrome team, I believe that the 
> > semantics we want are:
> > 
> > 1) A thread to be able to indicate that it should not run on the same 
> > core as anything not in posession of the same cookie
> > 2) Descendents of that thread to (by default) have the same cookie
> > 3) No other thread be able to obtain the same cookie
> > 4) Threads not be able to rejoin the global group (ie, threads can 
> > segregate themselves from their parent and peers, but can never rejoin 
> > that group once segregated)
> > 
> > but don't know if that's what everyone else would want.
> > 
> > diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
> > index 094bb03b9cc2..5d411246d4d5 100644
> > --- a/include/uapi/linux/prctl.h
> > +++ b/include/uapi/linux/prctl.h
> > @@ -229,4 +229,5 @@ struct prctl_mm_map {
> >  # define PR_PAC_APDBKEY(1UL << 3)
> >  # define PR_PAC_APGAKEY(1UL << 4)
> >  
> > +#define PR_CORE_ISOLATE55
> >  #endif /* _LINUX_PRCTL_H */
> > diff --git a/kernel/sys.c b/kernel/sys.c
> > index 12df0e5434b8..a054cfcca511 100644
> > --- a/kernel/sys.c
> > +++ b/kernel/sys.c
> > @@ -2486,6 +2486,13 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, 
> > arg2, unsigned long, arg3,
> > return -EINVAL;
> > error = PAC_RESET_KEYS(me, arg2);
> > break;
> > +   case PR_CORE_ISOLATE:
> > +#ifdef CONFIG_SCHED_CORE
> > +   current->core_cookie = (unsigned long)current;
> 
> This needs to then also force a reschedule of current. And there's the
> little issue of what happens if 'current' dies while its children live
> on, and current gets re-used for a new process and does this again.

sched_core_get() too?
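
For reference, from userspace the interface proposed in the diff above would
presumably be exercised with a plain prctl() call; a hypothetical sketch only,
against the unmerged PR_CORE_ISOLATE value:

#include <stdio.h>
#include <sys/prctl.h>

#ifndef PR_CORE_ISOLATE
#define PR_CORE_ISOLATE 55      /* value from the proposed patch, not upstream */
#endif

int main(void)
{
        /* Tag the calling thread (and, per the proposal, its descendants) so
         * it never shares a core with threads holding a different cookie. */
        if (prctl(PR_CORE_ISOLATE, 0, 0, 0, 0) != 0)
                perror("prctl(PR_CORE_ISOLATE)");
        return 0;
}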


Cheers,
Phil

> 
> > +#else
> > +   result = -EINVAL;
> > +#endif
> > +   break;
> > default:
> > error = -EINVAL;
> > break;
> > 
> > 
> > -- 
> > Matthew Garrett | mj...@srcf.ucam.org

-- 


Re: [PATCH -next v2] sched/fair: fix -Wunused-but-set-variable warnings

2019-08-23 Thread Phil Auld
On Fri, Aug 23, 2019 at 10:28:02AM -0700 bseg...@google.com wrote:
> Dave Chiluk  writes:
> 
> > On Wed, Aug 21, 2019 at 12:36 PM  wrote:
> >>
> >> Qian Cai  writes:
> >>
> >> > The linux-next commit "sched/fair: Fix low cpu usage with high
> >> > throttling by removing expiration of cpu-local slices" [1] introduced a
> >> > few compilation warnings,
> >> >
> >> > kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
> >> > kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used
> >> > [-Wunused-but-set-variable]
> >> > kernel/sched/fair.c: In function 'start_cfs_bandwidth':
> >> > kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used
> >> > [-Wunused-but-set-variable]
> >> >
> >> > Also, __refill_cfs_bandwidth_runtime() does no longer update the
> >> > expiration time, so fix the comments accordingly.
> >> >
> >> > [1] 
> >> > https://lore.kernel.org/lkml/1558121424-2914-1-git-send-email-chiluk+li...@indeed.com/
> >> >
> >> > Signed-off-by: Qian Cai 
> >>
> >> Reviewed-by: Ben Segall 
> >>
> >> > ---
> >> >
> >> > v2: Keep hrtimer_forward_now() in start_cfs_bandwidth() per Ben.
> >> >
> >> >  kernel/sched/fair.c | 19 ++-
> >> >  1 file changed, 6 insertions(+), 13 deletions(-)
> >> >
> >> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> > index 84959d3285d1..06782491691f 100644
> >> > --- a/kernel/sched/fair.c
> >> > +++ b/kernel/sched/fair.c
> >> > @@ -4354,21 +4354,16 @@ static inline u64 sched_cfs_bandwidth_slice(void)
> >> >  }
> >> >
> >> >  /*
> >> > - * Replenish runtime according to assigned quota and update expiration 
> >> > time.
> >> > - * We use sched_clock_cpu directly instead of rq->clock to avoid adding
> >> > - * additional synchronization around rq->lock.
> >> > + * Replenish runtime according to assigned quota. We use sched_clock_cpu
> >> > + * directly instead of rq->clock to avoid adding additional 
> >> > synchronization
> >> > + * around rq->lock.
> >> >   *
> >> >   * requires cfs_b->lock
> >> >   */
> >> >  void __refill_cfs_bandwidth_runtime(struct cfs_bandwidth *cfs_b)
> >> >  {
> >> > - u64 now;
> >> > -
> >> > - if (cfs_b->quota == RUNTIME_INF)
> >> > - return;
> >> > -
> >> > - now = sched_clock_cpu(smp_processor_id());
> >> > - cfs_b->runtime = cfs_b->quota;
> >> > + if (cfs_b->quota != RUNTIME_INF)
> >> > + cfs_b->runtime = cfs_b->quota;
> >> >  }
> >> >
> >> >  static inline struct cfs_bandwidth *tg_cfs_bandwidth(struct task_group 
> >> > *tg)
> >> > @@ -4989,15 +4984,13 @@ static void init_cfs_rq_runtime(struct cfs_rq 
> >> > *cfs_rq)
> >> >
> >> >  void start_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
> >> >  {
> >> > - u64 overrun;
> >> > -
> >> >   lockdep_assert_held(&cfs_b->lock);
> >> >
> >> >   if (cfs_b->period_active)
> >> >   return;
> >> >
> >> >   cfs_b->period_active = 1;
> >> > - overrun = hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >> > + hrtimer_forward_now(&cfs_b->period_timer, cfs_b->period);
> >> >   hrtimer_start_expires(&cfs_b->period_timer, 
> >> > HRTIMER_MODE_ABS_PINNED);
> >> >  }
> >
> > Looks good.
> > Reviewed-by: Dave Chiluk 
> >
> > Sorry for the slow response, I was on vacation.
> >
> > @Ben do you think it would be useful to still capture overrun, and
> > WARN on any overruns?  We wouldn't expect overruns, but their
> > existence would indicate an over-loaded node or too short of a
> > cfs_period.  Additionally, it would be interesting to see if we could
> > capture the offset between when the bandwidth was refilled, and when
> > the timer was supposed to fire.  I've always done all my calculations
> > assuming that the timer fires and is handled exceedingly close to the
> > time it was supposed to fire.  Although, if the node is running that
> > overloaded you probably have many more problems than worrying about
> > timer warnings.
> 
> That "overrun" there is not really an overrun - it's the number of
> complete periods the timer has been inactive for. It was used so that a
> given tg's period timer would keep the same
> phase/offset/whatever-you-call-it, even if it goes idle for a while,
> rather than having the next period start N ms after a task wakes up.
> 
> Also, poor choices by userspace is not generally something the kernel
> generally WARNs on, as I understand it.

I don't think it matters in the start_cfs_bandwidth case, anyway. We do 
effectively check in sched_cfs_period_timer. 

Cleanup looks okay to me as well.
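
For anyone following along, here is a userspace sketch, purely illustrative
and not the hrtimer implementation, of what that discarded return value
represents:

#include <stdio.h>
#include <stdint.h>

/* Conceptually what hrtimer_forward_now() does: push 'expires' forward by
 * whole periods until it is in the future, and return how many periods
 * were added.  start_cfs_bandwidth() only wants the side effect of keeping
 * the timer phase-aligned, not the count. */
static uint64_t forward_now(uint64_t *expires, uint64_t now, uint64_t period)
{
        uint64_t n;

        if (*expires > now)
                return 0;
        n = (now - *expires) / period + 1;
        *expires += n * period;
        return n;
}

int main(void)
{
        uint64_t expires = 100, now = 1050, period = 100;  /* arbitrary units */
        uint64_t overrun = forward_now(&expires, now, period);

        /* overrun == 10: ten whole periods had to be added to push the expiry
         * past 'now'; expires is 1100, still aligned to the original phase. */
        printf("overrun=%llu expires=%llu\n",
               (unsigned long long)overrun, (unsigned long long)expires);
        return 0;
}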


Cheers,
Phil

-- 


[PATCH] sched/rt: silence double clock update warning by using rq_lock wrappers

2019-08-15 Thread Phil Auld
With WARN_DOUBLE_CLOCK enabled a false positive warning can occur in rt

[] rq->clock_update_flags & RQCF_UPDATED
[] WARNING: CPU: 6 PID: 21426 at kernel/sched/core.c:225 
update_rq_clock+0x90/0x130
  [] Call Trace:
  []  update_rq_clock+0x90/0x130
  []  sched_rt_period_timer+0x11f/0x340
  []  __hrtimer_run_queues+0x100/0x280
  []  hrtimer_interrupt+0x100/0x220
  []  smp_apic_timer_interrupt+0x6a/0x130
  []  apic_timer_interrupt+0xf/0x20

sched_rt_period_timer does:
raw_spin_lock(&rq->lock);
update_rq_clock(rq);

which triggers the warning because of not using the rq_lock wrappers.
So, use the wrappers.

Signed-off-by: Phil Auld 
Cc: Peter Zijlstra (Intel) 
Cc: Ingo Molnar 
Cc: Valentin Schneider 
Cc: Dietmar Eggemann 
---
 kernel/sched/rt.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index a532558a5176..0846e71114ee 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -831,6 +831,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth 
*rt_b, int overrun)
int enqueue = 0;
struct rt_rq *rt_rq = sched_rt_period_rt_rq(rt_b, i);
struct rq *rq = rq_of_rt_rq(rt_rq);
+   struct rq_flags rf;
int skip;
 
/*
@@ -845,7 +846,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth 
*rt_b, int overrun)
if (skip)
continue;
 
-   raw_spin_lock(&rq->lock);
+   rq_lock(rq, &rf);
update_rq_clock(rq);
 
if (rt_rq->rt_time) {
@@ -883,7 +884,7 @@ static int do_sched_rt_period_timer(struct rt_bandwidth 
*rt_b, int overrun)
 
if (enqueue)
sched_rt_rq_enqueue(rt_rq);
-   raw_spin_unlock(&rq->lock);
+   rq_unlock(rq, &rf);
}
 
if (!throttled && (!rt_bandwidth_enabled() || rt_b->rt_runtime == 
RUNTIME_INF))
-- 
2.18.0



Re: [PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-15 Thread Phil Auld
On Fri, Aug 09, 2019 at 06:43:09PM +0100 Valentin Schneider wrote:
> On 09/08/2019 14:33, Phil Auld wrote:
> > On Tue, Aug 06, 2019 at 03:03:34PM +0200 Peter Zijlstra wrote:
> >> On Thu, Aug 01, 2019 at 09:37:49AM -0400, Phil Auld wrote:
> >>> Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
> >>
> >> ISTR there were more issues; but it sure is good to start picking them
> >> off.
> >>
> > 
> > Following up on this I hit another in rt.c which looks like:
> > 
> > [  156.348854] Call Trace:
> > [  156.351301]  <IRQ>
> > [  156.353322]  sched_rt_period_timer+0x124/0x350
> > [  156.357766]  ? sched_rt_rq_enqueue+0x90/0x90
> > [  156.362037]  __hrtimer_run_queues+0xfb/0x270
> > [  156.366303]  hrtimer_interrupt+0x122/0x270
> > [  156.370403]  smp_apic_timer_interrupt+0x6a/0x140
> > [  156.375022]  apic_timer_interrupt+0xf/0x20
> > [  156.379119]  </IRQ>
> > 
> > It looks like the same issue of not using the rq_lock* wrappers and
> > hence not using the pinning. From looking at the code there is at 
> > least one potential hit in deadline.c in the push_dl_task path with 
> > find_lock_later_rq but I have not hit that in practice.
> > 
> > This commit, which introduced the warning, seems to imply that the use
> > of the rq_lock* wrappers is required, at least for any sections that will
> > call update_rq_clock:
> > 
> > commit 26ae58d23b94a075ae724fd18783a3773131cfbc
> > Author: Peter Zijlstra 
> > Date:   Mon Oct 3 16:53:49 2016 +0200
> > 
> > sched/core: Add WARNING for multiple update_rq_clock() calls
> > 
> > Now that we have no missing calls, add a warning to find multiple
> > calls.
> > 
> > By having only a single update_rq_clock() call per rq-lock section,
> > the section appears 'atomic' wrt time.
> > 
> > 
> > Is that the case? Otherwise we have these false positives.
> > 
> 
> Looks like it - only rq_pin_lock() clears RQCF_UPDATED, so any
> update_rq_clock() that isn't preceded by that function will still have
> RQCF_UPDATED set the second time it's executed and will trigger the warn.
> 
> Seeing as the wrappers boil down to raw_spin_*() when the debug bits are
> disabled, I don't see why we wouldn't want to convert these callsites.
> 

The one above is easy enough.  After that I hit one related to the 
double_rq_lock
paths. Now I see why that was not cleaned up already. That's going to be a 
bit messier and will require some study. 

I'll post this trivial anyway. 

Cheers,
Phil

-- 


Re: [tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-12 Thread Phil Auld
On Mon, Aug 12, 2019 at 05:52:04AM -0700 tip-bot for Phil Auld wrote:
> Commit-ID:  a46d14eca7b75fffe35603aa8b81df654353d80f
> Gitweb: 
> https://git.kernel.org/tip/a46d14eca7b75fffe35603aa8b81df654353d80f
> Author:     Phil Auld 
> AuthorDate: Thu, 1 Aug 2019 09:37:49 -0400
> Committer:  Thomas Gleixner 
> CommitDate: Mon, 12 Aug 2019 14:45:34 +0200
> 
> sched/fair: Use rq_lock/unlock in online_fair_sched_group
> 
> Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
> warning to fire in update_rq_clock. This seems to be caused by onlining
> a new fair sched group not using the rq lock wrappers.
> 
>   [] rq->clock_update_flags & RQCF_UPDATED
>   [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
> update_rq_clock+0xec/0x150
> 
>   [] Call Trace:
>   []  online_fair_sched_group+0x53/0x100
>   []  cpu_cgroup_css_online+0x16/0x20
>   []  online_css+0x1c/0x60
>   []  cgroup_apply_control_enable+0x231/0x3b0
>   []  cgroup_mkdir+0x41b/0x530
>   []  kernfs_iop_mkdir+0x61/0xa0
>   []  vfs_mkdir+0x108/0x1a0
>   []  do_mkdirat+0x77/0xe0
>   []  do_syscall_64+0x55/0x1d0
>   []  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> Using the wrappers in online_fair_sched_group instead of the raw locking
> removes this warning.
> 
> [ tglx: Use rq_*lock_irq() ]
> 
> Signed-off-by: Phil Auld 
> Signed-off-by: Peter Zijlstra (Intel) 
> Signed-off-by: Thomas Gleixner 
> Cc: Ingo Molnar 
> Cc: Vincent Guittot 
> Cc: Ingo Molnar 
> Link: https://lkml.kernel.org/r/20190801133749.11033-1-pa...@redhat.com
> ---
>  kernel/sched/fair.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 19c58599e967..1054d2cf6aaa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10281,18 +10281,18 @@ err:
>  void online_fair_sched_group(struct task_group *tg)
>  {
>   struct sched_entity *se;
> + struct rq_flags rf;
>   struct rq *rq;
>   int i;
>  
>   for_each_possible_cpu(i) {
>   rq = cpu_rq(i);
>   se = tg->se[i];
> -
> - raw_spin_lock_irq(&rq->lock);
> + rq_lock_irq(rq, &rf);
>   update_rq_clock(rq);
>   attach_entity_cfs_rq(se);
>   sync_throttle(tg, i);
> - raw_spin_unlock_irq(&rq->lock);
> + rq_unlock_irq(rq, &rf);
>   }
>  }
>  

Thanks Thomas!

-- 



[tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-12 Thread tip-bot for Phil Auld
Commit-ID:  a46d14eca7b75fffe35603aa8b81df654353d80f
Gitweb: https://git.kernel.org/tip/a46d14eca7b75fffe35603aa8b81df654353d80f
Author: Phil Auld 
AuthorDate: Thu, 1 Aug 2019 09:37:49 -0400
Committer:  Thomas Gleixner 
CommitDate: Mon, 12 Aug 2019 14:45:34 +0200

sched/fair: Use rq_lock/unlock in online_fair_sched_group

Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

[ tglx: Use rq_*lock_irq() ]

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pa...@redhat.com
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19c58599e967..1054d2cf6aaa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10281,18 +10281,18 @@ err:
 void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
+   struct rq_flags rf;
struct rq *rq;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock_irq(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock_irq(rq, &rf);
}
 }
 


Re: [tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-09 Thread Phil Auld
On Fri, Aug 09, 2019 at 06:21:22PM +0200 Dietmar Eggemann wrote:
> On 8/8/19 1:01 PM, tip-bot for Phil Auld wrote:
> 
> [...]
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 19c58599e967..d9407517dae9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10281,18 +10281,18 @@ err:
> >  void online_fair_sched_group(struct task_group *tg)
> >  {
> > struct sched_entity *se;
> > +   struct rq_flags rf;
> > struct rq *rq;
> > int i;
> >  
> > for_each_possible_cpu(i) {
> > rq = cpu_rq(i);
> > se = tg->se[i];
> > -
> > -   raw_spin_lock_irq(&rq->lock);
> > +   rq_lock(rq, &rf);
> > update_rq_clock(rq);
> > attach_entity_cfs_rq(se);
> > sync_throttle(tg, i);
> > -   raw_spin_unlock_irq(&rq->lock);
> > +   rq_unlock(rq, &rf);
> > }
> >  }
> 
> Shouldn't this be:
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d9407517dae9..1054d2cf6aaa 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10288,11 +10288,11 @@ void online_fair_sched_group(struct task_group
> *tg)
> for_each_possible_cpu(i) {
> rq = cpu_rq(i);
> se = tg->se[i];
> -   rq_lock(rq, &rf);
> +   rq_lock_irq(rq, &rf);
> update_rq_clock(rq);
> attach_entity_cfs_rq(se);
> sync_throttle(tg, i);
> -   rq_unlock(rq, &rf);
> +   rq_unlock_irq(rq, &rf);
> }
>  }
> 
> Currently, you should get a 'inconsistent lock state' warning with
> CONFIG_PROVE_LOCKING.

Yes, indeed. Sorry about that. Maybe it can be fixed in tip before 
it gets any farther?  Or do we need a new patch?


Cheers,
Phil

-- 


Re: [PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-09 Thread Phil Auld
On Tue, Aug 06, 2019 at 03:03:34PM +0200 Peter Zijlstra wrote:
> On Thu, Aug 01, 2019 at 09:37:49AM -0400, Phil Auld wrote:
> > Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
> 
> ISTR there were more issues; but it sure is good to start picking them
> off.
> 

Following up on this I hit another in rt.c which looks like:

[  156.348854] Call Trace:
[  156.351301]  <IRQ>
[  156.353322]  sched_rt_period_timer+0x124/0x350
[  156.357766]  ? sched_rt_rq_enqueue+0x90/0x90
[  156.362037]  __hrtimer_run_queues+0xfb/0x270
[  156.366303]  hrtimer_interrupt+0x122/0x270
[  156.370403]  smp_apic_timer_interrupt+0x6a/0x140
[  156.375022]  apic_timer_interrupt+0xf/0x20
[  156.379119]  </IRQ>

It looks like the same issue of not using the rq_lock* wrappers and
hence not using the pinning. From looking at the code there is at 
least one potential hit in deadline.c in the push_dl_task path with 
find_lock_later_rq but I have not hit that in practice.

This commit, which introduced the warning, seems to imply that the use
of the rq_lock* wrappers is required, at least for any sections that will
call update_rq_clock:

commit 26ae58d23b94a075ae724fd18783a3773131cfbc
Author: Peter Zijlstra 
Date:   Mon Oct 3 16:53:49 2016 +0200

sched/core: Add WARNING for multiple update_rq_clock() calls

Now that we have no missing calls, add a warning to find multiple
calls.

By having only a single update_rq_clock() call per rq-lock section,
the section appears 'atomic' wrt time.


Is that the case? Otherwise we have these false positives.

I can spin up patches if so. 


Thanks,
Phil


-- 


[tip:sched/core] sched/fair: Use rq_lock/unlock in online_fair_sched_group

2019-08-08 Thread tip-bot for Phil Auld
Commit-ID:  6b8fd01b21f5f2701b407a7118f236ba4c41226d
Gitweb: https://git.kernel.org/tip/6b8fd01b21f5f2701b407a7118f236ba4c41226d
Author: Phil Auld 
AuthorDate: Thu, 1 Aug 2019 09:37:49 -0400
Committer:  Peter Zijlstra 
CommitDate: Thu, 8 Aug 2019 09:09:31 +0200

sched/fair: Use rq_lock/unlock in online_fair_sched_group

Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

  [] rq->clock_update_flags & RQCF_UPDATED
  [] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
update_rq_clock+0xec/0x150

  [] Call Trace:
  []  online_fair_sched_group+0x53/0x100
  []  cpu_cgroup_css_online+0x16/0x20
  []  online_css+0x1c/0x60
  []  cgroup_apply_control_enable+0x231/0x3b0
  []  cgroup_mkdir+0x41b/0x530
  []  kernfs_iop_mkdir+0x61/0xa0
  []  vfs_mkdir+0x108/0x1a0
  []  do_mkdirat+0x77/0xe0
  []  do_syscall_64+0x55/0x1d0
  []  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking
removes this warning.

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Ingo Molnar 
Cc: Vincent Guittot 
Cc: Ingo Molnar 
Link: https://lkml.kernel.org/r/20190801133749.11033-1-pa...@redhat.com
---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 19c58599e967..d9407517dae9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10281,18 +10281,18 @@ err:
 void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
+   struct rq_flags rf;
struct rq *rq;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock(rq, &rf);
}
 }
 


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Phil Auld
On Tue, Aug 06, 2019 at 10:41:25PM +0800 Aaron Lu wrote:
> On 2019/8/6 22:17, Phil Auld wrote:
> > On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> >> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> >>> Hi,
> >>>
> >>> On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> >>>> We tested both Aaron's and Tim's patches and here are our results.
> >>>>
> >>>> Test setup:
> >>>> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> >>>>   mem benchmark
> >>>> - both started at the same time
> >>>> - both are pinned on the same core (2 hardware threads)
> >>>> - 10 30-seconds runs
> >>>> - test script: https://paste.debian.net/plainh/834cf45c
> >>>> - only showing the CPU events/sec (higher is better)
> >>>> - tested 4 tag configurations:
> >>>>   - no tag
> >>>>   - sysbench mem untagged, sysbench cpu tagged
> >>>>   - sysbench mem tagged, sysbench cpu untagged
> >>>>   - both tagged with a different tag
> >>>> - "Alone" is the sysbench CPU running alone on the core, no tag
> >>>> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> >>>> - "Tim's full patchset + sched" is an experiment with Tim's patchset
> >>>>   combined with Aaron's "hack patch" to get rid of the remaining deep
> >>>>   idle cases
> >>>> - In all test cases, both tasks can run simultaneously (which was not
> >>>>   the case without those patches), but the standard deviation is a
> >>>>   pretty good indicator of the fairness/consistency.
> >>>>
> >>>> No tag
> >>>> --
> >>>> TestAverage Stdev
> >>>> Alone   1306.90 0.94
> >>>> nosmt   649.95  1.44
> >>>> Aaron's full patchset:  828.15  32.45
> >>>> Aaron's first 2 patches:832.12  36.53
> >>>> Aaron's 3rd patch alone:864.21  3.68
> >>>> Tim's full patchset:852.50  4.11
> >>>> Tim's full patchset + sched:852.59  8.25
> >>>>
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> --
> >>>> TestAverage Stdev
> >>>> Alone   1306.90 0.94
> >>>> nosmt   649.95  1.44
> >>>> Aaron's full patchset:  586.06  1.77
> >>>> Aaron's first 2 patches:630.08  47.30
> >>>> Aaron's 3rd patch alone:1086.65 246.54
> >>>> Tim's full patchset:852.50  4.11
> >>>> Tim's full patchset + sched:390.49  15.76
> >>>>
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> --
> >>>> TestAverage Stdev
> >>>> Alone   1306.90 0.94
> >>>> nosmt   649.95  1.44
> >>>> Aaron's full patchset:  583.77  3.52
> >>>> Aaron's first 2 patches:513.63  63.09
> >>>> Aaron's 3rd patch alone:1171.23 3.35
> >>>> Tim's full patchset:564.04  58.05
> >>>> Tim's full patchset + sched:1026.16 49.43
> >>>>
> >>>> Both sysbench tagged
> >>>> 
> >>>> TestAverage Stdev
> >>>> Alone   1306.90 0.94
> >>>> nosmt   649.95  1.44
> >>>> Aaron's full patchset:  582.15  3.75
> >>>> Aaron's first 2 patches:561.07  91.61
> >>>> Aaron's 3rd patch alone:638.49  231.06
> >>>> Tim's full patchset:679.43  70.07
> >>>> Tim's full patchset + sched:664.34  210.14
> >>>>
> >>>
> >>> Sorry if I'm missing something obvious here but with only 2 processes 
> >>> of interest shouldn't one tagged and one untagged be about the same
> >>> as both tagged?  
> >>
> >> It should.
> >>
> >>> In both cases the 2 sysbenches should not be running on the core at 
> >>> the same time. 
> >>
> >> Agree.
> >>
> >>> There will be times when other non-related threads could share the core
> >>> with the untagged one. Is that enough to account for this difference?
> >>
> >> What difference do you mean?
> > 
> > 
> > I was looking at the above posted numbers. For example:
> > 
> >>>> Sysbench mem untagged, sysbench cpu tagged
> >>>> Aaron's 3rd patch alone:1086.65 246.54
> > 
> >>>> Sysbench mem tagged, sysbench cpu untagged
> >>>> Aaron's 3rd patch alone:1171.23 3.35
> > 
> >>>> Both sysbench tagged
> >>>> Aaron's 3rd patch alone:638.49  231.06
> > 
> > 
> > Admittedly, there's some high variance on some of those numbers. 
> 
> The high variance suggests the code having some fairness issues :-)
> 
> For the test here, I didn't expect the 3rd patch being used alone
> since the fairness is solved by patch2 and patch3 together.

Makes sense, thanks.


-- 


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-06 Thread Phil Auld
On Tue, Aug 06, 2019 at 09:54:01PM +0800 Aaron Lu wrote:
> On Mon, Aug 05, 2019 at 04:09:15PM -0400, Phil Auld wrote:
> > Hi,
> > 
> > On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> > > We tested both Aaron's and Tim's patches and here are our results.
> > > 
> > > Test setup:
> > > - 2 1-thread sysbench, one running the cpu benchmark, the other one the
> > >   mem benchmark
> > > - both started at the same time
> > > - both are pinned on the same core (2 hardware threads)
> > > - 10 30-seconds runs
> > > - test script: https://paste.debian.net/plainh/834cf45c
> > > - only showing the CPU events/sec (higher is better)
> > > - tested 4 tag configurations:
> > >   - no tag
> > >   - sysbench mem untagged, sysbench cpu tagged
> > >   - sysbench mem tagged, sysbench cpu untagged
> > >   - both tagged with a different tag
> > > - "Alone" is the sysbench CPU running alone on the core, no tag
> > > - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> > > - "Tim's full patchset + sched" is an experiment with Tim's patchset
> > >   combined with Aaron's "hack patch" to get rid of the remaining deep
> > >   idle cases
> > > - In all test cases, both tasks can run simultaneously (which was not
> > >   the case without those patches), but the standard deviation is a
> > >   pretty good indicator of the fairness/consistency.
> > > 
> > > No tag
> > > --
> > > TestAverage Stdev
> > > Alone   1306.90 0.94
> > > nosmt   649.95  1.44
> > > Aaron's full patchset:  828.15  32.45
> > > Aaron's first 2 patches:832.12  36.53
> > > Aaron's 3rd patch alone:864.21  3.68
> > > Tim's full patchset:852.50  4.11
> > > Tim's full patchset + sched:852.59  8.25
> > > 
> > > Sysbench mem untagged, sysbench cpu tagged
> > > --
> > > TestAverage Stdev
> > > Alone   1306.90 0.94
> > > nosmt   649.95  1.44
> > > Aaron's full patchset:  586.06  1.77
> > > Aaron's first 2 patches:630.08  47.30
> > > Aaron's 3rd patch alone:1086.65 246.54
> > > Tim's full patchset:852.50  4.11
> > > Tim's full patchset + sched:390.49  15.76
> > > 
> > > Sysbench mem tagged, sysbench cpu untagged
> > > --
> > > TestAverage Stdev
> > > Alone   1306.90 0.94
> > > nosmt   649.95  1.44
> > > Aaron's full patchset:  583.77  3.52
> > > Aaron's first 2 patches:513.63  63.09
> > > Aaron's 3rd patch alone:1171.23 3.35
> > > Tim's full patchset:564.04  58.05
> > > Tim's full patchset + sched:1026.16 49.43
> > > 
> > > Both sysbench tagged
> > > 
> > > TestAverage Stdev
> > > Alone   1306.90 0.94
> > > nosmt   649.95  1.44
> > > Aaron's full patchset:  582.15  3.75
> > > Aaron's first 2 patches:561.07  91.61
> > > Aaron's 3rd patch alone:638.49  231.06
> > > Tim's full patchset:679.43  70.07
> > > Tim's full patchset + sched:664.34  210.14
> > > 
> > 
> > Sorry if I'm missing something obvious here but with only 2 processes 
> > of interest shouldn't one tagged and one untagged be about the same
> > as both tagged?  
> 
> It should.
> 
> > In both cases the 2 sysbenches should not be running on the core at 
> > the same time. 
> 
> Agree.
> 
> > There will be times when other non-related threads could share the core
> > with the untagged one. Is that enough to account for this difference?
> 
> What difference do you mean?


I was looking at the above posted numbers. For example:

> > > Sysbench mem untagged, sysbench cpu tagged
> > > Aaron's 3rd patch alone:1086.65 246.54

> > > Sysbench mem tagged, sysbench cpu untagged
> > > Aaron's 3rd patch alone:1171.23 3.35

> > > Both sysbench tagged
> > > Aaron's 3rd patch alone:638.49  231.06


Admittedly, there's some high variance on some of those numbers. 


Cheers,
Phil

> 
> Thanks,
> Aaron
> 
> > > So in terms of fairness, Aaron's full patchset is the most consistent, 
> > > but only
> > > Tim's patchset performs better than nosmt in some conditions.
> > > 
> > > Of course, this is one of the worst case scenario, as soon as we have
> > > multithreaded applications on overcommitted systems, core scheduling 
> > > performs
> > > better than nosmt.
> > > 
> > > Thanks,
> > > 
> > > Julien
> > 
> > -- 

-- 


Re: [PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-06 Thread Phil Auld
On Tue, Aug 06, 2019 at 03:03:34PM +0200 Peter Zijlstra wrote:
> On Thu, Aug 01, 2019 at 09:37:49AM -0400, Phil Auld wrote:
> > Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
> 
> ISTR there were more issues; but it sure is good to start picking them
> off.

I haven't hit any others, but if/when I do I'll try to dig into them. 

> 
> > warning to fire in update_rq_clock. This seems to be caused by onlining
> > a new fair sched group not using the rq lock wrappers.
> > 
> > [472978.683085] rq->clock_update_flags & RQCF_UPDATED
> > [472978.683100] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
> > update_rq_clock+0xec/0x150
> 
> > Using the wrappers in online_fair_sched_group instead of the raw locking 
> > removes this warning. 
> 
> Yeah, that seems sane. Thanks!

Thanks,
Phil

-- 


Re: [PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-06 Thread Phil Auld
On Tue, Aug 06, 2019 at 02:04:16PM +0800 Hillf Danton wrote:
> 
> On Mon, 5 Aug 2019 22:07:05 +0800 Phil Auld wrote:
> >
> > If we're to clear that flag right there, outside of the lock pinning code,
> > then I think we might as well just remove the flag and all associated
> > comments etc, no?
> 
> A diff may tell the Peter folks more about your thoughts?
> 

I provided a diff with my thoughts of how to remove this warning in
the original post :)

This comment was about your patch which, to my mind, makes the flag 
meaningless, so we could just remove the whole thing. I was not 
proposing to actually do that. I assumed it was there because it was
thought to be useful. Although, if that is what people want I could 
certainly spin up a patch to that effect. 


Cheers,
Phil

> Hillf
> 

-- 


Re: [RFC PATCH v3 00/16] Core scheduling v3

2019-08-05 Thread Phil Auld
Hi,

On Fri, Aug 02, 2019 at 11:37:15AM -0400 Julien Desfossez wrote:
> We tested both Aaron's and Tim's patches and here are our results.
> 
> Test setup:
> - 2 1-thread sysbench, one running the cpu benchmark, the other one the
>   mem benchmark
> - both started at the same time
> - both are pinned on the same core (2 hardware threads)
> - 10 30-seconds runs
> - test script: https://paste.debian.net/plainh/834cf45c
> - only showing the CPU events/sec (higher is better)
> - tested 4 tag configurations:
>   - no tag
>   - sysbench mem untagged, sysbench cpu tagged
>   - sysbench mem tagged, sysbench cpu untagged
>   - both tagged with a different tag
> - "Alone" is the sysbench CPU running alone on the core, no tag
> - "nosmt" is both sysbench pinned on the same hardware thread, no tag
> - "Tim's full patchset + sched" is an experiment with Tim's patchset
>   combined with Aaron's "hack patch" to get rid of the remaining deep
>   idle cases
> - In all test cases, both tasks can run simultaneously (which was not
>   the case without those patches), but the standard deviation is a
>   pretty good indicator of the fairness/consistency.
> 
> No tag
> --
> TestAverage Stdev
> Alone   1306.90 0.94
> nosmt   649.95  1.44
> Aaron's full patchset:  828.15  32.45
> Aaron's first 2 patches:832.12  36.53
> Aaron's 3rd patch alone:864.21  3.68
> Tim's full patchset:852.50  4.11
> Tim's full patchset + sched:852.59  8.25
> 
> Sysbench mem untagged, sysbench cpu tagged
> --
> TestAverage Stdev
> Alone   1306.90 0.94
> nosmt   649.95  1.44
> Aaron's full patchset:  586.06  1.77
> Aaron's first 2 patches:630.08  47.30
> Aaron's 3rd patch alone:1086.65 246.54
> Tim's full patchset:852.50  4.11
> Tim's full patchset + sched:390.49  15.76
> 
> Sysbench mem tagged, sysbench cpu untagged
> --
> TestAverage Stdev
> Alone   1306.90 0.94
> nosmt   649.95  1.44
> Aaron's full patchset:  583.77  3.52
> Aaron's first 2 patches:513.63  63.09
> Aaron's 3rd patch alone:1171.23 3.35
> Tim's full patchset:564.04  58.05
> Tim's full patchset + sched:1026.16 49.43
> 
> Both sysbench tagged
> 
> TestAverage Stdev
> Alone   1306.90 0.94
> nosmt   649.95  1.44
> Aaron's full patchset:  582.15  3.75
> Aaron's first 2 patches:561.07  91.61
> Aaron's 3rd patch alone:638.49  231.06
> Tim's full patchset:679.43  70.07
> Tim's full patchset + sched:664.34  210.14
> 

Sorry if I'm missing something obvious here but with only 2 processes 
of interest shouldn't one tagged and one untagged be about the same
as both tagged?  

In both cases the 2 sysbenches should not be running on the core at 
the same time. 

There will be times when other non-related threads could share the core
with the untagged one. Is that enough to account for this difference?


Thanks,
Phil


> So in terms of fairness, Aaron's full patchset is the most consistent, but 
> only
> Tim's patchset performs better than nosmt in some conditions.
> 
> Of course, this is one of the worst case scenario, as soon as we have
> multithreaded applications on overcommitted systems, core scheduling performs
> better than nosmt.
> 
> Thanks,
> 
> Julien

-- 


Re: [PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-05 Thread Phil Auld
On Fri, Aug 02, 2019 at 05:20:38PM +0800 Hillf Danton wrote:
> 
> On Thu,  1 Aug 2019 09:37:49 -0400 Phil Auld wrote:
> >
> > Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
> > warning to fire in update_rq_clock. This seems to be caused by onlining
> > a new fair sched group not using the rq lock wrappers.
> > 
> > [472978.683085] rq->clock_update_flags & RQCF_UPDATED
> > [472978.683100] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
> > update_rq_clock+0xec/0x150
> 
> Another option perhaps only if that wrappers are not mandatory.
> 
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -212,10 +212,14 @@ void update_rq_clock(struct rq *rq)
>  #endif
>  
>   delta = sched_clock_cpu(cpu_of(rq)) - rq->clock;
> - if (delta < 0)
> - return;
> - rq->clock += delta;
> - update_rq_clock_task(rq, delta);
> + if (delta >= 0) {
> + rq->clock += delta;
> + update_rq_clock_task(rq, delta);
> + }
> +
> +#ifdef CONFIG_SCHED_DEBUG
> + rq->clock_update_flags &= ~RQCF_UPDATED;
> +#endif
>  }
>  
>  
> --
> 

I think that would silence the warning, but...

If we're to clear that flag right there, outside of the lock pinning code, 
then I think we might as well just remove the flag and all associated 
comments etc, no?


Cheers,
Phil

-- 


[PATCH] sched: use rq_lock/unlock in online_fair_sched_group

2019-08-01 Thread Phil Auld
Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
a warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

[472978.683085] rq->clock_update_flags & RQCF_UPDATED
[472978.683100] WARNING: CPU: 5 PID: 54385 at kernel/sched/core.c:210 
update_rq_clock+0xec/0x150
[472978.758465] Modules linked in: fuse vfat msdos fat ext4 mbcache jbd2 sunrpc 
iTCO_wdt gpio_ich iTCO_vendor_support intel_powerclamp coretemp kvm_intel kvm 
ipmi_ssif irqbypass ipmi_si crct10dif_pclmul crc32_pclmul ghash_clmulni_intel 
joydev pcspkr intel_cstate acpi_power_meter ipmi_devintf sg intel_uncore 
i7core_edac hpilo hpwdt lpc_ich ipmi_msghandler acpi_cpufreq ip_tables xfs 
libcrc32c sr_mod sd_mod cdrom ata_generic radeon i2c_algo_bit drm_kms_helper 
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix drm myri10ge hpsa 
libata serio_raw crc32c_intel scsi_transport_sas netxen_nic dca dm_mirror 
dm_region_hash dm_log dm_mod
[472979.050101] CPU: 5 PID: 54385 Comm: cgcreate Not tainted 5.2.0-rc6+ #135
[472979.089586] Hardware name: HP ProLiant DL580 G7, BIOS P65 08/16/2015
[472979.123638] RIP: 0010:update_rq_clock+0xec/0x150
[472979.146435] Code: a8 04 0f 84 55 ff ff ff 80 3d 93 34 2c 01 00 0f 85 48 ff 
ff ff 48 c7 c7 08 b9 a8 b7 31 c0 c6 05 7d 34 2c 01 01 e8 04 21 fd ff <0f> 0b 8b 
83 b0 09 00 00 e9 26 ff ff ff e8 72 c3 f5 ff 66 90 4c 8b
[472979.247671] RSP: 0018:a9addac67d48 EFLAGS: 00010086
[472979.277071] RAX:  RBX: 9db3ff62a380 RCX: 

[472979.314842] RDX: 0005 RSI: b8434a45 RDI: 
b84325ec
[472979.352189] RBP:  R08: b8434a20 R09: 
ffb27562d751
[472979.389326] R10: d471 R11: a9addac67a58 R12: 
9d8a1c9d5a00
[472979.425255] R13: 9db3fbbed400 R14: 0002a380 R15: 

[472979.462417] FS:  7f6a8218cb80() GS:9db3ff74() 
knlGS:
[472979.511306] CS:  0010 DS:  ES:  CR0: 80050033
[472979.543386] CR2: 7f6a82198000 CR3: 003f33f16005 CR4: 
000206e0
[472979.578646] Call Trace:
[472979.591702]  online_fair_sched_group+0x53/0x100
[472979.618172]  cpu_cgroup_css_online+0x16/0x20
[472979.640264]  online_css+0x1c/0x60
[472979.657648]  cgroup_apply_control_enable+0x231/0x3b0
[472979.684979]  cgroup_mkdir+0x41b/0x530
[472979.704845]  kernfs_iop_mkdir+0x61/0xa0
[472979.726278]  vfs_mkdir+0x108/0x1a0
[472979.745665]  do_mkdirat+0x77/0xe0
[472979.764981]  do_syscall_64+0x55/0x1d0
[472979.785990]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Using the wrappers in online_fair_sched_group instead of the raw locking 
removes this warning. 

Signed-off-by: Phil Auld 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Vincent Guittot 
---
 Resend with PATCH instead of CHANGE in subject, and more recent upstream x86 
backtrace.

 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..5c1299a5675c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10242,17 +10242,17 @@ void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
struct rq *rq;
+   struct rq_flags rf;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock(rq, &rf);
}
 }
 
-- 
2.18.0



Re: [RFC][PATCH 02/13] stop_machine: Fix stop_cpus_in_progress ordering

2019-07-30 Thread Phil Auld
On Fri, Jul 26, 2019 at 04:54:11PM +0200 Peter Zijlstra wrote:
> Make sure the entire for loop has stop_cpus_in_progress set.
> 
> Cc: Valentin Schneider 
> Cc: Aaron Lu 
> Cc: keesc...@chromium.org
> Cc: mi...@kernel.org
> Cc: Pawan Gupta 
> Cc: Phil Auld 
> Cc: torva...@linux-foundation.org
> Cc: Tim Chen 
> Cc: fweis...@gmail.com
> Cc: subhra.mazum...@oracle.com
> Cc: t...@linutronix.de
> Cc: Julien Desfossez 
> Cc: p...@google.com
> Cc: Nishanth Aravamudan 
> Cc: Aubrey Li 
> Cc: Mel Gorman 
> Cc: kerr...@google.com
> Cc: Paolo Bonzini 
> Signed-off-by: Peter Zijlstra (Intel) 
> Link: 
> https://lkml.kernel.org/r/0fd8fd4b99b9b9aa88d8b2dff897f7fd0d88f72c.1559129225.git.vpil...@digitalocean.com
> ---
>  kernel/stop_machine.c |2 ++
>  1 file changed, 2 insertions(+)
> 
> --- a/kernel/stop_machine.c
> +++ b/kernel/stop_machine.c
> @@ -383,6 +383,7 @@ static bool queue_stop_cpus_work(const s
>*/
>   preempt_disable();
>   stop_cpus_in_progress = true;
> + barrier();
>   for_each_cpu(cpu, cpumask) {
>   work = &per_cpu(cpu_stopper.stop_work, cpu);
>   work->fn = fn;
> @@ -391,6 +392,7 @@ static bool queue_stop_cpus_work(const s
>   if (cpu_stop_queue_work(cpu, work))
>   queued = true;
>   }
> +     barrier();
>   stop_cpus_in_progress = false;
>   preempt_enable();
>  
> 
> 

This looks good.
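
(For context, not part of the patch: barrier() is the generic compiler
barrier, roughly

	#define barrier() __asm__ __volatile__("" : : : "memory")

so the two additions keep the compiler from moving the stop_cpus_in_progress
stores across the loop body.)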

Reviewed-by: Phil Auld 


-- 


[CHANGE] sched: use rq_lock/unlock in online_fair_sched_group

2019-07-26 Thread Phil Auld
Enabling WARN_DOUBLE_CLOCK in /sys/kernel/debug/sched_features causes
a warning to fire in update_rq_clock. This seems to be caused by onlining
a new fair sched group not using the rq lock wrappers.

[  612.379993] rq->clock_update_flags & RQCF_UPDATED
[  612.380007]  WARNING: CPU: 6 PID: 21426 at kernel/sched/core.c:225 
update_rq_clock+0x90/0x130
[  612.393082] Modules linked in: binfmt_misc rpcsec_gss_krb5 auth_rpcgss nfsv4 
dns_resolver nfs lockd grace fscache sunrpc vfat fat sg xgene_hwmon 
gpio_xgene_sb gpio_dwapb gpio_generic xgene_edac mailbox_xgene_slimpro 
xgene_rng uio_pdrv_genirq uio sch_fq_codel xfs libcrc32c xgene_enet 
i2c_xgene_slimpro at803x realtek ahci_xgene libahci_platform mdio_xgene 
dm_mirror dm_region_hash dm_log dm_mod
[  612.427615] CPU: 6 PID: 21426 Comm: (dnf) Not tainted 
4.16.0-10.el8+5.aarch64 #1
[  612.434973] Hardware name: AppliedMicro X-Gene Mustang Board/X-Gene Mustang 
Board, BIOS 3.06.25 Oct 17 2016
[  612.444667] pstate: 6085 (nZCv daIf -PAN -UAO)
[  612.449434] pc : update_rq_clock+0x90/0x130
[  612.453595] lr : update_rq_clock+0x90/0x130
[  612.457754] sp : 0efefd60
[  612.461050] x29: 0efefd60 x28: 8003c23d5400
[  612.466335] x27: 8003ca119c00 x26: 090bc000
[  612.471620] x25: 090bc090 x24: 090b3c68
[  612.476905] x23: 8003cefe1500 x22: 08dbd280
[  612.482192] x21:  x20: 
[  612.487478] x19: 8003ffddd280 x18: 
[  612.492763] x17:  x16: 
[  612.498049] x15: 090b3c08 x14: 897a5a17
[  612.503334] x13: 097a5a25 x12: 090ff000
[  612.508620] x11: 090bbc90 x10: 08548a78
[  612.513906] x9 : ffd0 x8 : 50555f4643515220
[  612.519191] x7 : 26207367616c665f x6 : 01cd
[  612.524477] x5 : 00ff x4 : 
[  612.529761] x3 :  x2 : 
[  612.535047] x1 : 1f9a6a58385a9900 x0 : 
[  612.540333] Call trace:
[  612.542767]  update_rq_clock+0x90/0x130
[  612.546585]  online_fair_sched_group+0x70/0x140
[  612.551092]  sched_online_group+0xd0/0xf0
[  612.555082]  sched_autogroup_create_attach+0xd0/0x198
[  612.560108]  sys_setsid+0x140/0x160
[  612.563579]  el0_svc_naked+0x44/0x48

Signed-off-by: Phil Auld 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Vincent Guittot 

---
 kernel/sched/fair.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 036be95a87e9..5c1299a5675c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10242,17 +10242,17 @@ void online_fair_sched_group(struct task_group *tg)
 {
struct sched_entity *se;
struct rq *rq;
+   struct rq_flags rf;
int i;
 
for_each_possible_cpu(i) {
rq = cpu_rq(i);
se = tg->se[i];
-
-   raw_spin_lock_irq(&rq->lock);
+   rq_lock(rq, &rf);
update_rq_clock(rq);
attach_entity_cfs_rq(se);
sync_throttle(tg, i);
-   raw_spin_unlock_irq(&rq->lock);
+   rq_unlock(rq, &rf);
}
 }
 
-- 
2.18.0



Re: [RESEND PATCH v3] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-06-12 Thread Phil Auld
On Wed, Jun 12, 2019 at 11:50:48AM -0400 Joel Savitz wrote:
> In the case that a process is constrained by taskset(1) (i.e.
> sched_setaffinity(2)) to a subset of available cpus, and all of those are
> subsequently offlined, the scheduler will set tsk->cpus_allowed to
> the current value of task_cs(tsk)->effective_cpus.
> 
> This is done via a call to do_set_cpus_allowed() in the context of 
> cpuset_cpus_allowed_fallback() made by the scheduler when this case is
> detected. This is the only call made to cpuset_cpus_allowed_fallback()
> in the latest mainline kernel.
> 
> However, this is not sane behavior.
> 
> I will demonstrate this on a system running the latest upstream kernel
> with the following initial configuration:
> 
>   # grep -i cpu /proc/$$/status
>   Cpus_allowed:   ,fff
>   Cpus_allowed_list:  0-63
> 
> (Where cpus 32-63 are provided via smt.)
> 
> If we limit our current shell process to cpu2 only and then offline it
> and reonline it:
> 
>   # taskset -p 4 $$
>   pid 2272's current affinity mask: 
>   pid 2272's new affinity mask: 4
> 
>   # echo off > /sys/devices/system/cpu/cpu2/online
>   # dmesg | tail -3
>   [ 2195.866089] process 2272 (bash) no longer affine to cpu2
>   [ 2195.872700] IRQ 114: no longer affine to CPU2
>   [ 2195.879128] smpboot: CPU 2 is now offline
> 
>   # echo on > /sys/devices/system/cpu/cpu2/online
>   # dmesg | tail -1
>   [ 2617.043572] smpboot: Booting Node 0 Processor 2 APIC 0x4
> 
> 
> We see that our current process now has an affinity mask containing
> every cpu available on the system _except_ the one we originally
> constrained it to:
> 
>   # grep -i cpu /proc/$$/status
>   Cpus_allowed:   ,fffb
>   Cpus_allowed_list:  0-1,3-63 
> 
> This is not sane behavior, as the scheduler can now not only place the
> process on previously forbidden cpus, it can't even schedule it on
> the cpu it was originally constrained to!
> 
> Other cases result in even more exotic affinity masks. Take for instance
> a process with an affinity mask containing only cpus provided by smt at
> the moment that smt is toggled, in a configuration such as the following:
> 
>   # taskset -p f0 $$
>   # grep -i cpu /proc/$$/status
>   Cpus_allowed:   00f0,
>   Cpus_allowed_list:  36-39
> 
> A double toggle of smt results in the following behavior:
> 
>   # echo off > /sys/devices/system/cpu/smt/control
>   # echo on > /sys/devices/system/cpu/smt/control
>   # grep -i cpus /proc/$$/status
>   Cpus_allowed:   ff00,
>   Cpus_allowed_list:  0-31,40-63
> 
> This is even less sane than the previous case, as the new affinity mask
> excludes all smt-provided cpus with ids less than those that were
> previously in the affinity mask, as well as those that were actually in
> the mask.
> 
> With this patch applied, both of these cases end in the following state:
> 
>   # grep -i cpu /proc/$$/status
>   Cpus_allowed:   ,
>   Cpus_allowed_list:  0-63
> 
> The original policy is discarded. Though not ideal, it is the simplest way
> to restore sanity to this fallback case without reinventing the cpuset
> wheel that rolls down the kernel just fine in cgroup v2. A user who wishes
> for the previous affinity mask to be restored in this fallback case can use
> that mechanism instead.
> 
> This patch modifies scheduler behavior by instead resetting the mask to
> task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
> mode. I tested the cases above on both modes.
> 
> Note that the scheduler uses this fallback mechanism if and only if
> _every_ other valid avenue has been traveled, and it is the last resort
> before calling BUG().
> 
> Suggested-by: Waiman Long 
> Suggested-by: Phil Auld 
> Signed-off-by: Joel Savitz 
> ---
>  kernel/cgroup/cpuset.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 6a1942ed781c..515525ff1cfd 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3254,10 +3254,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> struct cpumask *pmask)
>   spin_unlock_irqrestore(&callback_lock, flags);
>  }
>  
> +/**
> + * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
> + * @tsk: pointer to task_struct with which the scheduler is struggling
> + *
> + * Description: In the case that the scheduler cannot find an allowed cpu in
> + * tsk->cpus_

Re: [PATCH v2] sched/fair: don't push cfs_bandwith slack timers forward

2019-06-11 Thread Phil Auld
On Tue, Jun 11, 2019 at 04:24:43PM +0200 Peter Zijlstra wrote:
> On Tue, Jun 11, 2019 at 10:12:19AM -0400, Phil Auld wrote:
> 
> > That looks reasonable to me. 
> > 
> > Out of curiosity, why not bool? Is sizeof bool architecture dependent?
> 
> Yeah, sizeof(_Bool) is unspecified and depends on ABI. It is mostly 1,
> but there are known cases where it is 4.

Makes sense. Thanks!
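
As a throwaway user-space illustration (not kernel code) of why the
fixed-width type is the safer choice:

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		/* _Bool's size is ABI/implementation defined; uint8_t is always 1 byte. */
		printf("sizeof(_Bool)   = %zu\n", sizeof(_Bool));
		printf("sizeof(uint8_t) = %zu\n", sizeof(uint8_t));
		return 0;
	}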

-- 


Re: [PATCH v2] sched/fair: don't push cfs_bandwith slack timers forward

2019-06-11 Thread Phil Auld
On Tue, Jun 11, 2019 at 03:53:25PM +0200 Peter Zijlstra wrote:
> On Thu, Jun 06, 2019 at 10:21:01AM -0700, bseg...@google.com wrote:
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index efa686eeff26..60219acda94b 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -356,6 +356,7 @@ struct cfs_bandwidth {
> > u64 throttled_time;
> >  
> > booldistribute_running;
> > +   boolslack_started;
> >  #endif
> >  };
> 
> I'm thinking we can this instead? afaict both idle and period_active are
> already effecitively booleans and don't need the full 16 bits.
> 
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -338,8 +338,10 @@ struct cfs_bandwidth {
>   u64 runtime_expires;
>   int expires_seq;
>  
> - short   idle;
> - short   period_active;
> + u8  idle;
> + u8  period_active;
> + u8  distribute_running;
> + u8  slack_started;
>   struct hrtimer  period_timer;
>   struct hrtimer  slack_timer;
>   struct list_headthrottled_cfs_rq;
> @@ -348,9 +350,6 @@ struct cfs_bandwidth {
>   int nr_periods;
>   int nr_throttled;
>   u64 throttled_time;
> -
> - booldistribute_running;
> - boolslack_started;
>  #endif
>  };
>  


That looks reasonable to me. 

Out of curiosity, why not bool? Is sizeof bool architecture dependent?

-- 


Re: [PATCH v2] sched/fair: don't push cfs_bandwith slack timers forward

2019-06-11 Thread Phil Auld
On Thu, Jun 06, 2019 at 10:21:01AM -0700 bseg...@google.com wrote:
> When a cfs_rq sleeps and returns its quota, we delay for 5ms before
> waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
> sleep, as this has to be done outside of the rq lock we hold.
> 
> The current code waits for 5ms without any sleeps, instead of waiting
> for 5ms from the first sleep, which can delay the unthrottle more than
> we want. Switch this around so that we can't push this forward forever.
> 
> This requires an extra flag rather than using hrtimer_active, since we
> need to start a new timer if the current one is in the process of
> finishing.
> 
> Signed-off-by: Ben Segall 
> Reviewed-by: Xunlei Pang 
> ---
>  kernel/sched/fair.c  | 7 +++
>  kernel/sched/sched.h | 1 +
>  2 files changed, 8 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8213ff6e365d..2ead252cfa32 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4729,6 +4729,11 @@ static void start_cfs_slack_bandwidth(struct 
> cfs_bandwidth *cfs_b)
>   if (runtime_refresh_within(cfs_b, min_left))
>   return;
>  
> + /* don't push forwards an existing deferred unthrottle */
> + if (cfs_b->slack_started)
> + return;
> + cfs_b->slack_started = true;
> +
>   hrtimer_start(&cfs_b->slack_timer,
>   ns_to_ktime(cfs_bandwidth_slack_period),
>   HRTIMER_MODE_REL);
> @@ -4782,6 +4787,7 @@ static void do_sched_cfs_slack_timer(struct 
> cfs_bandwidth *cfs_b)
>  
>   /* confirm we're still not at a refresh boundary */
>   raw_spin_lock_irqsave(&cfs_b->lock, flags);
> + cfs_b->slack_started = false;
>   if (cfs_b->distribute_running) {
>   raw_spin_unlock_irqrestore(&cfs_b->lock, flags);
>   return;
> @@ -4920,6 +4926,7 @@ void init_cfs_bandwidth(struct cfs_bandwidth *cfs_b)
>   hrtimer_init(&cfs_b->slack_timer, CLOCK_MONOTONIC, HRTIMER_MODE_REL);
>   cfs_b->slack_timer.function = sched_cfs_slack_timer;
>   cfs_b->distribute_running = 0;
> + cfs_b->slack_started = false;
>  }
>  
>  static void init_cfs_rq_runtime(struct cfs_rq *cfs_rq)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index efa686eeff26..60219acda94b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -356,6 +356,7 @@ struct cfs_bandwidth {
>   u64 throttled_time;
>  
>   booldistribute_running;
> + boolslack_started;
>  #endif
>  };
>  
> -- 
> 2.22.0.rc1.257.g3120a18244-goog
> 

I think this looks good. I like not delaying that further even if it
does not fix Dave's use case. 

It does make it glaring that I should have used false/true for setting
distribute_running though :)


Acked-by: Phil Auld 

-- 


Re: [PATCH v2 1/1] sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices

2019-05-24 Thread Phil Auld
On Fri, May 24, 2019 at 10:14:36AM -0500 Dave Chiluk wrote:
> On Fri, May 24, 2019 at 9:32 AM Phil Auld  wrote:
> > On Thu, May 23, 2019 at 02:01:58PM -0700 Peter Oskolkov wrote:
> 
> > > If the machine runs at/close to capacity, won't the overallocation
> > > of the quota to bursty tasks necessarily negatively impact every other
> > > task? Should the "unused" quota be available only on idle CPUs?
> > > (Or maybe this is the behavior achieved here, and only the comment and
> > > the commit message should be fixed...)
> > >
> >
> > It's bounded by the amount left unused from the previous period. So
> > theoretically a process could use almost twice its quota. But then it
> > would have nothing left over in the next period. To repeat it would have
> > to not use any that next period. Over a longer number of periods it's the
> > same amount of CPU usage.
> >
> > I think that is more fair than throttling a process that has never used
> > its full quota.
> >
> > And it removes complexity.
> >
> > Cheers,
> > Phil
> 
> Actually it's not even that bad.  The overallocation of quota to a
> bursty task in a period is limited to at most one slice per cpu, and
> that slice must not have been used in the previous periods.  The slice
> size is set with /proc/sys/kernel/sched_cfs_bandwidth_slice_us and
> defaults to 5ms.  If a bursty task goes from underutilizing quota to
> using its entire quota, it will not be able to burst in the
> subsequent periods.  Therefore in an absolute worst case contrived
> scenario, a bursty task can add at most 5ms to the latency of other
> threads on the same CPU.  I think this worst case 5ms tradeoff is
> entirely worth it.
> 
> This does mean that theoretically a poorly written massively
> threaded application on an 80 core box, that spreads itself onto 80
> cpu run queues, can overutilize its quota in a period by at most 5ms
> * 80 CPUs in a single period (slice * number of runqueues the
> application is running on).  But that means that each of those threads
> would have had to not use their quota in a previous period, and it
> also means that the application would have to be carefully written to
> exacerbate this behavior.
> 
> Additionally if cpu bound threads underutilize a slice of their quota
> in a period due to the cfs choosing a bursty task to run, they should
> theoretically be able to make it up in the following periods when the
> bursty task is unable to "burst".
> 
> Please be careful here quota and slice are being treated differently.
> Quota does not roll-over between periods, only slices of quota that
> has already been allocated to per cpu run queues. If you allocate
> 100ms of quota per period to an application, but it only spreads onto
> 3 cpu run queues that means it can in the worst case use 3 x slice
> size = 15ms in periods following underutilization.
> 
> So why does this matter.  Well applications that use thread pools
> *(*cough* java *cough*) with lots of tiny little worker threads, tend
> to spread themselves out onto a lot of run queues.  These worker
> threads grab quota slices in order to run, then rarely use all of
> their slice (1 or 2ms out of the 5ms).  This results in those worker
> threads starving the main application of quota, and then expiring the
> remainder of that quota slice on the per-cpu.  Going back to my
> earlier 100ms quota / 80 cpu example.  That means only
> 100ms/cfs_bandwidth_slice_us(5ms) = 20 slices are available in a
> period.  So only 20 out of these 80 cpus ever get a slice allocated to
> them.  By allowing these per-cpu run queues to use their remaining
> slice in following periods these worker threads do not need to be
> allocated additional slice, and thereby the main threads are actually
> able to use the allocated cpu quota.
> 
> This can be experienced by running fibtest available at
> https://github.com/indeedeng/fibtest/.
> $ runfibtest 1
> runs a single fast thread taskset to cpu 0
> $ runfibtest 8
> Runs a single fast thread taskset to cpu 0, and 7 slow threads taskset
> to cpus 1-7.  This run is expected to show less iterations, but the
> worse problem is that the cpu usage is far less than the 500ms that it
> should have received.
> 
> Thanks for the engagement on this,
> Dave Chiluk

Thanks for the clarification. This is an even better explanation. 

Fwiw, I ran some of my cfs throttling tests with this, none of which are
designed to measure or hit this particular issue. They are more focused
on starvation and hard lockups that I've hit. But I did not see any issues
or oddities with this patch. 
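
As a back-of-the-envelope check on the worst case Dave describes above (one
previously unused slice per runqueue the application touches), here is a
throwaway sketch; the helper name is made up and this is not kernel code:

	#include <stdio.h>

	/* Worst-case extra runtime in one period: one cached, unused slice
	 * per runqueue the application is spread across. */
	static unsigned long long worst_case_burst_us(unsigned int nr_rqs,
						      unsigned long long slice_us)
	{
		return (unsigned long long)nr_rqs * slice_us;
	}

	int main(void)
	{
		/* default slice is 5ms (5000us) */
		printf("3 rqs:  %llu us\n", worst_case_burst_us(3, 5000));	/* 15000 */
		printf("80 rqs: %llu us\n", worst_case_burst_us(80, 5000));	/* 400000 */
		return 0;
	}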

Cheers,
Phil

-- 


Re: [RFC PATCH v2 13/17] sched: Add core wide task selection and scheduling.

2019-05-20 Thread Phil Auld
On Sat, May 18, 2019 at 11:37:56PM +0800 Aubrey Li wrote:
> On Wed, Apr 24, 2019 at 12:18 AM Vineeth Remanan Pillai
>  wrote:
> >
> > From: Peter Zijlstra (Intel) 
> >
> > Instead of only selecting a local task, select a task for all SMT
> > siblings for every reschedule on the core (irrespective which logical
> > CPU does the reschedule).
> >
> > NOTE: there is still potential for siblings rivalry.
> > NOTE: this is far too complicated; but thus far I've failed to
> >   simplify it further.
> >
> > Signed-off-by: Peter Zijlstra (Intel) 
> > ---
> >  kernel/sched/core.c  | 222 ++-
> >  kernel/sched/sched.h |   5 +-
> >  2 files changed, 224 insertions(+), 3 deletions(-)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e5bdc1c4d8d7..9e6e90c6f9b9 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -3574,7 +3574,7 @@ static inline void schedule_debug(struct task_struct 
> > *prev)
> >   * Pick up the highest-prio task:
> >   */
> >  static inline struct task_struct *
> > -pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags 
> > *rf)
> > +__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags 
> > *rf)
> >  {
> > const struct sched_class *class;
> > struct task_struct *p;
> > @@ -3619,6 +3619,220 @@ pick_next_task(struct rq *rq, struct task_struct 
> > *prev, struct rq_flags *rf)
> > BUG();
> >  }
> >
> > +#ifdef CONFIG_SCHED_CORE
> > +
> > +static inline bool cookie_match(struct task_struct *a, struct task_struct 
> > *b)
> > +{
> > +   if (is_idle_task(a) || is_idle_task(b))
> > +   return true;
> > +
> > +   return a->core_cookie == b->core_cookie;
> > +}
> > +
> > +// XXX fairness/fwd progress conditions
> > +static struct task_struct *
> > +pick_task(struct rq *rq, const struct sched_class *class, struct 
> > task_struct *max)
> > +{
> > +   struct task_struct *class_pick, *cookie_pick;
> > +   unsigned long cookie = 0UL;
> > +
> > +   /*
> > +* We must not rely on rq->core->core_cookie here, because we fail 
> > to reset
> > +* rq->core->core_cookie on new picks, such that we can detect if 
> > we need
> > +* to do single vs multi rq task selection.
> > +*/
> > +
> > +   if (max && max->core_cookie) {
> > +   WARN_ON_ONCE(rq->core->core_cookie != max->core_cookie);
> > +   cookie = max->core_cookie;
> > +   }
> > +
> > +   class_pick = class->pick_task(rq);
> > +   if (!cookie)
> > +   return class_pick;
> > +
> > +   cookie_pick = sched_core_find(rq, cookie);
> > +   if (!class_pick)
> > +   return cookie_pick;
> > +
> > +   /*
> > +* If class > max && class > cookie, it is the highest priority 
> > task on
> > +* the core (so far) and it must be selected, otherwise we must go 
> > with
> > +* the cookie pick in order to satisfy the constraint.
> > +*/
> > +   if (cpu_prio_less(cookie_pick, class_pick) && core_prio_less(max, 
> > class_pick))
> > +   return class_pick;
> > +
> > +   return cookie_pick;
> > +}
> > +
> > +static struct task_struct *
> > +pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags 
> > *rf)
> > +{
> > +   struct task_struct *next, *max = NULL;
> > +   const struct sched_class *class;
> > +   const struct cpumask *smt_mask;
> > +   int i, j, cpu;
> > +
> > +   if (!sched_core_enabled(rq))
> > +   return __pick_next_task(rq, prev, rf);
> > +
> > +   /*
> > +* If there were no {en,de}queues since we picked (IOW, the task
> > +* pointers are all still valid), and we haven't scheduled the last
> > +* pick yet, do so now.
> > +*/
> > +   if (rq->core->core_pick_seq == rq->core->core_task_seq &&
> > +   rq->core->core_pick_seq != rq->core_sched_seq) {
> > +   WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
> > +
> > +   next = rq->core_pick;
> > +   if (next != prev) {
> > +   put_prev_task(rq, prev);
> > +   set_next_task(rq, next);
> > +   }
> > +   return next;
> > +   }
> > +
> 
> The following patch improved my test cases.
> Welcome any comments.
> 

This is certainly better than violating the point of the core scheduler :)

If I'm understanding this right, what will happen in this case is instead
of using the idle process selected by the sibling we do the core scheduling
again. This may start with a newidle_balance which might bring over something
to run that matches what we want to put on the sibling. If that works then I 
can see this helping. 

But I'd be a little concerned that we could end up thrashing. Once we do core 
scheduling again here we'd force the sibling to resched, and if we got a
different result which "helped" him 

Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-29 Thread Phil Auld
On Mon, Apr 29, 2019 at 09:25:35PM +0800 Li, Aubrey wrote:
> On 2019/4/29 14:14, Ingo Molnar wrote:
> > 
> > * Li, Aubrey  wrote:
> > 
> >>> I suspect it's pretty low, below 1% for all rows?
> >>
> >> Hope my this mail box works for this...
> >>
> >> .-.
> >> |NA/AVX vanilla-SMT [std% / sem%] | coresched-SMT   [std% / sem%] 
> >> +/- |  no-SMT [std% / sem%]+/-  |
> >> |-|
> >> |  1/1508.5 [ 0.2%/ 0.0%] | 504.7   [ 1.1%/ 0.1%]
> >> -0.8%|   509.0 [ 0.2%/ 0.0%]0.1% |
> >> |  2/2   1000.2 [ 1.4%/ 0.1%] |1004.1   [ 1.6%/ 0.2%] 
> >> 0.4%|   997.6 [ 1.2%/ 0.1%]   -0.3% |
> >> |  4/4   1912.1 [ 1.0%/ 0.1%] |1904.2   [ 1.1%/ 0.1%]
> >> -0.4%|  1914.9 [ 1.3%/ 0.1%]0.1% |
> >> |  8/8   3753.5 [ 0.3%/ 0.0%] |3748.2   [ 0.3%/ 0.0%]
> >> -0.1%|  3751.3 [ 0.4%/ 0.0%]   -0.1% |
> >> | 16/16  7139.3 [ 2.4%/ 0.2%] |7137.9   [ 1.8%/ 0.2%]
> >> -0.0%|  7049.2 [ 2.4%/ 0.2%]   -1.3% |
> >> | 32/32 10899.0 [ 4.2%/ 0.4%] |   10780.3   [ 4.4%/ 0.4%]
> >> -1.1%| 10339.2 [ 9.6%/ 0.9%]   -5.1% |
> >> | 64/64 15086.1 [11.5%/ 1.2%] |   14262.0   [ 8.2%/ 0.8%]
> >> -5.5%| 11168.7 [22.2%/ 1.7%]  -26.0% |
> >> |128/12815371.9 [22.0%/ 2.2%] |   14675.8   [14.4%/ 1.4%]
> >> -4.5%| 10963.9 [18.5%/ 1.4%]  -28.7% |
> >> |256/25615990.8 [22.0%/ 2.2%] |   12227.9   [10.3%/ 1.0%]   
> >> -23.5%| 10469.9 [19.6%/ 1.7%]  -34.5% |
> >> '-'
> > 
> > Perfectly presented, thank you very much!
> 
> My pleasure! ;-)
> 
> > 
> > My final questin would be about the environment:
> > 
> >> Skylake server, 2 numa nodes, 104 CPUs (HT on)
> > 
> > Is the typical nr_running value the sum of 'NA+AVX', i.e. is it ~256 
> > threads for the 128/128 row for example - or is it 128 parallel tasks?
> 
> That means 128 sysbench threads and 128 gemmbench tasks, so 256 threads in 
> sum.
> > 
> > I.e. showing the approximate CPU thread-load figure column would be very 
> > useful too, where '50%' shows half-loaded, '100%' fully-loaded, '200%' 
> > over-saturated, etc. - for each row?
> 
> See below, hope this helps.
> .--.
> |NA/AVX vanilla-SMT [std% / sem%] cpu% |coresched-SMT   [std% / sem%] 
> +/- cpu% |  no-SMT [std% / sem%]   +/-  cpu% |
> |--|
> |  1/1508.5 [ 0.2%/ 0.0%] 2.1% |504.7   [ 1.1%/ 0.1%] 
>-0.8%2.1% |   509.0 [ 0.2%/ 0.0%]   0.1% 4.3% |
> |  2/2   1000.2 [ 1.4%/ 0.1%] 4.1% |   1004.1   [ 1.6%/ 0.2%] 
> 0.4%4.1% |   997.6 [ 1.2%/ 0.1%]  -0.3% 8.1% |
> |  4/4   1912.1 [ 1.0%/ 0.1%] 7.9% |   1904.2   [ 1.1%/ 0.1%] 
>-0.4%7.9% |  1914.9 [ 1.3%/ 0.1%]   0.1%15.1% |
> |  8/8   3753.5 [ 0.3%/ 0.0%]14.9% |   3748.2   [ 0.3%/ 0.0%] 
>-0.1%   14.9% |  3751.3 [ 0.4%/ 0.0%]  -0.1%30.5% |
> | 16/16  7139.3 [ 2.4%/ 0.2%]30.3% |   7137.9   [ 1.8%/ 0.2%] 
>-0.0%   30.3% |  7049.2 [ 2.4%/ 0.2%]  -1.3%60.4% |
> | 32/32 10899.0 [ 4.2%/ 0.4%]60.3% |  10780.3   [ 4.4%/ 0.4%] 
>-1.1%   55.9% | 10339.2 [ 9.6%/ 0.9%]  -5.1%97.7% |
> | 64/64 15086.1 [11.5%/ 1.2%]97.7% |  14262.0   [ 8.2%/ 0.8%] 
>-5.5%   82.0% | 11168.7 [22.2%/ 1.7%] -26.0%   100.0% |
> |128/12815371.9 [22.0%/ 2.2%]   100.0% |  14675.8   [14.4%/ 1.4%] 
>-4.5%   82.8% | 10963.9 [18.5%/ 1.4%] -28.7%   100.0% |
> |256/25615990.8 [22.0%/ 2.2%]   100.0% |  12227.9   [10.3%/ 1.0%] 
>   -23.5%   73.2% | 10469.9 [19.6%/ 1.7%] -34.5%   100.0% |
> '--'
> 

That's really nice and clear.

We start to see the penalty for the coresched at 32/32, leaving some cpus more 
idle than otherwise.  
But it's pretty good overall, for this benchmark at least.

Is this with stock v2 or with any of the fixes posted after? I wonder how much
the fixes for the race that violates the rule affect this, for example. 



Cheers,
Phil


> Thanks,
> -Aubrey

-- 


Re: [RFC PATCH v2 12/17] sched: A quick and dirty cgroup tagging interface

2019-04-26 Thread Phil Auld
On Fri, Apr 26, 2019 at 04:13:07PM +0200 Peter Zijlstra wrote:
> On Thu, Apr 25, 2019 at 10:26:53AM -0400, Phil Auld wrote:
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index e8e5f26db052..b312ea1e28a4 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -7541,6 +7541,9 @@ static int cpu_core_tag_write_u64(struct 
> > cgroup_subsys_state *css, struct cftype
> > if (val > 1)
> > return -ERANGE;
> >  
> > +   if (num_online_cpus() <= 1) 
> > +   return -EINVAL;
> 
> We actually know if there SMT on the system or not, which is much better
> indication still:
> 
>   if (!static_branch_likely(&sched_smt_present))
>   return -EINVAL;
> 
> > if (tg->tagged == !!val)
> > return 0;
> >  
> > 
> > 
> > 
> > -- 

Yeah, I thought there was probably a better check.
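
For completeness, roughly how that would slot into the quoted hunk above
(sketch only, untested):

	static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css,
					  struct cftype *cft, u64 val)
	{
		struct task_group *tg = css_tg(css);

		if (val > 1)
			return -ERANGE;

		/* Core scheduling is pointless without SMT siblings. */
		if (!static_branch_likely(&sched_smt_present))
			return -EINVAL;

		if (tg->tagged == !!val)
			return 0;
		/* ... rest unchanged ... */
	}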

Thanks!

-- 


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-26 Thread Phil Auld
On Thu, Apr 25, 2019 at 08:53:43PM +0200 Ingo Molnar wrote:
> Interesting. This strongly suggests sub-optimal SMT-scheduling in the 
> non-saturated HT case, i.e. a scheduler balancing bug.
> 
> As long as loads are clearly below the physical cores count (which they 
> are in the early phases of your table) the scheduler should spread tasks 
> without overlapping two tasks on the same core.
> 
> Clearly it doesn't.
> 

That's especially true if cgroups with different numbers of tasks in them
are involved. 

Here's an example showing the average number of tasks on each of the 4 numa
nodes during a test run. 20 cpus per node. There are 78 threads total, 76
for lu and 2 stress cpu hogs. So fewer than the 80 CPUs on the box. The GROUP
test has the two stresses and lu in distinct cgroups. The NORMAL test has them
all in one. This is from 5.0-rc3+, but the version doesn't matter. It's
reproducible on any kernel. SMT is on, but that also doesn't matter here.

The first two lines show where the stress jobs ran and the second show where
the 76 threads of lu ran.

GROUP_1.stress.ps.numa.hist  Average1.00   1.00
NORMAL_1.stress.ps.numa.hist Average0.00   1.10   0.90

lu.C.x_76_GROUP_1.ps.numa.hist   Average10.97  11.78  26.28  26.97
lu.C.x_76_NORMAL_1.ps.numa.hist  Average19.70  18.70  17.80  19.80

The NORMAL test is evenly balanced across the 20 cpus per numa node.  There
is between a 4x and 10x performance hit to the lu benchmark between group
and normal in any of these test runs. In this particular case it was 10x:

76_GROUPMop/s===
min q1  median  q3  max
3776.51 3776.51 3776.51 3776.51 3776.51
76_GROUPtime
min q1  median  q3  max
539.92  539.92  539.92  539.92  539.92
76_NORMALMop/s===
min q1  median  q3  max
39386   39386   39386   39386   39386
76_NORMALtime
min q1  median  q3  max
51.77   51.77   51.77   51.77   51.77


This is a bit off topic, but since balancing bugs were mentioned and I've been
trying to track this down for a while (and learning the scheduler code in
the process) I figured I'd just throw it out there :)


Cheers,
Phil

-- 


Re: [RFC PATCH v2 11/17] sched: Basic tracking of matching tasks

2019-04-25 Thread Phil Auld
On Wed, Apr 24, 2019 at 08:43:36PM + Vineeth Remanan Pillai wrote:
> > A minor nitpick.  I find keeping the vruntime base readjustment in
> > core_prio_less probably is more straight forward rather than pass a
> > core_cmp bool around.
> 
> The reason I moved the vruntime base adjustment to __prio_less is
> because, the vruntime seemed alien to __prio_less when looked as
> a standalone function.
> 
> I do not have a strong opinion on both. Probably a better approach
> would be to replace both cpu_prio_less/core_prio_less with prio_less
> which takes the third arguement 'bool on_same_rq'?
> 

Fwiw, I find the two names easier to read than a boolean flag. Could still
be wrapped to a single implementation I suppose. 

An enum to control cpu or core would be more readable, but probably overkill... 
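
Something along these lines, purely for illustration (names made up):

	/* One comparator with an explicit scope instead of a bare bool. */
	enum prio_cmp_scope {
		PRIO_CMP_CPU,	/* same rq: vruntimes directly comparable */
		PRIO_CMP_CORE,	/* across SMT siblings of a core */
	};

	static inline bool prio_less(struct task_struct *a, struct task_struct *b,
				     enum prio_cmp_scope scope);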


Cheers,
Phil


> Thanks

-- 


Re: [RFC PATCH v2 12/17] sched: A quick and dirty cgroup tagging interface

2019-04-25 Thread Phil Auld
On Tue, Apr 23, 2019 at 04:18:17PM + Vineeth Remanan Pillai wrote:
> From: Peter Zijlstra (Intel) 
> 
> Marks all tasks in a cgroup as matching for core-scheduling.
> 
> Signed-off-by: Peter Zijlstra (Intel) 
> ---
>  kernel/sched/core.c  | 62 
>  kernel/sched/sched.h |  4 +++
>  2 files changed, 66 insertions(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5066a1493acf..e5bdc1c4d8d7 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6658,6 +6658,15 @@ static void sched_change_group(struct task_struct 
> *tsk, int type)
>   tg = container_of(task_css_check(tsk, cpu_cgrp_id, true),
> struct task_group, css);
>   tg = autogroup_task_group(tsk, tg);
> +
> +#ifdef CONFIG_SCHED_CORE
> + if ((unsigned long)tsk->sched_task_group == tsk->core_cookie)
> + tsk->core_cookie = 0UL;
> +
> + if (tg->tagged /* && !tsk->core_cookie ? */)
> + tsk->core_cookie = (unsigned long)tg;
> +#endif
> +
>   tsk->sched_task_group = tg;
>  
>  #ifdef CONFIG_FAIR_GROUP_SCHED
> @@ -7117,6 +7126,43 @@ static u64 cpu_rt_period_read_uint(struct 
> cgroup_subsys_state *css,
>  }
>  #endif /* CONFIG_RT_GROUP_SCHED */
>  
> +#ifdef CONFIG_SCHED_CORE
> +static u64 cpu_core_tag_read_u64(struct cgroup_subsys_state *css, struct 
> cftype *cft)
> +{
> + struct task_group *tg = css_tg(css);
> +
> + return !!tg->tagged;
> +}
> +
> +static int cpu_core_tag_write_u64(struct cgroup_subsys_state *css, struct 
> cftype *cft, u64 val)
> +{
> + struct task_group *tg = css_tg(css);
> + struct css_task_iter it;
> + struct task_struct *p;
> +
> + if (val > 1)
> + return -ERANGE;
> +
> + if (tg->tagged == !!val)
> + return 0;
> +
> + tg->tagged = !!val;
> +
> + if (!!val)
> + sched_core_get();
> +
> + css_task_iter_start(css, 0, &it);
> + while ((p = css_task_iter_next(&it)))
> + p->core_cookie = !!val ? (unsigned long)tg : 0UL;
> + css_task_iter_end(&it);
> +
> + if (!val)
> + sched_core_put();
> +
> + return 0;
> +}
> +#endif
> +
>  static struct cftype cpu_legacy_files[] = {
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   {
> @@ -7152,6 +7198,14 @@ static struct cftype cpu_legacy_files[] = {
>   .read_u64 = cpu_rt_period_read_uint,
>   .write_u64 = cpu_rt_period_write_uint,
>   },
> +#endif
> +#ifdef CONFIG_SCHED_CORE
> + {
> + .name = "tag",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_tag_read_u64,
> + .write_u64 = cpu_core_tag_write_u64,
> + },
>  #endif
>   { } /* Terminate */
>  };
> @@ -7319,6 +7373,14 @@ static struct cftype cpu_files[] = {
>   .seq_show = cpu_max_show,
>   .write = cpu_max_write,
>   },
> +#endif
> +#ifdef CONFIG_SCHED_CORE
> + {
> + .name = "tag",
> + .flags = CFTYPE_NOT_ON_ROOT,
> + .read_u64 = cpu_core_tag_read_u64,
> + .write_u64 = cpu_core_tag_write_u64,
> + },
>  #endif
>   { } /* terminate */
>  };
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 42dd620797d7..16fb236eab7b 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -363,6 +363,10 @@ struct cfs_bandwidth {
>  struct task_group {
>   struct cgroup_subsys_state css;
>  
> +#ifdef CONFIG_SCHED_CORE
> + int tagged;
> +#endif
> +
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>   /* schedulable entities of this group on each CPU */
>   struct sched_entity **se;
> -- 
> 2.17.1
> 

Since CPU0 never goes through the cpu add code it will never get initialized if
it's the only cpu, and then enabling core scheduling and adding a task crashes.

Since there is no point in using core sched in this case, maybe just disallow it
with something like the below?


Cheers,
Phil

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e8e5f26db052..b312ea1e28a4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7541,6 +7541,9 @@ static int cpu_core_tag_write_u64(struct 
cgroup_subsys_state *css, struct cftype
if (val > 1)
return -ERANGE;
 
+   if (num_online_cpus() <= 1) 
+   return -EINVAL;
+
if (tg->tagged == !!val)
return 0;
 



-- 


Re: [RFC PATCH v2 00/17] Core scheduling v2

2019-04-23 Thread Phil Auld
Hi,

On Tue, Apr 23, 2019 at 04:18:05PM + Vineeth Remanan Pillai wrote:
> Second iteration of the core-scheduling feature.

Thanks for spinning V2 of this.

> 
> This version fixes apparent bugs and performance issues in v1. This
> doesn't fully address the issue of core sharing between processes
> with different tags. Core sharing still happens 1% to 5% of the time
> based on the nature of workload and timing of the runnable processes.
> 
> Changes in v2
> -
> - rebased on mainline commit: 6d906f99817951e2257d577656899da02bb33105
> - Fixes for couple of NULL pointer dereference crashes
>   - Subhra Mazumdar
>   - Tim Chen
> - Improves priority comparison logic for process in different cpus
>   - Peter Zijlstra
>   - Aaron Lu
> - Fixes a hard lockup in rq locking
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fixes a performance issue seen on IO heavy workloads
>   - Vineeth Pillai
>   - Julien Desfossez
> - Fix for 32bit build
>   - Aubrey Li
> 
> Issues
> --
> - Processes with different tags can still share the core

I may have missed something... Could you explain this statement?

This, to me, is the whole point of the patch series. If it's not
doing this then ... what?



Thanks,
Phil



> - A crash when disabling cpus with core-scheduling on
>- https://paste.debian.net/plainh/fa6bcfa8
> 
> ---
> 
> Peter Zijlstra (16):
>   stop_machine: Fix stop_cpus_in_progress ordering
>   sched: Fix kerneldoc comment for ia64_set_curr_task
>   sched: Wrap rq::lock access
>   sched/{rt,deadline}: Fix set_next_task vs pick_next_task
>   sched: Add task_struct pointer to sched_class::set_curr_task
>   sched/fair: Export newidle_balance()
>   sched: Allow put_prev_task() to drop rq->lock
>   sched: Rework pick_next_task() slow-path
>   sched: Introduce sched_class::pick_task()
>   sched: Core-wide rq->lock
>   sched: Basic tracking of matching tasks
>   sched: A quick and dirty cgroup tagging interface
>   sched: Add core wide task selection and scheduling.
>   sched/fair: Add a few assertions
>   sched: Trivial forced-newidle balancer
>   sched: Debug bits...
> 
> Vineeth Remanan Pillai (1):
>   sched: Wake up sibling if it has something to run
> 
>  include/linux/sched.h|   9 +-
>  kernel/Kconfig.preempt   |   7 +-
>  kernel/sched/core.c  | 800 +--
>  kernel/sched/cpuacct.c   |  12 +-
>  kernel/sched/deadline.c  |  99 +++--
>  kernel/sched/debug.c |   4 +-
>  kernel/sched/fair.c  | 137 +--
>  kernel/sched/idle.c  |  42 +-
>  kernel/sched/pelt.h  |   2 +-
>  kernel/sched/rt.c|  96 +++--
>  kernel/sched/sched.h | 185 ++---
>  kernel/sched/stop_task.c |  35 +-
>  kernel/sched/topology.c  |   4 +-
>  kernel/stop_machine.c|   2 +
>  14 files changed, 1145 insertions(+), 289 deletions(-)
> 
> -- 
> 2.17.1
> 

-- 


Re: [tip:sched/urgent] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

2019-04-16 Thread Phil Auld


Hi Sasha,

On Tue, Apr 16, 2019 at 08:32:09AM -0700 tip-bot for Phil Auld wrote:
> Commit-ID:  2e8e19226398db8265a8e675fcc0118b9e80c9e8
> Gitweb: 
> https://git.kernel.org/tip/2e8e19226398db8265a8e675fcc0118b9e80c9e8
> Author:     Phil Auld 
> AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
> Committer:  Ingo Molnar 
> CommitDate: Tue, 16 Apr 2019 16:50:05 +0200
> 
> sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup
> 
> With extremely short cfs_period_us setting on a parent task group with a large
> number of children the for loop in sched_cfs_period_timer() can run until the
> watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
> will ever return 0.  The large number of children can make
> do_sched_cfs_period_timer() take longer than the period.
> 
>  NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
>  RIP: 0010:tg_nop+0x0/0x10
>   <IRQ>
>   walk_tg_tree_from+0x29/0xb0
>   unthrottle_cfs_rq+0xe0/0x1a0
>   distribute_cfs_runtime+0xd3/0xf0
>   sched_cfs_period_timer+0xcb/0x160
>   ? sched_cfs_slack_timer+0xd0/0xd0
>   __hrtimer_run_queues+0xfb/0x270
>   hrtimer_interrupt+0x122/0x270
>   smp_apic_timer_interrupt+0x6a/0x140
>   apic_timer_interrupt+0xf/0x20
>   </IRQ>
> 
> To prevent this we add protection to the loop that detects when the loop has
> run too many times and scales the period and quota up, proportionally, so
> that the timer can complete before the next period expires.  This preserves
> the relative runtime quota while preventing the hard lockup.
> 
> A warning is issued reporting this state and the new values.
> 
> Signed-off-by: Phil Auld 
> Signed-off-by: Peter Zijlstra (Intel) 
> Cc: 
> Cc: Anton Blanchard 
> Cc: Ben Segall 
> Cc: Linus Torvalds 
> Cc: Peter Zijlstra 
> Cc: Thomas Gleixner 
> Link: https://lkml.kernel.org/r/20190319130005.25492-1-pa...@redhat.com
> Signed-off-by: Ingo Molnar 
> ---

The above commit won't work on the stable trees. Below is an updated version
that will work on v5.0.7, v4.19.34, v4.14.111, v4.9.168, and v4.4.178 with 
increasing offsets. I believe v3.18.138 will require more, so that one is not 
included.

There is only a minor change to context; none of the actual changes in the patch
are different.
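
As a quick sanity check of the math: each time the "count > 3" path is taken
the period is stretched by 147/128 (~15%) and the quota is scaled by the same
factor, so the quota/period ratio is preserved. A stand-alone user-space
illustration with made-up starting values (not kernel code):

	#include <stdio.h>
	#include <stdint.h>

	int main(void)
	{
		uint64_t period = 100000;			/* 100us, in ns */
		uint64_t quota  = 50000;			/* 50us,  in ns */
		const uint64_t max_period = 1000000000ULL;	/* 1s cap */

		for (int i = 0; i < 3; i++) {
			uint64_t new = period * 147 / 128;	/* ~115% */
			if (new > max_period)
				new = max_period;
			quota  = quota * new / period;	/* keep quota/period constant */
			period = new;
			printf("period=%llu ns quota=%llu ns ratio=%.3f\n",
			       (unsigned long long)period,
			       (unsigned long long)quota,
			       (double)quota / period);
		}
		return 0;
	}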


Thanks,
Phil
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 310d0637fe4b..f0380229b6f2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4859,12 +4859,15 @@ static enum hrtimer_restart 
sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
struct cfs_bandwidth *cfs_b =
container_of(timer, struct cfs_bandwidth, period_timer);
int overrun;
int idle = 0;
+   int count = 0;
 
raw_spin_lock(&cfs_b->lock);
for (;;) {
@@ -4872,6 +4875,28 @@ static enum hrtimer_restart 
sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
 
+   if (++count > 3) {
+   u64 new, old = ktime_to_ns(cfs_b->period);
+
+   new = (old * 147) / 128; /* ~115% */
+   new = min(new, max_cfs_quota_period);
+
+   cfs_b->period = ns_to_ktime(new);
+
+   /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+   cfs_b->quota *= new;
+   cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+   pr_warn_ratelimited(
+"cfs_period_timer[cpu%d]: period too short, scaling up (new 
cfs_period_us %lld, cfs_quota_us = %lld)\n",
+   smp_processor_id(),
+   div_u64(new, NSEC_PER_USEC),
+div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+   /* reset count so we don't come right back in here */
+   count = 0;
+   }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun);
}
if (idle)



-- 


Re: [tip:sched/core] sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup

2019-04-16 Thread Phil Auld
On Tue, Apr 09, 2019 at 03:05:27PM +0200 Peter Zijlstra wrote:
> On Tue, Apr 09, 2019 at 08:48:16AM -0400, Phil Auld wrote:
> > Hi Ingo, Peter,
> > 
> > On Wed, Apr 03, 2019 at 01:38:39AM -0700 tip-bot for Phil Auld wrote:
> > > Commit-ID:  06ec5d30e8d57b820d44df6340dcb25010d6d0fa
> > > Gitweb: 
> > > https://git.kernel.org/tip/06ec5d30e8d57b820d44df6340dcb25010d6d0fa
> > > Author: Phil Auld 
> > > AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
> > > Committer:  Ingo Molnar 
> > > CommitDate: Wed, 3 Apr 2019 09:50:23 +0200
> > 
> > This commit seems to have gotten lost. It's not in tip and now the 
> > direct gitweb link is also showing a bad commit reference. 
> > 
> > Did this fall victim to a reset or something?
> 
> It had (trivial) build fails on 32 bit. I have a fixed up version
> around somewhere, but that hasn't made it back in yet.


Thank you :)

I'll post the follow up version for the stable trees soon.


-- 


[tip:sched/urgent] sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

2019-04-16 Thread tip-bot for Phil Auld
Commit-ID:  2e8e19226398db8265a8e675fcc0118b9e80c9e8
Gitweb: https://git.kernel.org/tip/2e8e19226398db8265a8e675fcc0118b9e80c9e8
Author: Phil Auld 
AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
Committer:  Ingo Molnar 
CommitDate: Tue, 16 Apr 2019 16:50:05 +0200

sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

With an extremely short cfs_period_us setting on a parent task group with a large
number of children, the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0.  The large number of children can make
do_sched_cfs_period_timer() take longer than the period.

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
 RIP: 0010:tg_nop+0x0/0x10
  
  walk_tg_tree_from+0x29/0xb0
  unthrottle_cfs_rq+0xe0/0x1a0
  distribute_cfs_runtime+0xd3/0xf0
  sched_cfs_period_timer+0xcb/0x160
  ? sched_cfs_slack_timer+0xd0/0xd0
  __hrtimer_run_queues+0xfb/0x270
  hrtimer_interrupt+0x122/0x270
  smp_apic_timer_interrupt+0x6a/0x140
  apic_timer_interrupt+0xf/0x20
  

To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the
timer can complete before the next period expires.  This preserves the relative
runtime quota while preventing the hard lockup.

A warning is issued reporting this state and the new values.
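
To illustrate the arithmetic, here is a small standalone userspace sketch, not
kernel code, using made-up starting values (100us period, 50us quota): scaling
both the period and the quota by the same ~147/128 factor keeps the quota/period
ratio, i.e. the group's relative bandwidth, unchanged while the period grows by
roughly 15% per adjustment up to the 1s cap.

/* Illustrative userspace sketch of the scaling in sched_cfs_period_timer().
 * Not kernel code; the starting period/quota values are hypothetical.
 * Build with: cc -o scale scale.c
 */
#include <stdio.h>
#include <stdint.h>

#define NSEC_PER_USEC        1000ULL
#define MAX_CFS_QUOTA_PERIOD 1000000000ULL  /* 1s cap, as in the kernel */

int main(void)
{
    uint64_t period = 100 * NSEC_PER_USEC;  /* hypothetical: 100us period */
    uint64_t quota  =  50 * NSEC_PER_USEC;  /* hypothetical: 50us quota, 50% share */

    for (int i = 0; i < 5; i++) {
        uint64_t new = (period * 147) / 128;    /* ~115% */
        if (new > MAX_CFS_QUOTA_PERIOD)
            new = MAX_CFS_QUOTA_PERIOD;

        /* scale quota by the same factor so quota/period, the group's
         * bandwidth share, is preserved */
        quota  = quota * new / period;
        period = new;

        printf("step %d: period %llu us, quota %llu us, ratio %.3f\n",
               i + 1,
               (unsigned long long)(period / NSEC_PER_USEC),
               (unsigned long long)(quota / NSEC_PER_USEC),
               (double)quota / (double)period);
    }
    return 0;
}

Running it shows the printed ratio staying at the original share while the
period climbs, which is the property the patch relies on to keep the relative
runtime quota intact.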

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Anton Blanchard 
Cc: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pa...@redhat.com
Signed-off-by: Ingo Molnar 
---
 kernel/sched/fair.c | 25 +
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40bd1e27b1b7..a4d9e14bf138 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4885,6 +4885,8 @@ static enum hrtimer_restart sched_cfs_slack_timer(struct hrtimer *timer)
return HRTIMER_NORESTART;
 }
 
+extern const u64 max_cfs_quota_period;
+
 static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
 {
struct cfs_bandwidth *cfs_b =
@@ -4892,6 +4894,7 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
unsigned long flags;
int overrun;
int idle = 0;
+   int count = 0;
 
raw_spin_lock_irqsave(&cfs_b->lock, flags);
for (;;) {
@@ -4899,6 +4902,28 @@ static enum hrtimer_restart sched_cfs_period_timer(struct hrtimer *timer)
if (!overrun)
break;
 
+   if (++count > 3) {
+   u64 new, old = ktime_to_ns(cfs_b->period);
+
+   new = (old * 147) / 128; /* ~115% */
+   new = min(new, max_cfs_quota_period);
+
+   cfs_b->period = ns_to_ktime(new);
+
+   /* since max is 1s, this is limited to 1e9^2, which fits in u64 */
+   cfs_b->quota *= new;
+   cfs_b->quota = div64_u64(cfs_b->quota, old);
+
+   pr_warn_ratelimited(
+   "cfs_period_timer[cpu%d]: period too short, scaling up (new 
cfs_period_us %lld, cfs_quota_us = %lld)\n",
+   smp_processor_id(),
+   div_u64(new, NSEC_PER_USEC),
+   div_u64(cfs_b->quota, NSEC_PER_USEC));
+
+   /* reset count so we don't come right back in here */
+   count = 0;
+   }
+
idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
}
if (idle)


Re: [tip:sched/core] sched/fair: Limit sched_cfs_period_timer loop to avoid hard lockup

2019-04-16 Thread Phil Auld
On Tue, Apr 09, 2019 at 03:05:27PM +0200 Peter Zijlstra wrote:
> On Tue, Apr 09, 2019 at 08:48:16AM -0400, Phil Auld wrote:
> > Hi Ingo, Peter,
> > 
> > On Wed, Apr 03, 2019 at 01:38:39AM -0700 tip-bot for Phil Auld wrote:
> > > Commit-ID:  06ec5d30e8d57b820d44df6340dcb25010d6d0fa
> > > Gitweb: 
> > > https://git.kernel.org/tip/06ec5d30e8d57b820d44df6340dcb25010d6d0fa
> > > Author: Phil Auld 
> > > AuthorDate: Tue, 19 Mar 2019 09:00:05 -0400
> > > Committer:  Ingo Molnar 
> > > CommitDate: Wed, 3 Apr 2019 09:50:23 +0200
> > 
> > This commit seems to have gotten lost. It's not in tip and now the 
> > direct gitweb link is also showing a bad commit reference. 
> > 
> > Did this fall victim to a reset or something?
> 
> It had (trivial) build fails on 32 bit. I have a fixed up version
> around somewhere, but that hasn't made it back in yet.


Friendly ping...


Thanks,
Phil

-- 


Re: [PATCH v2] cpuset: restore sanity to cpuset_cpus_allowed_fallback()

2019-04-10 Thread Phil Auld
On Tue, Apr 09, 2019 at 04:40:03PM -0400 Joel Savitz wrote:
> If a process is limited by taskset (i.e. cpuset) to only be allowed to
> run on cpu N, and then cpu N is offlined via hotplug, the process will
> be assigned the current value of its cpuset cgroup's effective_cpus field
> in a call to do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> This argument's value does not make sense for this case, because
> task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
> to reflect the new value of cpu_active_mask after cpu N is removed from
> the mask. While this may make sense for the cgroup affinity mask, it
> does not make sense on a per-task basis, as a task that was previously
> limited to only be run on cpu N will be limited to every cpu _except_ for
> cpu N after it is offlined/onlined via hotplug.
> 
> Pre-patch behavior:
> 
>   $ grep Cpus /proc/$$/status
>   Cpus_allowed:   ff
>   Cpus_allowed_list:  0-7
> 
>   $ taskset -p 4 $$
>   pid 19202's current affinity mask: f
>   pid 19202's new affinity mask: 4
> 
>   $ grep Cpus /proc/self/status
>   Cpus_allowed:   04
>   Cpus_allowed_list:  2
> 
>   # echo off > /sys/devices/system/cpu/cpu2/online
>   $ grep Cpus /proc/$$/status
>   Cpus_allowed:   0b
>   Cpus_allowed_list:  0-1,3
> 
>   # echo on > /sys/devices/system/cpu/cpu2/online
>   $ grep Cpus /proc/$$/status
>   Cpus_allowed:   0b
>   Cpus_allowed_list:  0-1,3
> 
> On a patched system, the final grep produces the following
> output instead:
> 
>   $ grep Cpus /proc/$$/status
>   Cpus_allowed:   ff
>   Cpus_allowed_list:  0-7
> 
> This patch changes the above behavior by instead resetting the mask to
> task_cs(tsk)->cpus_allowed by default, and cpu_possible mask in legacy
> mode.
> 
> This fallback mechanism is only triggered if _every_ other valid avenue
> has been traveled, and it is the last resort before calling BUG().
> 
> Signed-off-by: Joel Savitz 
> ---
>  kernel/cgroup/cpuset.c | 15 ++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 4834c4214e9c..6c9deb2cc687 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -3255,10 +3255,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, 
> struct cpumask *pmask)
>   spin_unlock_irqrestore(&callback_lock, flags);
>  }
>  
> +/**
> + * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
> + * @tsk: pointer to task_struct with which the scheduler is struggling
> + *
> + * Description: In the case that the scheduler cannot find an allowed cpu in
> + * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
> + * mode however, this value is the same as task_cs(tsk)->effective_cpus,
> + * which will not contain a sane cpumask during cases such as cpu hotplugging.
> + * This is the absolute last resort for the scheduler and it is only used if
> + * _every_ other avenue has been traveled.
> + **/
> +
>  void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
>  {
>   rcu_read_lock();
> - do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
> + do_set_cpus_allowed(tsk, is_in_v2_mode() ?
> + task_cs(tsk)->cpus_allowed : cpu_possible_mask);
>   rcu_read_unlock();
>  
>   /*
> -- 
> 2.18.1
> 
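
For what it's worth, the Cpus_allowed_list values shown in the reproducer above
can also be read from a program. Below is a minimal sketch, assuming Linux with
glibc; it inspects the calling task itself (pid 0) rather than an arbitrary pid,
and is only an observation aid, not part of the patch.

/* Minimal sketch: print the calling task's allowed CPUs, the same
 * information as Cpus_allowed_list in /proc/<pid>/status.
 * Assumes Linux + glibc; error handling kept minimal.
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set)) {  /* 0 == current task */
        perror("sched_getaffinity");
        return 1;
    }

    printf("allowed cpus:");
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            printf(" %d", cpu);
    printf("\n");
    return 0;
}

Running it before and after the offline/online cycle in the reproducer gives the
same pre-/post-patch contrast as the grep output quoted above.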

Fwiw,

Acked-by: Phil Auld 


-- 

