Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Fri, 2013-07-19 at 20:37 +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:
> > N = 1
> > -----
> > 19.21%  reaim    [k] __read_lock_failed
> > 14.79%  reaim    [k] mspin_lock
> > 12.19%  reaim    [k] __write_lock_failed
> >  7.87%  reaim    [k] _raw_spin_lock
> >  2.03%  reaim    [k] start_this_handle
> >  1.98%  reaim    [k] update_sd_lb_stats
> >  1.92%  reaim    [k] mutex_spin_on_owner
> >  1.86%  reaim    [k] update_cfs_rq_blocked_load
> >  1.14%  swapper  [k] intel_idle
> >  1.10%  reaim    [.] add_long
> >  1.09%  reaim    [.] add_int
> >  1.08%  reaim    [k] load_balance
>
> But but but but.. wth is causing this? The only thing we do more of with
> N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
> and or mspin_lock() contention like that.
>
> There shouldn't be a rwlock_t in the entire scheduler; those things suck
> worse than quicksand.
>
> If, as Rik thought, we'd have more rq->lock contention, then I'd have
> expected _raw_spin_lock to be up highest.

For this particular fserver workload, that mutex was acquired in the
function calls from ext4_orphan_add() and ext4_orphan_del(). Those read
and write lock calls were from start_this_handle(). Although these
functions are not called within the idle_balance() code path,
update_sd_lb_stats(), tg_load_down(), idle_cpu(), spin_lock(), etc.
increase the time spent in the kernel, and that appears to be indirectly
causing more time to be spent acquiring those other kernel locks.

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:
> N = 1
> -----
> 19.21%  reaim    [k] __read_lock_failed
> 14.79%  reaim    [k] mspin_lock
> 12.19%  reaim    [k] __write_lock_failed
>  7.87%  reaim    [k] _raw_spin_lock
>  2.03%  reaim    [k] start_this_handle
>  1.98%  reaim    [k] update_sd_lb_stats
>  1.92%  reaim    [k] mutex_spin_on_owner
>  1.86%  reaim    [k] update_cfs_rq_blocked_load
>  1.14%  swapper  [k] intel_idle
>  1.10%  reaim    [.] add_long
>  1.09%  reaim    [.] add_int
>  1.08%  reaim    [k] load_balance

But but but but.. wth is causing this? The only thing we do more of with
N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
and or mspin_lock() contention like that.

There shouldn't be a rwlock_t in the entire scheduler; those things suck
worse than quicksand.

If, as Rik thought, we'd have more rq->lock contention, then I'd have
expected _raw_spin_lock to be up highest.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, 2013-07-18 at 07:59 -0400, Rik van Riel wrote:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> > > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > > needed to set N to more than 20 in order to get the big performance
> > > gains.
> > >
> > > One thing that I thought of was to have N be based on how often idle
> > > balance attempts do not pull task(s).
> > >
> > > For example, N can be calculated based on the number of idle balance
> > > attempts for the CPU since the last "successful" idle balance attempt.
> > > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > > then n = 30 / 5. So idle balance gets less time to run as the number of
> > > unneeded idle balance attempts increases, and thus N will not be set
> > > too high during situations where idle balancing is "successful" more
> > > often. Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such a high
> > N. But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.
>
> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

I collected some perf samples when running fserver with N=1 and N=60.

N = 1
-----
19.21%  reaim    [k] __read_lock_failed
14.79%  reaim    [k] mspin_lock
12.19%  reaim    [k] __write_lock_failed
 7.87%  reaim    [k] _raw_spin_lock
 2.03%  reaim    [k] start_this_handle
 1.98%  reaim    [k] update_sd_lb_stats
 1.92%  reaim    [k] mutex_spin_on_owner
 1.86%  reaim    [k] update_cfs_rq_blocked_load
 1.14%  swapper  [k] intel_idle
 1.10%  reaim    [.] add_long
 1.09%  reaim    [.] add_int
 1.08%  reaim    [k] load_balance

N = 60
------
 7.70%  reaim    [k] _raw_spin_lock
 7.25%  reaim    [k] mspin_lock
 6.30%  reaim    [.] add_long
 6.26%  reaim    [.] add_int
 4.05%  reaim    [.] strncat
 3.81%  reaim    [.] string_rtns_1
 3.66%  reaim    [.] div_long
 3.44%  reaim    [k] mutex_spin_on_owner
 2.91%  reaim    [.] add_short
 2.73%  swapper  [k] intel_idle
 2.65%  reaim    [k] __read_lock_failed

With idle_balance(), we get more contention in kernel functions such as
update_sd_lb_stats(), load_balance(), and spin_lock() for the rq lock.
Additionally, it greatly increases the time spent in the mutex's
mspin_lock(), __read_lock_failed(), and __write_lock_failed().

N needs to be large because the avg_idle time is still a lot higher than
the average time spent in each load_balance() call per sched domain.
Despite that high ratio of avg_idle time to time spent in
load_balance(), load_balance() still increases the time spent in the
kernel by quite a bit, probably because of how often it is being used.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, 2013-07-18 at 17:42 +0530, Srikar Dronamraju wrote:
> > > idle_balance(u64 idle_duration)
> > > {
> > > 	u64 cost = 0;
> > >
> > > 	for_each_domain(sd) {
> > > 		if (cost + sd->cost > idle_duration/N)
> > > 			break;
> > >
> > > 		...
> > >
> > > 		sd->cost = (sd->cost + this_cost) / 2;
> > > 		cost += this_cost;
> > > 	}
> > > }
> > >
> > > I would've initially suggested using something like N=2 since we're
> > > dealing with averages and half should ensure we don't run over except
> > > for the worst peaks. But we could easily use a bigger N.
> >
> > I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> > to set N to more than 20 in order to get the big performance gains.
>
> As per your observation, newly idle balancing isn't picking tasks and is
> mostly finding the domains already balanced. find_busiest_queue() is
> under rcu. So where and how are we getting these performance gains?

I actually just ran fserver on 8 sockets (a workload whose performance
idle balancing lowers at this socket count), and for this workload, idle
balancing is finding tasks to move fairly often on a per-CPU basis. So I
guess it is not always the case that idle balancing isn't moving tasks
on this box.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
* Peter Zijlstra [2013-07-18 14:35:31]:
> On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> > We take locks if and only if we see an imbalance and want to pull
> > tasks. However, if the newly idle balance is not finding an imbalance,
> > then this may not be an issue.
> >
> > Probably /proc/schedstat will give a better picture.
>
> Right, so we're interested in move_tasks() calls that fail to 'deliver'.
> There's a few conditions in there that can cause us to not move a task,
> most of them not counted.
>
> The few that are counted come from can_migrate_task():
>
> 	se.statistics.nr_failed_migrations_affine
> 	se.statistics.nr_failed_migrations_running
> 	se.statistics.nr_failed_migrations_hot
>
> If we see significant increments on those, we'll be taking locks.

Agreed. I think the number of times no busy group was found, and the
number of times no busy queue was found, will also tell us that locks
are not being taken. In schedstats, I generally see them as the
overwhelming majority.

> The only one I can see a good way around is the hot one; we could ignore
> hotness in favour of newidle -- although I could see that being
> detrimental, we'll just have to try or so ;-)
>
> _running shouldn't be much of a problem since we don't bother if
> nr_running <= 1. And _affine is out of our reach anyway.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> We take locks if and only if we see an imbalance and want to pull
> tasks. However, if the newly idle balance is not finding an imbalance,
> then this may not be an issue.
>
> Probably /proc/schedstat will give a better picture.

Right, so we're interested in move_tasks() calls that fail to 'deliver'.
There's a few conditions in there that can cause us to not move a task,
most of them not counted.

The few that are counted come from can_migrate_task():

	se.statistics.nr_failed_migrations_affine
	se.statistics.nr_failed_migrations_running
	se.statistics.nr_failed_migrations_hot

If we see significant increments on those, we'll be taking locks.

The only one I can see a good way around is the hot one; we could ignore
hotness in favour of newidle -- although I could see that being
detrimental, we'll just have to try or so ;-)

_running shouldn't be much of a problem since we don't bother if
nr_running <= 1. And _affine is out of our reach anyway.
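[The three counted failure modes above can be illustrated with a small
user-space sketch. The function and enum names here are invented for
illustration; the real logic is the kernel's can_migrate_task() in
kernel/sched/fair.c, and the "ignore hotness for newidle" branch encodes
Peter's suggestion, not current behaviour.]

```c
#include <stdbool.h>

/* Invented names, illustration only. Each failure branch corresponds to
 * one of the se.statistics.nr_failed_migrations_* counters Peter lists. */
enum migrate_result {
	MIGRATE_OK,
	FAIL_AFFINE,	/* nr_failed_migrations_affine */
	FAIL_RUNNING,	/* nr_failed_migrations_running */
	FAIL_HOT,	/* nr_failed_migrations_hot */
};

enum migrate_result can_migrate_sim(bool cpu_allowed, bool task_running,
				    bool cache_hot, bool newidle)
{
	if (!cpu_allowed)
		return FAIL_AFFINE;	/* dst CPU not in the task's affinity mask */
	if (task_running)
		return FAIL_RUNNING;	/* task is currently on a CPU */
	/* Peter's idea: for newidle balancing, don't let cache hotness
	 * block the migration. */
	if (cache_hot && !newidle)
		return FAIL_HOT;
	return MIGRATE_OK;
}
```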
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
* Rik van Riel [2013-07-18 07:59:22]:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> > > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > > needed to set N to more than 20 in order to get the big performance
> > > gains.
> > >
> > > One thing that I thought of was to have N be based on how often idle
> > > balance attempts do not pull task(s).
> > >
> > > For example, N can be calculated based on the number of idle balance
> > > attempts for the CPU since the last "successful" idle balance attempt.
> > > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > > then n = 30 / 5. So idle balance gets less time to run as the number of
> > > unneeded idle balance attempts increases, and thus N will not be set
> > > too high during situations where idle balancing is "successful" more
> > > often. Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such a high
> > N. But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.

We take locks if and only if we see an imbalance and want to pull
tasks. However, if the newly idle balance is not finding an imbalance,
then this may not be an issue.

Probably /proc/schedstat will give a better picture.

> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
> > idle_balance(u64 idle_duration)
> > {
> > 	u64 cost = 0;
> >
> > 	for_each_domain(sd) {
> > 		if (cost + sd->cost > idle_duration/N)
> > 			break;
> >
> > 		...
> >
> > 		sd->cost = (sd->cost + this_cost) / 2;
> > 		cost += this_cost;
> > 	}
> > }
> >
> > I would've initially suggested using something like N=2 since we're
> > dealing with averages and half should ensure we don't run over except
> > for the worst peaks. But we could easily use a bigger N.
>
> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.

As per your observation, newly idle balancing isn't picking tasks and is
mostly finding the domains already balanced. find_busiest_queue() is
under rcu. So where and how are we getting these performance gains?

Is it that tasks are getting woken up and queued while the cpu is doing
newly idle load balance? Or is it that the regular CPU_IDLE balancing
which follows idle_balance() does more aggressive balancing and hence is
able to find a task to balance?

--
Thanks and Regards
Srikar Dronamraju
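[Peter's pseudocode quoted above can be fleshed out into a user-space
simulation. The fixed domain array, the struct name, and the choice of
N=10 are invented for illustration; in the kernel, sd->cost would be a
measured running average per sched domain and the loop would walk the
real domain hierarchy.]

```c
#define N 10	/* budget: at most 1/N of the idle time goes to balancing */

struct domain_sim {
	unsigned long long cost;	/* running average balance cost here */
};

/* Returns how many domains fit within the idle-time budget, mirroring
 * the early break in Peter's idle_balance() sketch. */
int idle_balance_sim(struct domain_sim *sd, int nr_domains,
		     unsigned long long idle_duration)
{
	unsigned long long cost = 0;
	int visited = 0;

	for (int i = 0; i < nr_domains; i++) {
		if (cost + sd[i].cost > idle_duration / N)
			break;

		/* stand-in for the time this balance pass actually took;
		 * here we just reuse the stored average */
		unsigned long long this_cost = sd[i].cost;

		/* halving running average, as in the pseudocode */
		sd[i].cost = (sd[i].cost + this_cost) / 2;
		cost += this_cost;
		visited++;
	}
	return visited;
}
```

With idle_duration = 1000 and N = 10 the budget is 100, so domains
costing 30 and 50 fit but a third costing 40 is skipped.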
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
>
> > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > needed to set N to more than 20 in order to get the big performance
> > gains.
> >
> > One thing that I thought of was to have N be based on how often idle
> > balance attempts do not pull task(s).
> >
> > For example, N can be calculated based on the number of idle balance
> > attempts for the CPU since the last "successful" idle balance attempt.
> > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > then n = 30 / 5. So idle balance gets less time to run as the number of
> > unneeded idle balance attempts increases, and thus N will not be set
> > too high during situations where idle balancing is "successful" more
> > often. Any comments on this idea?
>
> It would be good to get a solid explanation for why we need such a high
> N. But yes that might work.

I have some idea, though no proof :)

I suspect a lot of the idle balancing time is spent waiting for
and acquiring the runqueue locks of remote CPUs.

If we spend half our idle time causing contention to remote
runqueue locks, we could be a big factor in keeping those other
CPUs from getting work done.

--
All rights reversed
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.
>
> One thing that I thought of was to have N be based on how often idle
> balance attempts do not pull task(s).
>
> For example, N can be calculated based on the number of idle balance
> attempts for the CPU since the last "successful" idle balance attempt.
> So if the previous 30 idle balance attempts resulted in no tasks moved,
> then n = 30 / 5. So idle balance gets less time to run as the number of
> unneeded idle balance attempts increases, and thus N will not be set too
> high during situations where idle balancing is "successful" more often.
> Any comments on this idea?

It would be good to get a solid explanation for why we need such a high
N. But yes, that might work.
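[Jason's heuristic, as quoted above, can be sketched as a small pure
function. The divisor 5 comes from his "n = 30 / 5" example; the clamp
to a minimum of 2 is an assumption here, echoing Peter's initial N=2
suggestion rather than anything in the thread.]

```c
/* Illustrative sketch only, not kernel code: scale N with the number of
 * consecutive idle-balance attempts that moved no task, so a CPU whose
 * idle balancing keeps failing gets a smaller time budget. */
unsigned int compute_n(unsigned int failed_attempts_since_success)
{
	unsigned int n = failed_attempts_since_success / 5;

	/* assumed floor: never below Peter's suggested baseline of N=2 */
	if (n < 2)
		n = 2;
	return n;
}
```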
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> > On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> > > >So the way I see things is that the only way newidle balance can slow down
> > > >things is if it runs when we could have ran something useful.
> >
> > Due to contention on the runqueue locks of other CPUs,
> > newidle also has the potential to keep _others_ from
> > running something useful.
>
> Right, although that should only happen when we do have an imbalance and want
> to go move something. Which in Jason's case is 'rare'. But yes, I suppose
> there's other scenarios where this is far more likely.
>
> > Could we prevent that downside by measuring both the
> > time spent idle, and the time spent in idle balancing,
> > and making sure the idle balancing time never exceeds
> > more than N% of the idle time?
>
> Sure:
>
> idle_balance(u64 idle_duration)
> {
> 	u64 cost = 0;
>
> 	for_each_domain(sd) {
> 		if (cost + sd->cost > idle_duration/N)
> 			break;
>
> 		...
>
> 		sd->cost = (sd->cost + this_cost) / 2;
> 		cost += this_cost;
> 	}
> }
>
> I would've initially suggested using something like N=2 since we're dealing
> with averages and half should ensure we don't run over except for the worst
> peaks. But we could easily use a bigger N.

I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
to set N to more than 20 in order to get the big performance gains.

One thing that I thought of was to have N be based on how often idle
balance attempts do not pull task(s). For example, N can be calculated
based on the number of idle balance attempts for the CPU since the last
"successful" idle balance attempt. So if the previous 30 idle balance
attempts resulted in no tasks moved, then n = 30 / 5. So idle balance
gets less time to run as the number of unneeded idle balance attempts
increases, and thus N will not be set too high during situations where
idle balancing is "successful" more often.

Any comments on this idea?

Thanks,
Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> > On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> > > >So the way I see things is that the only way newidle balance can slow down
> > > >things is if it runs when we could have ran something useful.
> >
> > Due to contention on the runqueue locks of other CPUs,
> > newidle also has the potential to keep _others_ from
> > running something useful.
>
> Right, although that should only happen when we do have an imbalance and want
> to go move something. Which in Jason's case is 'rare'. But yes, I suppose
> there's other scenarios where this is far more likely.
>
> > Could we prevent that downside by measuring both the
> > time spent idle, and the time spent in idle balancing,
> > and making sure the idle balancing time never exceeds
> > more than N% of the idle time?
>
> Sure:
>
> idle_balance(u64 idle_duration)
> {
> 	u64 cost = 0;
>
> 	for_each_domain(sd) {
> 		if (cost + sd->cost > idle_duration/N)
> 			break;
>
> 		...
>
> 		sd->cost = (sd->cost + this_cost) / 2;
> 		cost += this_cost;
> 	}
> }
>
> I would've initially suggested using something like N=2 since we're dealing
> with averages and half should ensure we don't run over except for the worst
> peaks. But we could easily use a bigger N.

Okay, I'll try this out. Thank you for your suggestions.

Jason.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> >So the way I see things is that the only way newidle balance can slow down
> >things is if it runs when we could have ran something useful.
>
> Due to contention on the runqueue locks of other CPUs,
> newidle also has the potential to keep _others_ from
> running something useful.

Right, although that should only happen when we do have an imbalance and want
to go move something. Which in Jason's case is 'rare'. But yes, I suppose
there's other scenarios where this is far more likely.

> Could we prevent that downside by measuring both the
> time spent idle, and the time spent in idle balancing,
> and making sure the idle balancing time never exceeds
> more than N% of the idle time?

Sure:

idle_balance(u64 idle_duration)
{
	u64 cost = 0;

	for_each_domain(sd) {
		if (cost + sd->cost > idle_duration/N)
			break;

		...

		sd->cost = (sd->cost + this_cost) / 2;
		cost += this_cost;
	}
}

I would've initially suggested using something like N=2 since we're dealing
with averages and half should ensure we don't run over except for the worst
peaks. But we could easily use a bigger N.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
> > Do you think its worth a try to consider each newidle balance attempt
> > as the total load_balance attempts until it is able to move a task, and
> > then skip balancing within the domain if a CPU's avg idle time is less
> > than that avg time doing newidle balance?
>
> So the way I see things is that the only way newidle balance can slow down
> things is if it runs when we could have ran something useful.

Due to contention on the runqueue locks of other CPUs, newidle also has
the potential to keep _others_ from running something useful.

> So all we need to ensure is to not run longer than we expect to be idle
> for and things should be 'free', right?

Could we prevent that downside by measuring both the time spent idle, and
the time spent in idle balancing, and making sure the idle balancing time
never exceeds more than N% of the idle time?

Say, have the CPU never spend more than 10% of its idle time in the idle
balancing code, as averaged over some time period? That way we might
still do idle balancing every X idle periods, even if the idle periods
themselves are relatively short.

It might also be enough to prevent excessive lock contention triggered by
the idle balancing code, though I have to admit I did not really think
that part through :)
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
>
> So if we have the following:
>
> for_each_domain(sd)
> 	before = sched_clock_cpu
> 	load_balance(sd)
> 	after = sched_clock_cpu
> 	idle_balance_completion_time = after - before
>
> At this point, the "idle_balance_completion_time" is usually a very
> small value and is usually a lot smaller than the avg CPU idle time.
> However, the vast majority of the time, load_balance returns 0.

I think the interesting question here is: is it significantly more when
we do find a task?

I would also expect sd->newidle_balance_cost (less typing there) to scale
with the number of CPUs in the domain - thus larger domains will take
longer etc. And (obviously) the cost of the entire newidle balance is the
direct sum of individual domain costs.

> Do you think its worth a try to consider each newidle balance attempt as
> the total load_balance attempts until it is able to move a task, and
> then skip balancing within the domain if a CPU's avg idle time is less
> than that avg time doing newidle balance?

So the way I see things is that the only way newidle balance can slow
down things is if it runs when we could have ran something useful.

So all we need to ensure is to not run longer than we expect to be idle
for and things should be 'free', right?

So the problem I have with your proposal is that supposing we're
successful once every 10 newidle balances. Then the
sd->newidle_balance_cost gets inflated by a factor 10 -- for we'd count
10 of them before 'success'.

However when we're idle for that amount of time (10 times longer than it
takes to do a single newidle balance) we'd still only do a single newidle
balance, not 10.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
Hi Peter,

On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> > For the more complex model, are you suggesting that each completion
> > time is the time it takes to complete 1 iteration of the
> > for_each_domain() loop?
>
> Per sd, yes? So higher domains (or lower depending on how you model the
> thing in your head) have bigger CPU spans, and thus take longer to
> complete. Imagine the top domain of a 4096 cpu system, it would go look
> at all cpus to see if it could find a task.
>
> > Based on some of the data I collected, a single iteration of the
> > for_each_domain() loop is almost always significantly lower than the
> > approximate CPU idle time, even in workloads where idle_balance is
> > lowering performance. The bigger issue is that it takes so many of
> > these attempts before idle_balance actually "worked" and pulls a task.
>
> I'm confused, so:
>
> schedule()
>   if (!rq->nr_running)
>     idle_balance()
>       for_each_domain(sd)
>         load_balance(sd)
>
> is the entire thing, there's no other loop in there.

So if we have the following:

for_each_domain(sd)
	before = sched_clock_cpu
	load_balance(sd)
	after = sched_clock_cpu
	idle_balance_completion_time = after - before

At this point, the "idle_balance_completion_time" is usually a very small
value and is usually a lot smaller than the avg CPU idle time. However,
the vast majority of the time, load_balance returns 0.

> > I initially was thinking about each "completion time" of an idle
> > balance as the sum total of the times of all iterations to complete
> > until a task is successfully pulled within each domain.
>
> So you're saying that normally idle_balance() won't find a task to pull?
> And we need many times going newidle before we do get something?

Yes, a while ago, I collected some data on the rate in which
idle_balance() does not pull tasks, and it was a very high number.

> Wouldn't this mean that there simply weren't enough tasks to keep all
> cpus busy?

If I remember correctly, in a lot of those load_balance attempts when the
machine is under a high Java load, there were no "imbalance" between the
groups in each sched_domain.

> If there were tasks we could've pulled, we might need to look at why
> they weren't and maybe fix that. Now it could be that it thinks this
> cpu, even with the (little) idle time it has, is sufficiently loaded and
> we'll get a 'local' wakeup soon enough. That's perfectly fine.
>
> What we should avoid is spending more time looking for tasks than we
> have idle, since that reduces the total time we can spend doing useful
> work. So that is I think the critical cut-off point.

Do you think its worth a try to consider each newidle balance attempt as
the total load_balance attempts until it is able to move a task, and then
skip balancing within the domain if a CPU's avg idle time is less than
that avg time doing newidle balance?

Thanks,
Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> For the more complex model, are you suggesting that each completion
> time is the time it takes to complete 1 iteration of the
> for_each_domain() loop?

Per sd, yes? So higher domains (or lower depending on how you model the
thing in your head) have bigger CPU spans, and thus take longer to
complete. Imagine the top domain of a 4096 cpu system, it would go look
at all cpus to see if it could find a task.

> Based on some of the data I collected, a single iteration of the
> for_each_domain() loop is almost always significantly lower than the
> approximate CPU idle time, even in workloads where idle_balance is
> lowering performance. The bigger issue is that it takes so many of
> these attempts before idle_balance actually "worked" and pulls a task.

I'm confused, so:

schedule()
  if (!rq->nr_running)
    idle_balance()
      for_each_domain(sd)
        load_balance(sd)

is the entire thing, there's no other loop in there.

> I initially was thinking about each "completion time" of an idle
> balance as the sum total of the times of all iterations to complete
> until a task is successfully pulled within each domain.

So you're saying that normally idle_balance() won't find a task to pull?
And we need many times going newidle before we do get something?

Wouldn't this mean that there simply weren't enough tasks to keep all
cpus busy?

If there were tasks we could've pulled, we might need to look at why they
weren't and maybe fix that. Now it could be that it thinks this cpu, even
with the (little) idle time it has, is sufficiently loaded and we'll get
a 'local' wakeup soon enough. That's perfectly fine.

What we should avoid is spending more time looking for tasks than we have
idle, since that reduces the total time we can spend doing useful work.
So that is I think the critical cut-off point.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 09:25 +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote:
> > On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> > > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> > > > there can be a lot of contention in idle_balance() and related functions.
> > > > On many AIM7 workloads in which CPUs go idle very often and idle balance
> > > > gets called a lot, it is actually lowering performance.
> > > >
> > > > Since idle balance often helps performance (when it is not overused), I
> > > > looked into trying to avoid attempting idle balance only when it is
> > > > occurring too frequently.
> > > >
> > > > This RFC patch attempts to keep track of the approximate "average" time between
> > > > idle balance attempts per CPU. Each time the idle_balance() function is
> > > > invoked, it will compute the duration since the last idle_balance() for
> > > > the current CPU. The avg time between idle balance attempts is then updated
> > > > using a very similar method as how rq->avg_idle is computed.
> > > >
> > > > Once the average time between idle balance attempts drops below a certain
> > > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> > > > for that CPU will be skipped. The average time between idle balances will
> > > > continue to be updated, even if it ends up getting skipped. The
> > > > initial/maximum average is set a lot higher though to make sure that the
> > > > avg doesn't fall below the threshold until the sample size is large and to
> > > > prevent the avg from being overestimated.
> > >
> > > One of the things I've been talking about for a while now is how I'd
> > > like to use the idle guestimator used for cpuidle for newidle balance.
> > >
> > > Basically based on the estimated idle time limit how far/wide you'll
> > > search for tasks to run.
> > >
> > > You can remove the sysctl and auto-tune by measuring how long it takes
> > > on avg to do a newidle balance.
> >
> > Hi Peter,
> >
> > When you say how long it takes on avg to do a newidle balance, are you
> > referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> > load_balance() to complete, or the avg time it takes for newidle balance
> > attempts within a domain to eventually successfully pull/move a task(s)?
>
> Both :-), being as the completion time would be roughly equivalent for the
> top domain and the entire call.
>
> So I suppose I was somewhat unclear :-) I initially started out with a
> simpler model, where you measure the avg time of the entire
> idle_balance() call and measure the avg idle time and compare the two.
>
> Then I progressed to the more complex model where you measure the
> completion time of each domain in the for_each_domain() iteration of
> idle_balance() and compare that against the estimated idle time, bailing
> out of the domain iteration when the avg completion time exceeds the
> expected idle time.

Hi Peter,

For the more complex model, are you suggesting that each completion time
is the time it takes to complete 1 iteration of the for_each_domain()
loop?

Based on some of the data I collected, a single iteration of the
for_each_domain() loop is almost always significantly lower than the
approximate CPU idle time, even in workloads where idle_balance is
lowering performance. The bigger issue is that it takes so many of these
attempts before idle_balance actually "worked" and pulls a task.

I initially was thinking about each "completion time" of an idle balance
as the sum total of the times of all iterations to complete until a task
is successfully pulled within each domain.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:25:04AM +0200, Peter Zijlstra wrote:
> One thing that I thought of since is that we need to consider what
> happens for people with a low resolution sched_clock. IIRC there are
> still platforms that are jiffy based.

Ignore that, they're all UP.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: > On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: > > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > > > there can be a lot of contention in idle_balance() and related functions. > > > On many AIM7 workloads in which CPUs go idle very often and idle balance > > > gets called a lot, it is actually lowering performance. > > > > > > Since idle balance often helps performance (when it is not overused), I > > > looked into trying to avoid attempting idle balance only when it is > > > occurring too frequently. > > > > > > This RFC patch attempts to keep track of the approximate "average" time > > > between > > > idle balance attempts per CPU. Each time the idle_balance() function is > > > invoked, it will compute the duration since the last idle_balance() for > > > the current CPU. The avg time between idle balance attempts is then > > > updated > > > using a very similar method as how rq->avg_idle is computed. > > > > > > Once the average time between idle balance attempts drops below a certain > > > value (which in this patch is sysctl_sched_idle_balance_limit), > > > idle_balance > > > for that CPU will be skipped. The average time between idle balances will > > > continue to be updated, even if it ends up getting skipped. The > > > initial/maximum average is set a lot higher though to make sure that the > > > avg doesn't fall below the threshold until the sample size is large and to > > > prevent the avg from being overestimated. > > > > One of the things I've been talking about for a while now is how I'd > > like to use the idle guestimator used for cpuidle for newidle balance. > > > > Basically based on the estimated idle time limit how far/wide you'll > > search for tasks to run. > > > > You can remove the sysctl and auto-tune by measuring how long it takes > > on avg to do a newidle balance. 
>
> Hi Peter,
>
> When you say how long it takes on avg to do a newidle balance, are you
> referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> load_balance() to complete, or the avg time it takes for newidle balance
> attempts within a domain to eventually successfully pull/move a task(s)?

Both :-), being as the completion time would be roughly equivalent for the
top domain and the entire call.

So I suppose I was somewhat unclear :-) I initially started out with a
simpler model, where you measure the avg time of the entire
idle_balance() call and measure the avg idle time and compare the two.

Then I progressed to the more complex model where you measure the
completion time of each domain in the for_each_domain() iteration of
idle_balance() and compare that against the estimated idle time, bailing
out of the domain iteration when the avg completion time exceeds the
expected idle time.

One thing that I thought of since is that we need to consider what
happens for people with a low resolution sched_clock. IIRC there are
still platforms that are jiffy based.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate average time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq-avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Both :-), being as the completion time would be roughly equivalent for the top domain and the entire call. So I suppose I was somewhat unclear :-) I initially started out with a simpler model, where you measure the avg time of the entire idle_balance() call and measure the avg idle time and compare the two. Then I progressed to the more complex model where you measure the completion time of each domain in the for_each_domain() iteration of idle_balance() and compare that against the estimated idle time, bailing out of the domain iteration when the avg completion time exceeds the expected idle time. One thing that I thought of since is that we need to consider what happens for people with a low resolution sched_clock. IIRC there are still platforms that are jiffy based. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:25:04AM +0200, Peter Zijlstra wrote: One thing that I thought of since is that we need to consider what happens for people with a low resolution sched_clock. IIRC there are still platforms that are jiffy based. Ignore that, they're all UP. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 09:25 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate average time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq-avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Both :-), being as the completion time would be roughly equivalent for the top domain and the entire call. So I suppose I was somewhat unclear :-) I initially started out with a simpler model, where you measure the avg time of the entire idle_balance() call and measure the avg idle time and compare the two. Then I progressed to the more complex model where you measure the completion time of each domain in the for_each_domain() iteration of idle_balance() and compare that against the estimated idle time, bailing out of the domain iteration when the avg completion time exceeds the expected idle time. Hi Peter, For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Based on some of the data I collected, a single iteration of the for_each_domain() loop is almost always significantly lower than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually worked and pulls a tasks. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. Jason -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote: For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Per sd, yes? So higher domains (or lower depending on how you model the thing in you head) have bigger CPU spans, and thus take longer to complete. Imagine the top domain of a 4096 cpu system, it would go look at all cpus to see if it could find a task. Based on some of the data I collected, a single iteration of the for_each_domain() loop is almost always significantly lower than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually worked and pulls a tasks. I'm confused, so: schedule() if (!rq-nr_running) idle_balance() for_each_domain(sd) load_balance(sd) is the entire thing, there's no other loop in there. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. So you're saying that normally idle_balance() won't find a task to pull? And we need many times going newidle before we do get something? Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy? If there were tasks we could've pulled, we might need to look at why they weren't and maybe fix that. Now it could be that it things this cpu, even with the (little) idle time it has is sufficiently loaded and we'll get a 'local' wakeup soon enough. That's perfectly fine. What we should avoid is spending more time looking for tasks then we have idle, since that reduces the total time we can spend doing useful work. So that is I think the critical cut-off point. 
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
Hi Peter, On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote: For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Per sd, yes? So higher domains (or lower, depending on how you model the thing in your head) have bigger CPU spans, and thus take longer to complete. Imagine the top domain of a 4096 cpu system; it would go look at all cpus to see if it could find a task. Based on some of the data I collected, a single iteration of the for_each_domain() loop almost always takes significantly less time than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually works and pulls a task. I'm confused, so:

  schedule()
    if (!rq->nr_running)
      idle_balance()
        for_each_domain(sd)
          load_balance(sd)

is the entire thing; there's no other loop in there. So if we have the following:

  for_each_domain(sd) {
      before = sched_clock_cpu()
      load_balance(sd)
      after = sched_clock_cpu()
      idle_balance_completion_time = after - before
  }

At this point, the idle_balance_completion_time is usually a very small value, usually a lot smaller than the avg CPU idle time. However, the vast majority of the time, load_balance() returns 0. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. So you're saying that normally idle_balance() won't find a task to pull? And we need to go newidle many times before we do get something? Yes, a while ago I collected some data on the rate at which idle_balance() does not pull tasks, and it was a very high number. Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?
If I remember correctly, in a lot of those load_balance attempts when the machine is under a high Java load, there was no imbalance between the groups in each sched_domain. If there were tasks we could've pulled, we might need to look at why they weren't and maybe fix that. Now it could be that it thinks this cpu, even with the (little) idle time it has, is sufficiently loaded and we'll get a 'local' wakeup soon enough. That's perfectly fine. What we should avoid is spending more time looking for tasks than we have idle, since that reduces the total time we can spend doing useful work. So that is, I think, the critical cut-off point. Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote: So if we have the following:

  for_each_domain(sd) {
      before = sched_clock_cpu()
      load_balance(sd)
      after = sched_clock_cpu()
      idle_balance_completion_time = after - before
  }

At this point, the idle_balance_completion_time is usually a very small value, usually a lot smaller than the avg CPU idle time. However, the vast majority of the time, load_balance() returns 0. I think the interesting question here is: is it significantly more when we do find a task? I would also expect sd->newidle_balance_cost (less typing there) to scale with the number of CPUs in the domain -- thus larger domains will take longer etc. And (obviously) the cost of the entire newidle balance is the direct sum of the individual domain costs. Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. So all we need to ensure is to not run longer than we expect to be idle for, and things should be 'free', right? So the problem I have with your proposal is that, supposing we're successful once every 10 newidle balances, the sd->newidle_balance_cost gets inflated by a factor of 10 -- for we'd count 10 of them before 'success'. However, when we're idle for that amount of time (10 times longer than it takes to do a single newidle balance) we'd still only do a single newidle balance, not 10.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/17/2013 12:18 PM, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote: Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. So all we need to ensure is to not run longer than we expect to be idle for, and things should be 'free', right? So the problem I have with your proposal is that, supposing we're successful once every 10 newidle balances, the sd->newidle_balance_cost gets inflated by a factor of 10 -- for we'd count 10 of them before 'success'. However, when we're idle for that amount of time (10 times longer than it takes to do a single newidle balance) we'd still only do a single newidle balance, not 10. Could we prevent that downside by measuring both the time spent idle and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Say, have the CPU never spend more than 10% of its idle time in the idle balancing code, as averaged over some time period? That way we might still do idle balancing every X idle periods, even if the idle periods themselves are relatively short. It might also be enough to prevent excessive lock contention triggered by the idle balancing code, though I have to admit I did not really think that part through :)
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N. Okay, I'll try this out. Thank you for your suggestions. Jason.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N. I ran a few AIM7 workloads for the 8 socket HT-enabled case, and I needed to set N to more than 20 in order to get the big performance gains. One thing that I thought of was to have N be based on how often idle balance attempts do not pull task(s). For example, N can be calculated based on the number of idle balance attempts for the CPU since the last successful idle balance attempt. So if the previous 30 idle balance attempts resulted in no tasks moved, then N = 30 / 5. So idle balance gets less time to run as the number of unneeded idle balance attempts increases, and thus N will not be set too high during situations where idle balancing is successful more often. Any comments on this idea?
Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > > there can be a lot of contention in idle_balance() and related functions. > > On many AIM7 workloads in which CPUs go idle very often and idle balance > > gets called a lot, it is actually lowering performance. > > > > Since idle balance often helps performance (when it is not overused), I > > looked into trying to avoid attempting idle balance only when it is > > occurring too frequently. > > > > This RFC patch attempts to keep track of the approximate "average" time > > between > > idle balance attempts per CPU. Each time the idle_balance() function is > > invoked, it will compute the duration since the last idle_balance() for > > the current CPU. The avg time between idle balance attempts is then updated > > using a very similar method as how rq->avg_idle is computed. > > > > Once the average time between idle balance attempts drops below a certain > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance > > for that CPU will be skipped. The average time between idle balances will > > continue to be updated, even if it ends up getting skipped. The > > initial/maximum average is set a lot higher though to make sure that the > > avg doesn't fall below the threshold until the sample size is large and to > > prevent the avg from being overestimated. > > One of the things I've been talking about for a while now is how I'd > like to use the idle guestimator used for cpuidle for newidle balance. > > Basically based on the estimated idle time limit how far/wide you'll > search for tasks to run. > > You can remove the sysctl and auto-tune by measuring how long it takes > on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > there can be a lot of contention in idle_balance() and related functions. > On many AIM7 workloads in which CPUs go idle very often and idle balance > gets called a lot, it is actually lowering performance. > > Since idle balance often helps performance (when it is not overused), I > looked into trying to avoid attempting idle balance only when it is > occurring too frequently. > > This RFC patch attempts to keep track of the approximate "average" time > between > idle balance attempts per CPU. Each time the idle_balance() function is > invoked, it will compute the duration since the last idle_balance() for > the current CPU. The avg time between idle balance attempts is then updated > using a very similar method as how rq->avg_idle is computed. > > Once the average time between idle balance attempts drops below a certain > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance > for that CPU will be skipped. The average time between idle balances will > continue to be updated, even if it ends up getting skipped. The > initial/maximum average is set a lot higher though to make sure that the > avg doesn't fall below the threshold until the sample size is large and to > prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. Also, its not the time between idles that's interesting, but how long you're going to be idle. If you track the time it takes to newidle balance per sched-domain you traverse, you're almost there. 
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/16/2013 03:21 PM, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate "average" time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq->avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. This change improved the performance of many AIM7 workloads at 1, 2, 4, 8 sockets on the 3.10 kernel. The most significant differences were at 8 sockets HT-enabled. The table below compares the average jobs per minute at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this patch. I included data for both hyperthreading disabled and enabled. I used numactl to restrict AIM7 to run on certain number of nodes. I only included data in which the % difference was beyond a 2% noise range. 
Reviewed-by: Rik van Riel -- All rights reversed
[RFC] sched: Limit idle_balance() when it is being used too frequently
When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate "average" time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq->avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. This change improved the performance of many AIM7 workloads at 1, 2, 4, 8 sockets on the 3.10 kernel. The most significant differences were at 8 sockets HT-enabled. The table below compares the average jobs per minute at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this patch. I included data for both hyperthreading disabled and enabled. I used numactl to restrict AIM7 to run on certain number of nodes. I only included data in which the % difference was beyond a 2% noise range. 
-- 1 socket --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
disk         | +17.7%         | +4.7%
high_systime | +2.9%          | -

-- 2 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | -              | +2.3%
disk         | +10.5%         | -
fserver      | +3.6%          | -
new_fserver  | +3.7%          | -

-- 4 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | +3.7%          | -
custom       | -2.2%          | +14.0%
fserver      | +2.8%          | -
high_systime | -3.6%          | +18.7%
new_fserver  | +3.4%          | -

-- 8 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | +4.4%          | +13.3%
custom       | +8.1%          | +15.2%