Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Fri, 2013-07-19 at 20:37 +0200, Peter Zijlstra wrote:
> On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:
> > N = 1
> > -----
> > 19.21%  reaim    [k] __read_lock_failed
> > 14.79%  reaim    [k] mspin_lock
> > 12.19%  reaim    [k] __write_lock_failed
> >  7.87%  reaim    [k] _raw_spin_lock
> >  2.03%  reaim    [k] start_this_handle
> >  1.98%  reaim    [k] update_sd_lb_stats
> >  1.92%  reaim    [k] mutex_spin_on_owner
> >  1.86%  reaim    [k] update_cfs_rq_blocked_load
> >  1.14%  swapper  [k] intel_idle
> >  1.10%  reaim    [.] add_long
> >  1.09%  reaim    [.] add_int
> >  1.08%  reaim    [k] load_balance
>
> But but but but.. wth is causing this? The only thing we do more of with
> N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
> and or mspin_lock() contention like that.
>
> There shouldn't be a rwlock_t in the entire scheduler; those things suck
> worse than quicksand.
>
> If, as Rik thought, we'd have more rq->lock contention, then I'd have
> expected _raw_spin_lock to be up highest.

For this particular fserver workload, that mutex was acquired in the
function calls from ext4_orphan_add() and ext4_orphan_del(). Those read
and write lock calls were from start_this_handle(). Although these
functions are not called within the idle_balance() code path,
update_sd_lb_stats(), tg_load_down(), idle_cpu(), spin_lock(), etc.
increase the time spent in the kernel, and that appears to be indirectly
causing more time to be spent acquiring those other kernel locks.

Thanks,
Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, Jul 18, 2013 at 12:06:39PM -0700, Jason Low wrote:
> N = 1
> -----
> 19.21%  reaim    [k] __read_lock_failed
> 14.79%  reaim    [k] mspin_lock
> 12.19%  reaim    [k] __write_lock_failed
>  7.87%  reaim    [k] _raw_spin_lock
>  2.03%  reaim    [k] start_this_handle
>  1.98%  reaim    [k] update_sd_lb_stats
>  1.92%  reaim    [k] mutex_spin_on_owner
>  1.86%  reaim    [k] update_cfs_rq_blocked_load
>  1.14%  swapper  [k] intel_idle
>  1.10%  reaim    [.] add_long
>  1.09%  reaim    [.] add_int
>  1.08%  reaim    [k] load_balance

But but but but.. wth is causing this? The only thing we do more of with
N=1 is idle_balance(); where would that cause __{read,write}_lock_failed
and or mspin_lock() contention like that.

There shouldn't be a rwlock_t in the entire scheduler; those things suck
worse than quicksand.

If, as Rik thought, we'd have more rq->lock contention, then I'd have
expected _raw_spin_lock to be up highest.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, 2013-07-18 at 07:59 -0400, Rik van Riel wrote:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> > > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > > needed to set N to more than 20 in order to get the big performance
> > > gains.
> > >
> > > One thing that I thought of was to have N be based on how often idle
> > > balance attempts do not pull task(s).
> > >
> > > For example, N can be calculated based on the number of idle balance
> > > attempts for the CPU since the last "successful" idle balance attempt.
> > > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > > then n = 30 / 5. So idle balance gets less time to run as the number of
> > > unneeded idle balance attempts increases, and thus N will not be set
> > > too high during situations where idle balancing is "successful" more
> > > often. Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such a high
> > N. But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.
>
> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

I collected some perf samples when running fserver with N=1 and N=60.

N = 1
-----
19.21%  reaim    [k] __read_lock_failed
14.79%  reaim    [k] mspin_lock
12.19%  reaim    [k] __write_lock_failed
 7.87%  reaim    [k] _raw_spin_lock
 2.03%  reaim    [k] start_this_handle
 1.98%  reaim    [k] update_sd_lb_stats
 1.92%  reaim    [k] mutex_spin_on_owner
 1.86%  reaim    [k] update_cfs_rq_blocked_load
 1.14%  swapper  [k] intel_idle
 1.10%  reaim    [.] add_long
 1.09%  reaim    [.] add_int
 1.08%  reaim    [k] load_balance

N = 60
------
 7.70%  reaim    [k] _raw_spin_lock
 7.25%  reaim    [k] mspin_lock
 6.30%  reaim    [.] add_long
 6.26%  reaim    [.] add_int
 4.05%  reaim    [.] strncat
 3.81%  reaim    [.] string_rtns_1
 3.66%  reaim    [.] div_long
 3.44%  reaim    [k] mutex_spin_on_owner
 2.91%  reaim    [.] add_short
 2.73%  swapper  [k] intel_idle
 2.65%  reaim    [k] __read_lock_failed

With idle_balance(), we get more contention in kernel functions such as
update_sd_lb_stats(), load_balance(), and spin_lock() for the rq lock.
Additionally, it greatly increases the time spent in the mutex's
mspin_lock(), __read_lock_failed(), and __write_lock_failed().

N needs to be large because the avg_idle time is still a lot higher than
the average time spent in each load_balance() call per sched domain.
Despite that high ratio of avg_idle time to time spent in
load_balance(), load_balance() still increases the time spent in the
kernel by quite a bit, probably because of how often it is being used.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, 2013-07-18 at 17:42 +0530, Srikar Dronamraju wrote:
> > > idle_balance(u64 idle_duration)
> > > {
> > > 	u64 cost = 0;
> > >
> > > 	for_each_domain(sd) {
> > > 		if (cost + sd->cost > idle_duration/N)
> > > 			break;
> > >
> > > 		...
> > >
> > > 		sd->cost = (sd->cost + this_cost) / 2;
> > > 		cost += this_cost;
> > > 	}
> > > }
> > >
> > > I would've initially suggested using something like N=2 since we're
> > > dealing with averages and half should ensure we don't run over except
> > > for the worst peaks. But we could easily use a bigger N.
> >
> > I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> > to set N to more than 20 in order to get the big performance gains.
>
> As per your observation, newly idle balancing isn't picking tasks and is
> mostly finding the domains already balanced. find_busiest_queue() is
> under rcu. So where and how are we getting these performance gains?

I actually just ran fserver on 8 sockets (a workload whose performance
idle balancing lowers at this socket count), and for this workload, idle
balancing is finding tasks to move fairly often on a per-CPU basis. So I
guess it is not always the case that idle balancing isn't moving tasks
on this box.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
* Peter Zijlstra [2013-07-18 14:35:31]:
> On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> > We take locks if and only if we see an imbalance and want to pull
> > tasks. However, if the newly idle balance is not finding an imbalance,
> > then this may not be an issue.
> >
> > Probably /proc/schedstat will give a better picture.
>
> Right, so we're interested in move_tasks() calls that fail to 'deliver'.
> There's a few conditions in there that can cause us to not move a task,
> most of them not counted.
>
> The few that are counted come from can_migrate_task():
>
> 	se.statistics.nr_failed_migrations_affine
> 	se.statistics.nr_failed_migrations_running
> 	se.statistics.nr_failed_migrations_hot
>
> If we see significant increments on those, we'll be taking locks.

Agreed. I think the number of times no busy group was found, and the
number of times no busy queue was found, will also tell us that locks
are not being taken. In schedstats, I generally see them as the
overwhelming majority.

> The only one I can see a good way around is the hot one; we could ignore
> hotness in favour of newidle -- although I could see that being
> detrimental, we'll just have to try or so ;-)
>
> _running shouldn't be much of a problem since we don't bother if
> nr_running <= 1. And _affine is out of our reach anyway.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Thu, Jul 18, 2013 at 05:45:46PM +0530, Srikar Dronamraju wrote:
> We take locks if and only if we see an imbalance and want to pull
> tasks. However, if the newly idle balance is not finding an imbalance,
> then this may not be an issue.
>
> Probably /proc/schedstat will give a better picture.

Right, so we're interested in move_tasks() calls that fail to 'deliver'.
There's a few conditions in there that can cause us to not move a task,
most of them not counted.

The few that are counted come from can_migrate_task():

	se.statistics.nr_failed_migrations_affine
	se.statistics.nr_failed_migrations_running
	se.statistics.nr_failed_migrations_hot

If we see significant increments on those, we'll be taking locks.

The only one I can see a good way around is the hot one; we could ignore
hotness in favour of newidle -- although I could see that being
detrimental, we'll just have to try or so ;-)

_running shouldn't be much of a problem since we don't bother if
nr_running <= 1. And _affine is out of our reach anyway.
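[The three counted failure modes above can be illustrated with a small
user-space sketch. The function and enum names here are invented for
illustration; the real logic is the kernel's can_migrate_task() in
kernel/sched/fair.c, and the "ignore hotness for newidle" branch encodes
Peter's suggestion, not current behaviour.]

```c
#include <stdbool.h>

/* Invented names, illustration only. Each failure branch corresponds to
 * one of the se.statistics.nr_failed_migrations_* counters Peter lists. */
enum migrate_result {
	MIGRATE_OK,
	FAIL_AFFINE,	/* nr_failed_migrations_affine */
	FAIL_RUNNING,	/* nr_failed_migrations_running */
	FAIL_HOT,	/* nr_failed_migrations_hot */
};

enum migrate_result can_migrate_sim(bool cpu_allowed, bool task_running,
				    bool cache_hot, bool newidle)
{
	if (!cpu_allowed)
		return FAIL_AFFINE;	/* dst CPU not in the task's affinity mask */
	if (task_running)
		return FAIL_RUNNING;	/* task is currently on a CPU */
	/* Peter's idea: for newidle balancing, don't let cache hotness
	 * block the migration. */
	if (cache_hot && !newidle)
		return FAIL_HOT;
	return MIGRATE_OK;
}
```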
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
* Rik van Riel [2013-07-18 07:59:22]:
> On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> > On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> >
> > > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > > needed to set N to more than 20 in order to get the big performance
> > > gains.
> > >
> > > One thing that I thought of was to have N be based on how often idle
> > > balance attempts do not pull task(s).
> > >
> > > For example, N can be calculated based on the number of idle balance
> > > attempts for the CPU since the last "successful" idle balance attempt.
> > > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > > then n = 30 / 5. So idle balance gets less time to run as the number of
> > > unneeded idle balance attempts increases, and thus N will not be set
> > > too high during situations where idle balancing is "successful" more
> > > often. Any comments on this idea?
> >
> > It would be good to get a solid explanation for why we need such a high
> > N. But yes that might work.
>
> I have some idea, though no proof :)
>
> I suspect a lot of the idle balancing time is spent waiting for
> and acquiring the runqueue locks of remote CPUs.

We take locks if and only if we see an imbalance and want to pull
tasks. However, if the newly idle balance is not finding an imbalance,
then this may not be an issue.

Probably /proc/schedstat will give a better picture.

> If we spend half our idle time causing contention to remote
> runqueue locks, we could be a big factor in keeping those other
> CPUs from getting work done.

--
Thanks and Regards
Srikar Dronamraju
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
> > idle_balance(u64 idle_duration)
> > {
> > 	u64 cost = 0;
> >
> > 	for_each_domain(sd) {
> > 		if (cost + sd->cost > idle_duration/N)
> > 			break;
> >
> > 		...
> >
> > 		sd->cost = (sd->cost + this_cost) / 2;
> > 		cost += this_cost;
> > 	}
> > }
> >
> > I would've initially suggested using something like N=2 since we're
> > dealing with averages and half should ensure we don't run over except
> > for the worst peaks. But we could easily use a bigger N.
>
> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.

As per your observation, newly idle balancing isn't picking tasks and is
mostly finding the domains already balanced. find_busiest_queue() is
under rcu. So where and how are we getting these performance gains?

Is it that tasks are getting woken up and queued while the cpu is doing
newly idle load balance? Or is it that the regular CPU_IDLE balancing
which follows idle_balance() does more aggressive balancing and hence is
able to find a task to balance?

--
Thanks and Regards
Srikar Dronamraju
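[Peter's pseudocode quoted above can be fleshed out into a user-space
simulation. The fixed domain array, the struct name, and the choice of
N=10 are invented for illustration; in the kernel, sd->cost would be a
measured running average per sched domain and the loop would walk the
real domain hierarchy.]

```c
#define N 10	/* budget: at most 1/N of the idle time goes to balancing */

struct domain_sim {
	unsigned long long cost;	/* running average balance cost here */
};

/* Returns how many domains fit within the idle-time budget, mirroring
 * the early break in Peter's idle_balance() sketch. */
int idle_balance_sim(struct domain_sim *sd, int nr_domains,
		     unsigned long long idle_duration)
{
	unsigned long long cost = 0;
	int visited = 0;

	for (int i = 0; i < nr_domains; i++) {
		if (cost + sd[i].cost > idle_duration / N)
			break;

		/* stand-in for the time this balance pass actually took;
		 * here we just reuse the stored average */
		unsigned long long this_cost = sd[i].cost;

		/* halving running average, as in the pseudocode */
		sd[i].cost = (sd[i].cost + this_cost) / 2;
		cost += this_cost;
		visited++;
	}
	return visited;
}
```

With idle_duration = 1000 and N = 10 the budget is 100, so domains
costing 30 and 50 fit but a third costing 40 is skipped.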
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/18/2013 05:32 AM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
>
> > I ran a few AIM7 workloads for the 8 socket HT enabled case and I
> > needed to set N to more than 20 in order to get the big performance
> > gains.
> >
> > One thing that I thought of was to have N be based on how often idle
> > balance attempts do not pull task(s).
> >
> > For example, N can be calculated based on the number of idle balance
> > attempts for the CPU since the last "successful" idle balance attempt.
> > So if the previous 30 idle balance attempts resulted in no tasks moved,
> > then n = 30 / 5. So idle balance gets less time to run as the number of
> > unneeded idle balance attempts increases, and thus N will not be set
> > too high during situations where idle balancing is "successful" more
> > often. Any comments on this idea?
>
> It would be good to get a solid explanation for why we need such a high
> N. But yes that might work.

I have some idea, though no proof :)

I suspect a lot of the idle balancing time is spent waiting for
and acquiring the runqueue locks of remote CPUs.

If we spend half our idle time causing contention to remote
runqueue locks, we could be a big factor in keeping those other
CPUs from getting work done.

--
All rights reversed
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:02:24PM -0700, Jason Low wrote:
> I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
> to set N to more than 20 in order to get the big performance gains.
>
> One thing that I thought of was to have N be based on how often idle
> balance attempts do not pull task(s).
>
> For example, N can be calculated based on the number of idle balance
> attempts for the CPU since the last "successful" idle balance attempt.
> So if the previous 30 idle balance attempts resulted in no tasks moved,
> then n = 30 / 5. So idle balance gets less time to run as the number of
> unneeded idle balance attempts increases, and thus N will not be set too
> high during situations where idle balancing is "successful" more often.
> Any comments on this idea?

It would be good to get a solid explanation for why we need such a high
N. But yes, that might work.
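[Jason's heuristic, as quoted above, can be sketched as a small pure
function. The divisor 5 comes from his "n = 30 / 5" example; the clamp
to a minimum of 2 is an assumption here, echoing Peter's initial N=2
suggestion rather than anything in the thread.]

```c
/* Illustrative sketch only, not kernel code: scale N with the number of
 * consecutive idle-balance attempts that moved no task, so a CPU whose
 * idle balancing keeps failing gets a smaller time budget. */
unsigned int compute_n(unsigned int failed_attempts_since_success)
{
	unsigned int n = failed_attempts_since_success / 5;

	/* assumed floor: never below Peter's suggested baseline of N=2 */
	if (n < 2)
		n = 2;
	return n;
}
```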
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> > On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> > > >So the way I see things is that the only way newidle balance can slow down
> > > >things is if it runs when we could have ran something useful.
> >
> > Due to contention on the runqueue locks of other CPUs,
> > newidle also has the potential to keep _others_ from
> > running something useful.
>
> Right, although that should only happen when we do have an imbalance and want
> to go move something. Which in Jason's case is 'rare'. But yes, I suppose
> there's other scenarios where this is far more likely.
>
> > Could we prevent that downside by measuring both the
> > time spent idle, and the time spent in idle balancing,
> > and making sure the idle balancing time never exceeds
> > more than N% of the idle time?
>
> Sure:
>
> idle_balance(u64 idle_duration)
> {
> 	u64 cost = 0;
>
> 	for_each_domain(sd) {
> 		if (cost + sd->cost > idle_duration/N)
> 			break;
>
> 		...
>
> 		sd->cost = (sd->cost + this_cost) / 2;
> 		cost += this_cost;
> 	}
> }
>
> I would've initially suggested using something like N=2 since we're dealing
> with averages and half should ensure we don't run over except for the worst
> peaks. But we could easily use a bigger N.

I ran a few AIM7 workloads for the 8 socket HT enabled case and I needed
to set N to more than 20 in order to get the big performance gains.

One thing that I thought of was to have N be based on how often idle
balance attempts do not pull task(s). For example, N can be calculated
based on the number of idle balance attempts for the CPU since the last
"successful" idle balance attempt. So if the previous 30 idle balance
attempts resulted in no tasks moved, then n = 30 / 5. So idle balance
gets less time to run as the number of unneeded idle balance attempts
increases, and thus N will not be set too high during situations where
idle balancing is "successful" more often.

Any comments on this idea?

Thanks,
Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> > On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> > > >So the way I see things is that the only way newidle balance can slow down
> > > >things is if it runs when we could have ran something useful.
> >
> > Due to contention on the runqueue locks of other CPUs,
> > newidle also has the potential to keep _others_ from
> > running something useful.
>
> Right, although that should only happen when we do have an imbalance and want
> to go move something. Which in Jason's case is 'rare'. But yes, I suppose
> there's other scenarios where this is far more likely.
>
> > Could we prevent that downside by measuring both the
> > time spent idle, and the time spent in idle balancing,
> > and making sure the idle balancing time never exceeds
> > more than N% of the idle time?
>
> Sure:
>
> idle_balance(u64 idle_duration)
> {
> 	u64 cost = 0;
>
> 	for_each_domain(sd) {
> 		if (cost + sd->cost > idle_duration/N)
> 			break;
>
> 		...
>
> 		sd->cost = (sd->cost + this_cost) / 2;
> 		cost += this_cost;
> 	}
> }
>
> I would've initially suggested using something like N=2 since we're dealing
> with averages and half should ensure we don't run over except for the worst
> peaks. But we could easily use a bigger N.

Okay, I'll try this out. Thank you for your suggestions.

Jason.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote:
> On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> >So the way I see things is that the only way newidle balance can slow down
> >things is if it runs when we could have ran something useful.
>
> Due to contention on the runqueue locks of other CPUs,
> newidle also has the potential to keep _others_ from
> running something useful.

Right, although that should only happen when we do have an imbalance and want
to go move something. Which in Jason's case is 'rare'. But yes, I suppose
there's other scenarios where this is far more likely.

> Could we prevent that downside by measuring both the
> time spent idle, and the time spent in idle balancing,
> and making sure the idle balancing time never exceeds
> more than N% of the idle time?

Sure:

idle_balance(u64 idle_duration)
{
	u64 cost = 0;

	for_each_domain(sd) {
		if (cost + sd->cost > idle_duration/N)
			break;

		...

		sd->cost = (sd->cost + this_cost) / 2;
		cost += this_cost;
	}
}

I would've initially suggested using something like N=2 since we're dealing
with averages and half should ensure we don't run over except for the worst
peaks. But we could easily use a bigger N.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/17/2013 12:18 PM, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
> > Do you think its worth a try to consider each newidle balance attempt
> > as the total load_balance attempts until it is able to move a task, and
> > then skip balancing within the domain if a CPU's avg idle time is less
> > than that avg time doing newidle balance?
>
> So the way I see things is that the only way newidle balance can slow down
> things is if it runs when we could have ran something useful.

Due to contention on the runqueue locks of other CPUs, newidle also has
the potential to keep _others_ from running something useful.

> So all we need to ensure is to not run longer than we expect to be idle
> for and things should be 'free', right?

Could we prevent that downside by measuring both the time spent idle, and
the time spent in idle balancing, and making sure the idle balancing time
never exceeds more than N% of the idle time?

Say, have the CPU never spend more than 10% of its idle time in the idle
balancing code, as averaged over some time period? That way we might
still do idle balancing every X idle periods, even if the idle periods
themselves are relatively short.

It might also be enough to prevent excessive lock contention triggered by
the idle balancing code, though I have to admit I did not really think
that part through :)
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote:
>
> So if we have the following:
>
> for_each_domain(sd)
> 	before = sched_clock_cpu
> 	load_balance(sd)
> 	after = sched_clock_cpu
> 	idle_balance_completion_time = after - before
>
> At this point, the "idle_balance_completion_time" is usually a very
> small value and is usually a lot smaller than the avg CPU idle time.
> However, the vast majority of the time, load_balance returns 0.

I think the interesting question here is: is it significantly more when
we do find a task?

I would also expect sd->newidle_balance_cost (less typing there) to scale
with the number of CPUs in the domain - thus larger domains will take
longer etc. And (obviously) the cost of the entire newidle balance is the
direct sum of individual domain costs.

> Do you think its worth a try to consider each newidle balance attempt as
> the total load_balance attempts until it is able to move a task, and
> then skip balancing within the domain if a CPU's avg idle time is less
> than that avg time doing newidle balance?

So the way I see things is that the only way newidle balance can slow
down things is if it runs when we could have ran something useful.

So all we need to ensure is to not run longer than we expect to be idle
for and things should be 'free', right?

So the problem I have with your proposal is that supposing we're
successful once every 10 newidle balances. Then the
sd->newidle_balance_cost gets inflated by a factor 10 -- for we'd count
10 of them before 'success'.

However when we're idle for that amount of time (10 times longer than it
takes to do a single newidle balance) we'd still only do a single newidle
balance, not 10.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
Hi Peter,

On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote:
> On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> > For the more complex model, are you suggesting that each completion
> > time is the time it takes to complete 1 iteration of the
> > for_each_domain() loop?
>
> Per sd, yes? So higher domains (or lower depending on how you model the
> thing in your head) have bigger CPU spans, and thus take longer to
> complete. Imagine the top domain of a 4096 cpu system, it would go look
> at all cpus to see if it could find a task.
>
> > Based on some of the data I collected, a single iteration of the
> > for_each_domain() loop is almost always significantly lower than the
> > approximate CPU idle time, even in workloads where idle_balance is
> > lowering performance. The bigger issue is that it takes so many of
> > these attempts before idle_balance actually "worked" and pulls a task.
>
> I'm confused, so:
>
> schedule()
>   if (!rq->nr_running)
>     idle_balance()
>       for_each_domain(sd)
>         load_balance(sd)
>
> is the entire thing, there's no other loop in there.

So if we have the following:

for_each_domain(sd)
	before = sched_clock_cpu
	load_balance(sd)
	after = sched_clock_cpu
	idle_balance_completion_time = after - before

At this point, the "idle_balance_completion_time" is usually a very small
value and is usually a lot smaller than the avg CPU idle time. However,
the vast majority of the time, load_balance returns 0.

> > I initially was thinking about each "completion time" of an idle
> > balance as the sum total of the times of all iterations to complete
> > until a task is successfully pulled within each domain.
>
> So you're saying that normally idle_balance() won't find a task to pull?
> And we need many times going newidle before we do get something?

Yes, a while ago, I collected some data on the rate in which
idle_balance() does not pull tasks, and it was a very high number.

> Wouldn't this mean that there simply weren't enough tasks to keep all
> cpus busy?

If I remember correctly, in a lot of those load_balance attempts when the
machine is under a high Java load, there were no "imbalance" between the
groups in each sched_domain.

> If there were tasks we could've pulled, we might need to look at why
> they weren't and maybe fix that. Now it could be that it thinks this
> cpu, even with the (little) idle time it has, is sufficiently loaded and
> we'll get a 'local' wakeup soon enough. That's perfectly fine.
>
> What we should avoid is spending more time looking for tasks than we
> have idle, since that reduces the total time we can spend doing useful
> work. So that is I think the critical cut-off point.

Do you think its worth a try to consider each newidle balance attempt as
the total load_balance attempts until it is able to move a task, and then
skip balancing within the domain if a CPU's avg idle time is less than
that avg time doing newidle balance?

Thanks,
Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote:
> For the more complex model, are you suggesting that each completion
> time is the time it takes to complete 1 iteration of the
> for_each_domain() loop?

Per sd, yes? So higher domains (or lower depending on how you model the
thing in your head) have bigger CPU spans, and thus take longer to
complete. Imagine the top domain of a 4096 cpu system, it would go look
at all cpus to see if it could find a task.

> Based on some of the data I collected, a single iteration of the
> for_each_domain() loop is almost always significantly lower than the
> approximate CPU idle time, even in workloads where idle_balance is
> lowering performance. The bigger issue is that it takes so many of
> these attempts before idle_balance actually "worked" and pulls a task.

I'm confused, so:

schedule()
  if (!rq->nr_running)
    idle_balance()
      for_each_domain(sd)
        load_balance(sd)

is the entire thing, there's no other loop in there.

> I initially was thinking about each "completion time" of an idle
> balance as the sum total of the times of all iterations to complete
> until a task is successfully pulled within each domain.

So you're saying that normally idle_balance() won't find a task to pull?
And we need many times going newidle before we do get something?

Wouldn't this mean that there simply weren't enough tasks to keep all
cpus busy?

If there were tasks we could've pulled, we might need to look at why they
weren't and maybe fix that. Now it could be that it thinks this cpu, even
with the (little) idle time it has, is sufficiently loaded and we'll get
a 'local' wakeup soon enough. That's perfectly fine.

What we should avoid is spending more time looking for tasks than we have
idle, since that reduces the total time we can spend doing useful work.
So that is I think the critical cut-off point.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 09:25 +0200, Peter Zijlstra wrote:
> On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote:
> > On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote:
> > > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote:
> > > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel,
> > > > there can be a lot of contention in idle_balance() and related functions.
> > > > On many AIM7 workloads in which CPUs go idle very often and idle balance
> > > > gets called a lot, it is actually lowering performance.
> > > >
> > > > Since idle balance often helps performance (when it is not overused), I
> > > > looked into trying to avoid attempting idle balance only when it is
> > > > occurring too frequently.
> > > >
> > > > This RFC patch attempts to keep track of the approximate "average" time between
> > > > idle balance attempts per CPU. Each time the idle_balance() function is
> > > > invoked, it will compute the duration since the last idle_balance() for
> > > > the current CPU. The avg time between idle balance attempts is then updated
> > > > using a very similar method as how rq->avg_idle is computed.
> > > >
> > > > Once the average time between idle balance attempts drops below a certain
> > > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance
> > > > for that CPU will be skipped. The average time between idle balances will
> > > > continue to be updated, even if it ends up getting skipped. The
> > > > initial/maximum average is set a lot higher though to make sure that the
> > > > avg doesn't fall below the threshold until the sample size is large and to
> > > > prevent the avg from being overestimated.
> > >
> > > One of the things I've been talking about for a while now is how I'd
> > > like to use the idle guestimator used for cpuidle for newidle balance.
> > >
> > > Basically based on the estimated idle time limit how far/wide you'll
> > > search for tasks to run.
> > >
> > > You can remove the sysctl and auto-tune by measuring how long it takes
> > > on avg to do a newidle balance.
> >
> > Hi Peter,
> >
> > When you say how long it takes on avg to do a newidle balance, are you
> > referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> > load_balance() to complete, or the avg time it takes for newidle balance
> > attempts within a domain to eventually successfully pull/move a task(s)?
>
> Both :-), being as the completion time would be roughly equivalent for the
> top domain and the entire call.
>
> So I suppose I was somewhat unclear :-) I initially started out with a
> simpler model, where you measure the avg time of the entire
> idle_balance() call and measure the avg idle time and compare the two.
>
> Then I progressed to the more complex model where you measure the
> completion time of each domain in the for_each_domain() iteration of
> idle_balance() and compare that against the estimated idle time, bailing
> out of the domain iteration when the avg completion time exceeds the
> expected idle time.

Hi Peter,

For the more complex model, are you suggesting that each completion time
is the time it takes to complete 1 iteration of the for_each_domain()
loop?

Based on some of the data I collected, a single iteration of the
for_each_domain() loop is almost always significantly lower than the
approximate CPU idle time, even in workloads where idle_balance is
lowering performance. The bigger issue is that it takes so many of these
attempts before idle_balance actually "worked" and pulls a task.

I initially was thinking about each "completion time" of an idle balance
as the sum total of the times of all iterations to complete until a task
is successfully pulled within each domain.

Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:25:04AM +0200, Peter Zijlstra wrote:
> One thing that I thought of since is that we need to consider what
> happens for people with a low resolution sched_clock. IIRC there are
> still platforms that are jiffy based.

Ignore that, they're all UP.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: > On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: > > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > > > there can be a lot of contention in idle_balance() and related functions. > > > On many AIM7 workloads in which CPUs go idle very often and idle balance > > > gets called a lot, it is actually lowering performance. > > > > > > Since idle balance often helps performance (when it is not overused), I > > > looked into trying to avoid attempting idle balance only when it is > > > occurring too frequently. > > > > > > This RFC patch attempts to keep track of the approximate "average" time > > > between > > > idle balance attempts per CPU. Each time the idle_balance() function is > > > invoked, it will compute the duration since the last idle_balance() for > > > the current CPU. The avg time between idle balance attempts is then > > > updated > > > using a very similar method as how rq->avg_idle is computed. > > > > > > Once the average time between idle balance attempts drops below a certain > > > value (which in this patch is sysctl_sched_idle_balance_limit), > > > idle_balance > > > for that CPU will be skipped. The average time between idle balances will > > > continue to be updated, even if it ends up getting skipped. The > > > initial/maximum average is set a lot higher though to make sure that the > > > avg doesn't fall below the threshold until the sample size is large and to > > > prevent the avg from being overestimated. > > > > One of the things I've been talking about for a while now is how I'd > > like to use the idle guestimator used for cpuidle for newidle balance. > > > > Basically based on the estimated idle time limit how far/wide you'll > > search for tasks to run. > > > > You can remove the sysctl and auto-tune by measuring how long it takes > > on avg to do a newidle balance. 
>
> Hi Peter,
>
> When you say how long it takes on avg to do a newidle balance, are you
> referring to the avg time it takes for each call to CPU_NEWLY_IDLE
> load_balance() to complete, or the avg time it takes for newidle balance
> attempts within a domain to eventually successfully pull/move a task(s)?

Both :-), being as the completion time would be roughly equivalent for the
top domain and the entire call.

So I suppose I was somewhat unclear :-) I initially started out with a
simpler model, where you measure the avg time of the entire
idle_balance() call and measure the avg idle time and compare the two.

Then I progressed to the more complex model where you measure the
completion time of each domain in the for_each_domain() iteration of
idle_balance() and compare that against the estimated idle time, bailing
out of the domain iteration when the avg completion time exceeds the
expected idle time.

One thing that I thought of since is that we need to consider what
happens for people with a low resolution sched_clock. IIRC there are
still platforms that are jiffy based.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate average time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq-avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Both :-), being as the completion time would be roughly equivalent for the top domain and the entire call. So I suppose I was somewhat unclear :-) I initially started out with a simpler model, where you measure the avg time of the entire idle_balance() call and measure the avg idle time and compare the two. Then I progressed to the more complex model where you measure the completion time of each domain in the for_each_domain() iteration of idle_balance() and compare that against the estimated idle time, bailing out of the domain iteration when the avg completion time exceeds the expected idle time. One thing that I thought of since is that we need to consider what happens for people with a low resolution sched_clock. IIRC there are still platforms that are jiffy based. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 09:25:04AM +0200, Peter Zijlstra wrote: One thing that I thought of since is that we need to consider what happens for people with a low resolution sched_clock. IIRC there are still platforms that are jiffy based. Ignore that, they're all UP. -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 09:25 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 03:48:01PM -0700, Jason Low wrote: On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate average time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq-avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Both :-), being as the completion time would be roughly equivalent for the top domain and the entire call. So I suppose I was somewhat unclear :-) I initially started out with a simpler model, where you measure the avg time of the entire idle_balance() call and measure the avg idle time and compare the two. Then I progressed to the more complex model where you measure the completion time of each domain in the for_each_domain() iteration of idle_balance() and compare that against the estimated idle time, bailing out of the domain iteration when the avg completion time exceeds the expected idle time. Hi Peter, For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Based on some of the data I collected, a single iteration of the for_each_domain() loop is almost always significantly lower than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually worked and pulls a tasks. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. Jason -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote: For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Per sd, yes? So higher domains (or lower depending on how you model the thing in you head) have bigger CPU spans, and thus take longer to complete. Imagine the top domain of a 4096 cpu system, it would go look at all cpus to see if it could find a task. Based on some of the data I collected, a single iteration of the for_each_domain() loop is almost always significantly lower than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually worked and pulls a tasks. I'm confused, so: schedule() if (!rq-nr_running) idle_balance() for_each_domain(sd) load_balance(sd) is the entire thing, there's no other loop in there. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. So you're saying that normally idle_balance() won't find a task to pull? And we need many times going newidle before we do get something? Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy? If there were tasks we could've pulled, we might need to look at why they weren't and maybe fix that. Now it could be that it things this cpu, even with the (little) idle time it has is sufficiently loaded and we'll get a 'local' wakeup soon enough. That's perfectly fine. What we should avoid is spending more time looking for tasks then we have idle, since that reduces the total time we can spend doing useful work. So that is I think the critical cut-off point. 
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
Hi Peter, On Wed, 2013-07-17 at 11:39 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:11:41AM -0700, Jason Low wrote: For the more complex model, are you suggesting that each completion time is the time it takes to complete 1 iteration of the for_each_domain() loop? Per sd, yes? So higher domains (or lower, depending on how you model the thing in your head) have bigger CPU spans, and thus take longer to complete. Imagine the top domain of a 4096 cpu system; it would go look at all cpus to see if it could find a task. Based on some of the data I collected, a single iteration of the for_each_domain() loop almost always takes significantly less time than the approximate CPU idle time, even in workloads where idle_balance is lowering performance. The bigger issue is that it takes so many of these attempts before idle_balance actually works and pulls a task. I'm confused, so:

  schedule()
    if (!rq->nr_running)
      idle_balance()
        for_each_domain(sd)
          load_balance(sd)

is the entire thing; there's no other loop in there. So if we have the following:

  for_each_domain(sd) {
      before = sched_clock_cpu()
      load_balance(sd)
      after = sched_clock_cpu()
      idle_balance_completion_time = after - before
  }

At this point, the idle_balance_completion_time is usually a very small value, usually a lot smaller than the avg CPU idle time. However, the vast majority of the time, load_balance() returns 0. I initially was thinking about each completion time of an idle balance as the sum total of the times of all iterations to complete until a task is successfully pulled within each domain. So you're saying that normally idle_balance() won't find a task to pull? And we need to go newidle many times before we do get something? Yes, a while ago I collected some data on the rate at which idle_balance() does not pull tasks, and it was a very high number. Wouldn't this mean that there simply weren't enough tasks to keep all cpus busy?
If I remember correctly, in a lot of those load_balance attempts when the machine is under a high Java load, there was no imbalance between the groups in each sched_domain. If there were tasks we could've pulled, we might need to look at why they weren't and maybe fix that. Now it could be that it thinks this cpu, even with the (little) idle time it has, is sufficiently loaded and we'll get a 'local' wakeup soon enough. That's perfectly fine. What we should avoid is spending more time looking for tasks than we have idle, since that reduces the total time we can spend doing useful work. So that is, I think, the critical cut-off point. Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote: So if we have the following:

  for_each_domain(sd) {
      before = sched_clock_cpu()
      load_balance(sd)
      after = sched_clock_cpu()
      idle_balance_completion_time = after - before
  }

At this point, the idle_balance_completion_time is usually a very small value, usually a lot smaller than the avg CPU idle time. However, the vast majority of the time, load_balance() returns 0. I think the interesting question here is: is it significantly more when we do find a task? I would also expect sd->newidle_balance_cost (less typing there) to scale with the number of CPUs in the domain -- thus larger domains will take longer etc. And (obviously) the cost of the entire newidle balance is the direct sum of the individual domain costs. Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. So all we need to ensure is to not run longer than we expect to be idle for, and things should be 'free', right? So the problem I have with your proposal is that, supposing we're successful once every 10 newidle balances, the sd->newidle_balance_cost gets inflated by a factor of 10 -- for we'd count 10 of them before 'success'. However, when we're idle for that amount of time (10 times longer than it takes to do a single newidle balance) we'd still only do a single newidle balance, not 10.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/17/2013 12:18 PM, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 08:59:01AM -0700, Jason Low wrote: Do you think it's worth a try to consider each newidle balance attempt as the total load_balance attempts until it is able to move a task, and then skip balancing within the domain if a CPU's avg idle time is less than that avg time doing newidle balance? So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. So all we need to ensure is to not run longer than we expect to be idle for, and things should be 'free', right? So the problem I have with your proposal is that, supposing we're successful once every 10 newidle balances, the sd->newidle_balance_cost gets inflated by a factor of 10 -- for we'd count 10 of them before 'success'. However, when we're idle for that amount of time (10 times longer than it takes to do a single newidle balance) we'd still only do a single newidle balance, not 10. Could we prevent that downside by measuring both the time spent idle and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Say, have the CPU never spend more than 10% of its idle time in the idle balancing code, as averaged over some time period? That way we might still do idle balancing every X idle periods, even if the idle periods themselves are relatively short. It might also be enough to prevent excessive lock contention triggered by the idle balancing code, though I have to admit I did not really think that part through :)
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N. Okay, I'll try this out. Thank you for your suggestions. Jason.
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Wed, 2013-07-17 at 20:01 +0200, Peter Zijlstra wrote: On Wed, Jul 17, 2013 at 01:51:51PM -0400, Rik van Riel wrote: On 07/17/2013 12:18 PM, Peter Zijlstra wrote: So the way I see things is that the only way newidle balance can slow down things is if it runs when we could have run something useful. Due to contention on the runqueue locks of other CPUs, newidle also has the potential to keep _others_ from running something useful. Right, although that should only happen when we do have an imbalance and want to go move something. Which in Jason's case is 'rare'. But yes, I suppose there's other scenarios where this is far more likely. Could we prevent that downside by measuring both the time spent idle, and the time spent in idle balancing, and making sure the idle balancing time never exceeds more than N% of the idle time? Sure:

  idle_balance(u64 idle_duration)
  {
      u64 cost = 0;

      for_each_domain(sd) {
          if (cost + sd->cost > idle_duration / N)
              break;

          ...

          sd->cost = (sd->cost + this_cost) / 2;
          cost += this_cost;
      }
  }

I would've initially suggested using something like N=2 since we're dealing with averages, and half should ensure we don't run over except for the worst peaks. But we could easily use a bigger N. I ran a few AIM7 workloads for the 8 socket HT-enabled case, and I needed to set N to more than 20 in order to get the big performance gains. One thing that I thought of was to have N be based on how often idle balance attempts do not pull task(s). For example, N can be calculated based on the number of idle balance attempts for the CPU since the last successful idle balance attempt. So if the previous 30 idle balance attempts resulted in no tasks moved, then N = 30 / 5. So idle balance gets less time to run as the number of unneeded idle balance attempts increases, and thus N will not be set too high during situations where idle balancing is successful more often. Any comments on this idea?
Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, 2013-07-16 at 22:20 +0200, Peter Zijlstra wrote: > On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > > there can be a lot of contention in idle_balance() and related functions. > > On many AIM7 workloads in which CPUs go idle very often and idle balance > > gets called a lot, it is actually lowering performance. > > > > Since idle balance often helps performance (when it is not overused), I > > looked into trying to avoid attempting idle balance only when it is > > occurring too frequently. > > > > This RFC patch attempts to keep track of the approximate "average" time > > between > > idle balance attempts per CPU. Each time the idle_balance() function is > > invoked, it will compute the duration since the last idle_balance() for > > the current CPU. The avg time between idle balance attempts is then updated > > using a very similar method as how rq->avg_idle is computed. > > > > Once the average time between idle balance attempts drops below a certain > > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance > > for that CPU will be skipped. The average time between idle balances will > > continue to be updated, even if it ends up getting skipped. The > > initial/maximum average is set a lot higher though to make sure that the > > avg doesn't fall below the threshold until the sample size is large and to > > prevent the avg from being overestimated. > > One of the things I've been talking about for a while now is how I'd > like to use the idle guestimator used for cpuidle for newidle balance. > > Basically based on the estimated idle time limit how far/wide you'll > search for tasks to run. > > You can remove the sysctl and auto-tune by measuring how long it takes > on avg to do a newidle balance. 
Hi Peter, When you say how long it takes on avg to do a newidle balance, are you referring to the avg time it takes for each call to CPU_NEWLY_IDLE load_balance() to complete, or the avg time it takes for newidle balance attempts within a domain to eventually successfully pull/move a task(s)? Thanks, Jason
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On Tue, Jul 16, 2013 at 12:21:03PM -0700, Jason Low wrote: > When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, > there can be a lot of contention in idle_balance() and related functions. > On many AIM7 workloads in which CPUs go idle very often and idle balance > gets called a lot, it is actually lowering performance. > > Since idle balance often helps performance (when it is not overused), I > looked into trying to avoid attempting idle balance only when it is > occurring too frequently. > > This RFC patch attempts to keep track of the approximate "average" time > between > idle balance attempts per CPU. Each time the idle_balance() function is > invoked, it will compute the duration since the last idle_balance() for > the current CPU. The avg time between idle balance attempts is then updated > using a very similar method as how rq->avg_idle is computed. > > Once the average time between idle balance attempts drops below a certain > value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance > for that CPU will be skipped. The average time between idle balances will > continue to be updated, even if it ends up getting skipped. The > initial/maximum average is set a lot higher though to make sure that the > avg doesn't fall below the threshold until the sample size is large and to > prevent the avg from being overestimated. One of the things I've been talking about for a while now is how I'd like to use the idle guestimator used for cpuidle for newidle balance. Basically based on the estimated idle time limit how far/wide you'll search for tasks to run. You can remove the sysctl and auto-tune by measuring how long it takes on avg to do a newidle balance. Also, its not the time between idles that's interesting, but how long you're going to be idle. If you track the time it takes to newidle balance per sched-domain you traverse, you're almost there. 
Re: [RFC] sched: Limit idle_balance() when it is being used too frequently
On 07/16/2013 03:21 PM, Jason Low wrote: When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate "average" time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq->avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. This change improved the performance of many AIM7 workloads at 1, 2, 4, 8 sockets on the 3.10 kernel. The most significant differences were at 8 sockets HT-enabled. The table below compares the average jobs per minute at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this patch. I included data for both hyperthreading disabled and enabled. I used numactl to restrict AIM7 to run on certain number of nodes. I only included data in which the % difference was beyond a 2% noise range. 
Reviewed-by: Rik van Riel -- All rights reversed
[RFC] sched: Limit idle_balance() when it is being used too frequently
When running benchmarks on an 8 socket 80 core machine with a 3.10 kernel, there can be a lot of contention in idle_balance() and related functions. On many AIM7 workloads in which CPUs go idle very often and idle balance gets called a lot, it is actually lowering performance. Since idle balance often helps performance (when it is not overused), I looked into trying to avoid attempting idle balance only when it is occurring too frequently. This RFC patch attempts to keep track of the approximate "average" time between idle balance attempts per CPU. Each time the idle_balance() function is invoked, it will compute the duration since the last idle_balance() for the current CPU. The avg time between idle balance attempts is then updated using a very similar method as how rq->avg_idle is computed. Once the average time between idle balance attempts drops below a certain value (which in this patch is sysctl_sched_idle_balance_limit), idle_balance for that CPU will be skipped. The average time between idle balances will continue to be updated, even if it ends up getting skipped. The initial/maximum average is set a lot higher though to make sure that the avg doesn't fall below the threshold until the sample size is large and to prevent the avg from being overestimated. This change improved the performance of many AIM7 workloads at 1, 2, 4, 8 sockets on the 3.10 kernel. The most significant differences were at 8 sockets HT-enabled. The table below compares the average jobs per minute at 1100-2000 users between the vanilla 3.10 kernel and 3.10 kernel with this patch. I included data for both hyperthreading disabled and enabled. I used numactl to restrict AIM7 to run on certain number of nodes. I only included data in which the % difference was beyond a 2% noise range. 
-- 1 socket --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
disk         | +17.7%         | +4.7%
high_systime | +2.9%          | -

-- 2 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | -              | +2.3%
disk         | +10.5%         | -
fserver      | +3.6%          | -
new_fserver  | +3.7%          | -

-- 4 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | +3.7%          | -
custom       | -2.2%          | +14.0%
fserver      | +2.8%          | -
high_systime | -3.6%          | +18.7%
new_fserver  | +3.4%          | -

-- 8 sockets --
workload     | HT-disabled    | HT-enabled
             | % improvement  | % improvement
             | with patch     | with patch
---------------------------------------------
alltests     | +4.4%          | +13.3%
custom       | +8.1%          | +15.2%