Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
Hi, Peter

Thanks for the reply :)

On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
[snip]
>>
>> cpu 0                           cpu 1
>>
>> dbench                          task_sys
>> dbench                          task_sys
>> dbench
>> dbench
>> dbench
>> dbench
>> task_sys
>> task_sys
>
> It might help if you prefix each task with the cgroup they're in;

My bad...

> but I think I get it, its like:
>
> cpu0
>
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> A/dbench
> /task_sys
> /task_sys

Yeah, it's like that.

> [snip]
>
> cpu0
>
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> A/B/dbench
> /task_sys
> /task_sys
>
> Right?

My bad, I missed the group path here... it's actually like:

cpu0

/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/task_sys
/task_sys

And we also have six /l1/B/stress and six /l1/C/stress tasks running in
the system. A, B and C are the child groups of l1.

>
>> cpu 0                   cpu 1
>> load  1024/3 + 1024*2   1024*2
>>
>> 2389 : 2048     imbalance %116
>
> Which should still end up with 3072, because A is still 1024 in total,
> and all its member tasks run on the one CPU.

l1 has 3 child groups, each with 6 nice-0 tasks, so ideally each task
gets 1024/18 of root load, and the 6 dbench tasks together mean
(1024/18)*6 == 1024/3. Previously each of the 3 groups had 1024 shares
of its own; now they have to split l1's 1024 shares, so each of them
ends up with less.

>
>> And it could be even less during my testing...
>
> Well, yes, up to 1024/nr_cpus I imagine.
>
>> This is just try to explain that when 'group_load : rq_load' become
>> lower, it's influence to 'rq_load' become lower too, and if the system
>> is balanced with only 'rq_load' there, it will be considered still
>> balanced even 'group_load' gathered on one cpu.
>>
>> Please let me know if I missed something here...
>
> Yeah, what other tasks are these task_sys things? workqueue crap?

There are some other tasks, but the ones that show up most are the
kworkers, yes, the workqueue stuff. They show up rapidly on each CPU; in
periods when they show up too much they eat some CPU% as well, but not
very much.
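The share arithmetic here can be written out explicitly (a quick sketch of the numbers above; 1024 is both the default cpu.shares value and the nice-0 task weight):

```python
SHARES = 1024          # default cpu.shares, also the nice-0 task weight

# l1 has 3 child groups (A, B, C), each running 6 nice-0 tasks,
# so 18 tasks compete for l1's single 1024 share at the root level.
tasks = 3 * 6
per_task = SHARES / tasks        # ~56.9 root-load per task
dbench_group = 6 * per_task      # the 6 dbench tasks together

assert round(per_task, 1) == 56.9
assert abs(dbench_group - SHARES / 3) < 1e-6   # == 1024/3, not 1024

# Previously (groups directly under root) the dbench group
# contributed a full 1024 of root load -- three times as much.
```

Which is exactly why the gathered dbench group weighs only 1024/3 at the root level in the l2 case.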
> [snip]
>> These are dbench and stress with less root-load when put into l2-groups,
>> that make it harder to trigger root-group imbalance like in the case above.
>
> You're still not making sense here.. without the task_sys thingies in
> you get something like:
>
> cpu0            cpu1
>
> A/dbench        A/dbench
> B/stress        B/stress
>
> And the total loads are: 512+512 vs 512+512.

Without other tasks' influence, I believe the balance would be fine, but
in our case at least these kworkers will join the battle anyway...

>
>>> Same with l2, total weight of 1024, giving a per task weight of ~56 and
>>> a per-cpu weight of ~85, which is again significant.
>>
>> We have other tasks which has to running in the system, in order to
>> serve dbench and others, and that also the case in real world, dbench
>> and stress are not the only tasks on rq time to time.
>>
>> May be we could focus on the case above and see if it could make things
>> more clear firstly?
>
> Well, this all smells like you need some cgroup affinity for whatever
> system tasks are running. Not fuck up the scheduler for no sane reason.

These kworkers are bound to their CPUs already; I don't know how to
handle them to prevent the issue. They just keep working on their CPUs,
and whenever they show up, dbench stops spreading...

We just want some way to help workloads like dbench work normally with
cpu-cgroup when stress-like workloads are running in the system. We want
dbench to gain more CPU%, but cpu.shares doesn't work as expected...
dbench can get no more than 100% however big its group's shares are, and
we consider cpu-cgroup broken in such cases...

I agree this is not a generic requirement and the scheduler should only
be responsible for the general situation, but since it's really too big
a regression, could we at least provide some way to stop the damage?
After all, most of the cpu-cgroup logic is inside the scheduler...
I'd like to list some real numbers in the patch thread; we really need
some way to make cpu-cgroup perform normally on workloads like dbench.
Actually we found some transaction workloads suffering from this issue
too; in such cases cpu-cgroup simply fails at managing CPU resources...

Regards,
Michael Wang

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Wed, Jun 11, 2014 at 05:18:29PM +0800, Michael wang wrote:
> On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> IMHO, when we put tasks one group deeper, in other word the totally
> >> weight of these tasks is 1024 (prev is 3072), the load become more
> >> balancing in root, which make bl-routine consider the system is
> >> balanced, which make we migrate less in lb-routine.
> >
> > But how? The absolute value (1024 vs 3072) is of no effect to the
> > imbalance, the imbalance is computed from relative differences between
> > cpus.
>
> Ok, forgive me for the confusion, please allow me to explain things
> again, for gathered cases like:
>
> cpu 0           cpu 1
>
> dbench          task_sys
> dbench          task_sys
> dbench
> dbench
> dbench
> dbench
> task_sys
> task_sys

It might help if you prefix each task with the cgroup they're in; but I
think I get it, its like:

cpu0

A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
/task_sys
/task_sys

> task_sys is other tasks belong to root which is nice 0, so when dbench
> in l1:
>
> cpu 0                   cpu 1
> load  1024 + 1024*2     1024*2
>
> 3072 : 2048     imbalance %150
>
> now when they belong to l2:

That would be:

cpu0

A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
/task_sys
/task_sys

Right?

> cpu 0                   cpu 1
> load  1024/3 + 1024*2   1024*2
>
> 2389 : 2048     imbalance %116

Which should still end up with 3072, because A is still 1024 in total,
and all its member tasks run on the one CPU.

> And it could be even less during my testing...

Well, yes, up to 1024/nr_cpus I imagine.

> This is just try to explain that when 'group_load : rq_load' become
> lower, it's influence to 'rq_load' become lower too, and if the system
> is balanced with only 'rq_load' there, it will be considered still
> balanced even 'group_load' gathered on one cpu.
>
> Please let me know if I missed something here...

Yeah, what other tasks are these task_sys things? workqueue crap?
> >> Exactly, however, when group is deep, the chance of it to make root
> >> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> >> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> >> imbalance and gain help from the routine, please note that although
> >> dbench and stress are the only workload in system, there are still other
> >> tasks serve for the system need to be wakeup (some very actively since
> >> the dbench...), compared to them, deep group load means nothing...
> >
> > What tasks are these? And is it their interference that disturbs
> > load-balancing?
>
> These are dbench and stress with less root-load when put into l2-groups,
> that make it harder to trigger root-group imbalance like in the case above.

You're still not making sense here.. without the task_sys thingies in
you get something like:

cpu0            cpu1

A/dbench        A/dbench
B/stress        B/stress

And the total loads are: 512+512 vs 512+512.

> > Same with l2, total weight of 1024, giving a per task weight of ~56 and
> > a per-cpu weight of ~85, which is again significant.
>
> We have other tasks which has to running in the system, in order to
> serve dbench and others, and that also the case in real world, dbench
> and stress are not the only tasks on rq time to time.
>
> May be we could focus on the case above and see if it could make things
> more clear firstly?

Well, this all smells like you need some cgroup affinity for whatever
system tasks are running. Not fuck up the scheduler for no sane reason.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
[snip]
>>
>> IMHO, when we put tasks one group deeper, in other word the totally
>> weight of these tasks is 1024 (prev is 3072), the load become more
>> balancing in root, which make bl-routine consider the system is
>> balanced, which make we migrate less in lb-routine.
>
> But how? The absolute value (1024 vs 3072) is of no effect to the
> imbalance, the imbalance is computed from relative differences between
> cpus.

Ok, forgive me for the confusion, please allow me to explain things
again, for gathered cases like:

cpu 0           cpu 1

dbench          task_sys
dbench          task_sys
dbench
dbench
dbench
dbench
task_sys
task_sys

task_sys are other tasks belonging to root at nice 0, so when dbench is
in l1:

cpu 0                   cpu 1
load  1024 + 1024*2     1024*2

3072 : 2048     imbalance %150

now when they belong to l2:

cpu 0                   cpu 1
load  1024/3 + 1024*2   1024*2

2389 : 2048     imbalance %116

And it could be even less during my testing...

This is just trying to explain that when 'group_load : rq_load' becomes
lower, its influence on 'rq_load' becomes lower too, and if the system
is balanced with only 'rq_load' there, it will be considered still
balanced even with 'group_load' gathered on one cpu.

Please let me know if I missed something here...

> [snip]
>>
>> Although the l1-group gain the same resources (1200%), it doesn't assign
>> to l2-ABC correctly like the root-group did.
>
> But in this case select_idle_sibling() should function identically, so
> that cannot be the problem.

Yes, it's clean, select_idle_sibling() just returns curr or prev cpu in
this case.
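The two imbalance percentages can be reproduced from the load figures in the tables (a sketch; "imbalance %" here is simply 100 * load(cpu0) / load(cpu1)):

```python
NICE0 = 1024                      # nice-0 task weight

cpu1 = 2 * NICE0                  # two task_sys tasks: 2048

# l1 case: the gathered dbench group adds its full 1024 to cpu 0.
cpu0_l1 = NICE0 + 2 * NICE0       # 3072
# l2 case: the gathered group adds only 1024/3 of root load.
cpu0_l2 = NICE0 / 3 + 2 * NICE0   # ~2389

assert cpu0_l1 * 100 // cpu1 == 150          # imbalance %150
assert int(cpu0_l2 * 100 / cpu1) == 116      # imbalance %116
```

So moving the group one level deeper shrinks the apparent imbalance from 150% to 116% even though the task placement is identical.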
> [snip]
>> Exactly, however, when group is deep, the chance of it to make root
>> imbalance reduced, in good case, gathered on cpu means 1024 load, while
>> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
>> imbalance and gain help from the routine, please note that although
>> dbench and stress are the only workload in system, there are still other
>> tasks serve for the system need to be wakeup (some very actively since
>> the dbench...), compared to them, deep group load means nothing...
>
> What tasks are these? And is it their interference that disturbs
> load-balancing?

These are dbench and stress with less root-load when put into l2-groups,
which makes it harder to trigger root-group imbalance as in the case
above.

>>>> By which means even tasks in deep group all gathered on one CPU, the load
>>>> could still balanced from the view of root group, and the tasks lost the
>>>> only chances (balance) to spread when they already on the same CPU...
>>>
>>> Sure, but see above.
>>
>> The lb-routine could not provide enough help for deep group, since the
>> imbalance happened inside the group could not cause imbalance in root,
>> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
>> easily ignored, but inside the l2-group, the gathered case could already
>> means imbalance like (1024 * 5) : 1024.
>
> your explanation is not making sense, we have 3 cgroups, so the total
> root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

I meant the l2-groups case here... since l1's share is 1024, the total
load of the l2-groups will be 1024 in theory.

>
> And again, the absolute value doesn't matter, with (istr) 12 cpus the
> avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
> scale.
>
> Same with l2, total weight of 1024, giving a per task weight of ~56 and
> a per-cpu weight of ~85, which is again significant.
We have other tasks which have to run in the system in order to serve
dbench and the others, and that is also the case in the real world;
dbench and stress are not the only tasks on the rq from time to time.

Maybe we could focus on the case above first and see if it makes things
clearer?

Regards,
Michael Wang

>
> Also, you said load-balance doesn't usually participate much because
> dbench is too fast, so please make up your mind, does it or doesn't it
> matter?
>
>>> So I think that approach is wrong, select_idle_siblings() works because
>>> we want to keep CPUs from being idle, but if they're not actually idle,
>>> pretending like they are (in a cgroup) is actively wrong and can skew
>>> load pretty bad.
>>
>> We only choose the timing when no idle cpu located, and flips is
>> somewhat high, also the group is deep.
>
> -enotmakingsense
>
>> In such cases, select_idle_siblings() doesn't works anyway, it return
>> the target even it is very busy, we just check twice to prevent it from
>> making some obviously bad decision ;-)
>
> -emakinglesssense
>
>>> Furthermore, if as I expect, dbench sucks on a busy system, then the
>>> proposed cgroup thing
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
>
> Thanks for the reply :)
>
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine for sure pull tasks together for workload like dbench, what
> >> make it difference when put dbench into a group one level deeper is the
> >> load-balance, which happened less.
> >
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
>
> IMHO, when we put tasks one group deeper, in other word the totally
> weight of these tasks is 1024 (prev is 3072), the load become more
> balancing in root, which make bl-routine consider the system is
> balanced, which make we migrate less in lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy-system, all the two cases have
> the same workload running, the only difference is that we put the same
> workload (dbench + stress) one group deeper, it's like:
>
> Good case:
>
>   root
>   l1-A        l1-B        l1-C
>   dbench      stress      stress
>
> results:
>   dbench got around 300%
>   each stress got around 450%
>
> Bad case:
>
>   root
>   l1
>   l2-A        l2-B        l2-C
>   dbench      stress      stress
>
> results:
>   dbench got around 100% (throughput dropped too)
>   each stress got around 550%
>
> Although the l1-group gain the same resources (1200%), it doesn't assign
> to l2-ABC correctly like the root-group did.

But in this case select_idle_sibling() should function identically, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> >
> >> However, in our cases the load balance could not help on that, since deeper
> >> the group is, less the load effect it means to root group.
> >
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
> Exactly, however, when group is deep, the chance of it to make root
> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> imbalance and gain help from the routine, please note that although
> dbench and stress are the only workload in system, there are still other
> tasks serve for the system need to be wakeup (some very actively since
> the dbench...), compared to them, deep group load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even tasks in deep group all gathered on one CPU, the load
> >> could still balanced from the view of root group, and the tasks lost the
> >> only chances (balance) to spread when they already on the same CPU...
> >
> > Sure, but see above.
>
> The lb-routine could not provide enough help for deep group, since the
> imbalance happened inside the group could not cause imbalance in root,
> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
> easily ignored, but inside the l2-group, the gathered case could already
> means imbalance like (1024 * 5) : 1024.

your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_siblings() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
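Peter's figures are straightforward to verify (a sketch assuming the 12 CPUs and 18 tasks stated above):

```python
NICE0 = 1024
cpus, tasks = 12, 18

# Root level: three sibling groups of 1024 shares each.
root_total = 3 * NICE0
assert root_total // tasks == 170    # per-task root weight ~170
assert root_total // cpus == 256     # avg per-cpu load ~256

# l2 level: the three groups split l1's single 1024 shares.
l2_total = NICE0
assert l2_total // tasks == 56       # per-task weight ~56
assert l2_total // cpus == 85        # per-cpu weight ~85
```

Either way, the per-task weight stays within the same order of magnitude as the per-cpu average, which is the sense in which it is "significant".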
> We only choose the timing when no idle cpu located, and flips is
> somewhat high, also the group is deep.

-enotmakingsense

> In such cases, select_idle_siblings() doesn't works anyway, it return
> the target even it is very busy, we just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
>
> That's true and that's why we currently still need to shut down the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assign the resource
> according to the share, it works well in l1-groups, so we expect it to
> wor
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
Hi, Peter

Thanks for the reply :)

On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
[snip]
>> Wake-affine for sure pull tasks together for workload like dbench, what make
>> it difference when put dbench into a group one level deeper is the
>> load-balance, which happened less.
>
> We load-balance less (frequently) or we migrate less tasks due to
> load-balancing ?

IMHO, when we put tasks one group deeper, in other words the total
weight of these tasks is 1024 (previously 3072), the load becomes more
balanced in root, which makes the bl-routine consider the system
balanced, which makes us migrate less in the lb-routine.

>
>> Usually, when system is busy, during the wakeup when we could not locate
>> idle cpu, we pick the search point instead, whatever how busy it is since
>> we count on the balance routine later to help balance the load.
>
> But above you said that dbench usually triggers the wake-affine logic,
> but now you say it doesn't and we rely on select_idle_sibling?

During wakeup it triggers wake-affine; after that we go inside
select_idle_sibling(), and if no idle cpu is found, we pick the search
point instead (curr cpu if wake-affine, prev cpu if not).

>
> Note that the comparison isn't fair, running dbench on an idle system vs
> running dbench on a busy system is the first step.

Our comparison is based on the same busy system; both cases have the
same workload running, the only difference is that we put the same
workload (dbench + stress) one group deeper, it's like:

Good case:

  root
  l1-A        l1-B        l1-C
  dbench      stress      stress

results:
  dbench got around 300%
  each stress got around 450%

Bad case:

  root
  l1
  l2-A        l2-B        l2-C
  dbench      stress      stress

results:
  dbench got around 100% (throughput dropped too)
  each stress got around 550%

Although the l1-group gains the same resources (1200%), it doesn't
assign them to l2-ABC correctly like the root-group did.

>
> The second is adding the cgroup crap on.
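One thing worth noticing about the good/bad case figures: l1 as a whole receives the same total CPU in both cases; only the split between dbench and stress changes (a quick check of the numbers quoted above):

```python
# Good case: dbench in l1-A, stress in l1-B and l1-C.
good = 300 + 450 + 450
# Bad case: the same workloads one level deeper (l2-A/B/C under l1).
bad = 100 + 550 + 550

# l1's total is 1200% either way; the regression is purely in how
# that total is divided among the sibling groups.
assert good == bad == 1200
```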
>
>> However, in our cases the load balance could not help on that, since deeper
>> the group is, less the load effect it means to root group.
>
> But since all actual load is on the same depth, the relative threshold
> (imbalance pct) should work the same, the size of the values don't
> matter, the relative ratios do.

Exactly; however, when the group is deep, the chance of it making root
imbalance is reduced: in the good case, gathering on a cpu means 1024
load, while in the bad case it drops to 1024/3 ideally, which makes it
harder to trigger imbalance and gain help from the routine. Please note
that although dbench and stress are the only workload in the system,
there are still other tasks serving the system that need to be woken up
(some very actively because of dbench...); compared to them, a deep
group's load means nothing...

>
>> By which means even tasks in deep group all gathered on one CPU, the load
>> could still balanced from the view of root group, and the tasks lost the
>> only chances (balance) to spread when they already on the same CPU...
>
> Sure, but see above.

The lb-routine could not provide enough help for a deep group, since
imbalance inside the group does not cause imbalance in root: ideally
each l2-task gains 1024/18 ~= 56 root-load, which is easily ignored, but
inside the l2-group the gathered case can already mean an imbalance like
(1024 * 5) : 1024.

>
>> Furthermore, for tasks flip frequently like dbench, it'll become far more
>> harder for load balance to help, it could even rarely catch them on rq.
>
> And I suspect that is the main problem; so see what it does on a busy
> system: !cgroup: nr_cpus busy loops + dbench, because that's your
> benchmark for adding cgroups, the cgroup can only shift that behaviour
> around.

There are busy loops in the good case too, and dbench behaviour in the
l1-groups should not change after putting them into l2-groups; what
makes things worse is that the chance for them to spread after
gathering becomes less.
> [snip]
>> Below patch has solved the problem during the testing, I'd like to do more
>> testing on other benchmarks before send out the formal patch, any comments
>> are welcomed ;-)
>
> So I think that approach is wrong, select_idle_siblings() works because
> we want to keep CPUs from being idle, but if they're not actually idle,
> pretending like they are (in a cgroup) is actively wrong and can skew
> load pretty bad.

We only choose the timing when no idle cpu is located, the flips count
is somewhat high, and the group is deep.

In such cases select_idle_sibling() doesn't work anyway, it returns the
target however busy it is; we just check twice to prevent it from making
some obviously bad decision ;-)

>
> Furthermore, if as I expect, dbench sucks on a busy system, then the
> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> alter behav
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Tue, Jun 10, 2014 at 04:56:12PM +0800, Michael wang wrote:
> On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
> [snip]
> >
> > Hmm, that _should_ more or less work and does indeed suggest there's
> > something iffy.
>
> I think we locate the reason why cpu-cgroup doesn't works well on dbench
> now... finally.
>
> I'd like to link the reproduce way of the issue here since long time
> passed...
>
>   https://lkml.org/lkml/2014/5/16/4
>
> Now here is the analysis:
>
> So our problem is when put tasks like dbench which sleep and wakeup each
> other frequently into a deep-group, they will gathered on same CPU when
> workload like stress are running, which lead to that the whole group
> could gain no more than one CPU.
>
> Basically there are two key points here, load-balance and wake-affine.
>
> Wake-affine for sure pull tasks together for workload like dbench, what
> make it difference when put dbench into a group one level deeper is the
> load-balance, which happened less.

We load-balance less (frequently) or we migrate less tasks due to
load-balancing ?

> Usually, when system is busy, during the wakeup when we could not locate
> idle cpu, we pick the search point instead, whatever how busy it is since
> we count on the balance routine later to help balance the load.

But above you said that dbench usually triggers the wake-affine logic,
but now you say it doesn't and we rely on select_idle_sibling?

Note that the comparison isn't fair, running dbench on an idle system vs
running dbench on a busy system is the first step.

The second is adding the cgroup crap on.

> However, in our cases the load balance could not help on that, since
> deeper the group is, less the load effect it means to root group.

But since all actual load is on the same depth, the relative threshold
(imbalance pct) should work the same, the size of the values don't
matter, the relative ratios do.
> By which means even tasks in deep group all gathered on one CPU, the load
> could still balanced from the view of root group, and the tasks lost the
> only chances (balance) to spread when they already on the same CPU...

Sure, but see above.

> Furthermore, for tasks flip frequently like dbench, it'll become far more
> harder for load balance to help, it could even rarely catch them on rq.

And I suspect that is the main problem; so see what it does on a busy
system: !cgroup: nr_cpus busy loops + dbench, because that's your
benchmark for adding cgroups, the cgroup can only shift that behaviour
around.

> So in such cases, the only chance to do balance for these tasks is during
> the wakeup, however it will be expensive...
>
> Thus the cheaper way is something just like select_idle_sibling(), the
> only difference is now we balance tasks inside the group to prevent them
> from gathered.
>
> Below patch has solved the problem during the testing, I'd like to do
> more testing on other benchmarks before send out the formal patch, any
> comments are welcomed ;-)

So I think that approach is wrong, select_idle_siblings() works because
we want to keep CPUs from being idle, but if they're not actually idle,
pretending like they are (in a cgroup) is actively wrong and can skew
load pretty bad.

Furthermore, if as I expect, dbench sucks on a busy system, then the
proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
alter behaviour like that.

More so, I suspect that patch will tend to overload cpu0 (and lower cpu
numbers in general -- because its scanning in the same direction for
each cgroup) for other workloads. You can't just go pile more and more
work on cpu0 just because there's nothing running in this particular
cgroup.

So dbench is very sensitive to queueing, and select_idle_siblings()
avoids a lot of queueing on an idle system. I don't think that's
something we should fix with cgroups.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
>
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.
>

I think we have located the reason why cpu-cgroup doesn't work well on
dbench now... finally.

I'd like to link the way to reproduce the issue here since a long time
has passed...

  https://lkml.org/lkml/2014/5/16/4

Now here is the analysis:

Our problem is that when we put tasks like dbench, which sleep and wake
each other frequently, into a deep group, they gather on the same CPU
when a workload like stress is running, which means the whole group can
gain no more than one CPU.

Basically there are two key points here: load-balance and wake-affine.

Wake-affine for sure pulls tasks together for a workload like dbench;
what makes the difference when we put dbench into a group one level
deeper is the load-balance, which happens less.

Usually, when the system is busy and we cannot locate an idle cpu during
wakeup, we pick the search point instead, however busy it is, since we
count on the balance routine later to help balance the load.

However, in our case the load balance cannot help with that, since the
deeper the group is, the less load it contributes to the root group.

By which means even if tasks in a deep group all gather on one CPU, the
load can still look balanced from the view of the root group, and the
tasks lose the only chance (balance) to spread once they are already on
the same CPU...

Furthermore, for tasks that flip frequently like dbench, it becomes far
harder for load balance to help; it may even rarely catch them on the rq.

So in such cases, the only chance to balance these tasks is during
wakeup, but that would be expensive...

Thus the cheaper way is something just like select_idle_sibling(); the
only difference is that now we balance tasks inside the group to prevent
them from gathering.
The patch below has solved the problem during testing; I'd like to do
more testing on other benchmarks before sending out the formal patch,
any comments are welcomed ;-)

Regards,
Michael Wang

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..e1381cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	return idlest;
 }
 
+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+	return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try and locate an idle CPU in the sched_domain from tg's view.
+ *
+ * Although gathering on one CPU vs spreading across CPUs may make
+ * no difference from the highest group's view, it will cause the
+ * tasks to starve: even if they have enough share to fight for CPU,
+ * they only get one battlefield, which means however big their
+ * weight is, they get one CPU at maximum.
+ *
+ * Thus when the system is busy, we filter out those tasks which
+ * couldn't gain help from the balance routine, and try to balance
+ * them internally via this func, so they stand a chance to show
+ * their power.
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int i = task_cpu(p);
+	struct task_group *tg = task_group(p);
+
+	if (tg_idle_cpu(tg, target))
+		goto done;
+
+	sd = rcu_dereference(per_cpu(sd_llc, target));
+	for_each_lower_domain(sd) {
+		sg = sd->groups;
+		do {
+			if (!cpumask_intersects(sched_group_cpus(sg),
+						tsk_cpus_allowed(p)))
+				goto next;
+
+			for_each_cpu(i, sched_group_cpus(sg)) {
+				if (i == target || !tg_idle_cpu(tg, i))
+					goto next;
+			}
+
+			target = cpumask_first_and(sched_group_cpus(sg),
+					tsk_cpus_allowed(p));
+
+			goto done;
+next:
+			sg = sg->next;
+		} while (sg != sd->groups);
+	}
+
+done:
+
+	return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
 */
@@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	struct sched_entity *se = task_group(p)->se[i];
 
 	if (idle_cpu(target))
 		return target;
@@ -4451,6 +4508,30 @@ next:
 	} while (sg != sd->groups);
 }
 done:
+
+	if (!idle_cpu(target)) {
+		/*
+		 * No idle cpu located imply the system is somewhat busy,
+		 * usually we count on load balance routine's help and
+		 * just pick the target whatever how busy it is.
+		 *
+		 * However, when task belong to a
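The last hunk is cut off in the archive, but the idea the patch implements can be sketched outside the kernel. Below is an illustrative Python model (not the actual kernel code): `nr_running[(group, cpu)]` stands in for `tg->cfs_rq[cpu]->nr_running`, and a flat list of allowed CPUs stands in for the sched_domain/sched_group walk:

```python
def tg_idle_sibling(nr_running, group, target, allowed):
    """Pick a CPU that is idle from the task group's point of view.

    A CPU may be busy overall (so select_idle_sibling() rejects it)
    yet have none of this group's tasks queued; moving a waking task
    there spreads the group instead of piling everything onto one CPU.
    """
    if nr_running.get((group, target), 0) == 0:
        return target                    # target is group-idle already

    for cpu in allowed:
        if cpu != target and nr_running.get((group, cpu), 0) == 0:
            return cpu                   # first group-idle CPU found

    return target                        # none found: keep original pick

# Six dbench tasks of group 'A' gathered on cpu 0; cpus 1-3 run other
# groups' work, so they are busy overall but idle from A's view.
queued = {('A', 0): 6, ('B', 1): 1, ('C', 2): 1}
assert tg_idle_sibling(queued, 'A', 0, [0, 1, 2, 3]) == 1
```

Unlike the real patch this ignores sched_domain topology and always scans CPUs in the same order, which is exactly the cpu0-overloading concern Peter raises elsewhere in the thread.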
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
>>> Right. I played a little (sane groups), saw load balancing as well.
>>
>> Yeah, now we found that even l2 groups will face the same issue, allow
>> me to re-list the details here:
>
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.

Yeah, a sane group topology shows the issue too... besides the sleeper
bonus, it seems like the root cause is the tasks starting to gather. I
plan to check the difference in task load between the two cases and see
if there is a good way to solve this problem :)

Regards,
Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Fri, May 16, 2014 at 12:24:35PM +0800, Michael wang wrote: > Hey, Mike :) > > On 05/16/2014 10:51 AM, Mike Galbraith wrote: > > On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote: > > > >> But we found that one difference when group get deeper is the tasks of > >> that group become to gathered on CPU more often, some time all the > >> dbench instances was running on the same CPU, this won't happen for l1 > >> group, may could explain why dbench could not get CPU more than 100% any > >> more. > > > > Right. I played a little (sane groups), saw load balancing as well. > > Yeah, now we found that even l2 groups will face the same issue, allow > me to re-list the details here: Hmm, that _should_ more or less work and does indeed suggest there's something iffy.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Fri, May 16, 2014 at 10:23:11AM +0800, Michael wang wrote: > On 05/15/2014 07:57 PM, Peter Zijlstra wrote: > [snip] > >> > >> It's like: > >> > >>/cgroup/cpu/l1/l2/l3/l4/l5/l6/A > >> > >> about level 7, the issue can not be solved any more. > > > > That's pretty retarded and yeah, that's way past the point where things > > make sense. You might be lucky and have l1-5 as empty/pointless > > hierarchy so the effective depth is less and then things will work, but > > *shees*.. > > Exactly, that's the simulation of cgroup topology setup by libvirt, > really doesn't make sense... rather torture than deployment, but they do > make things like that... I'm calling it broken and unfit for purpose if it does crazy shit like that. There's really not much we can do to fix it either, barring softfloat in the load-balancer and I'm sure everybody but virt wankers will complain about _that_.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
Hey, Mike :) On 05/16/2014 10:51 AM, Mike Galbraith wrote: > On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote: > >> But we found that one difference when group get deeper is the tasks of >> that group become to gathered on CPU more often, some time all the >> dbench instances was running on the same CPU, this won't happen for l1 >> group, may could explain why dbench could not get CPU more than 100% any >> more. > > Right. I played a little (sane groups), saw load balancing as well. Yeah, now we found that even l2 groups will face the same issue, allow me to re-list the details here: First, apply the workaround (10 times the latency): echo 24000 > /proc/sys/kernel/sched_latency_ns echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features This workaround may be related to another issue about the vruntime bonus for sleepers, but let's put it aside for now and focus on the gathering issue. Create groups like: mkdir /sys/fs/cgroup/cpu/A mkdir /sys/fs/cgroup/cpu/B mkdir /sys/fs/cgroup/cpu/C mkdir /sys/fs/cgroup/cpu/l1 mkdir /sys/fs/cgroup/cpu/l1/A mkdir /sys/fs/cgroup/cpu/l1/B mkdir /sys/fs/cgroup/cpu/l1/C Run workload like (6 is half of the CPUs on my box): echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6 echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress 6 echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress 6 Check top: each dbench instance got around 45%, around 270% in total. This is close to the case when only dbench is running (300%) since we use the workaround; otherwise we would see it around 100%, but that's another issue... By sampling /proc/sched_debug, we rarely see more than 2 dbench instances on the same rq. Now re-run the workload like: echo $$ > /sys/fs/cgroup/cpu/l1/A/tasks ; dbench 6 echo $$ > /sys/fs/cgroup/cpu/l1/B/tasks ; stress 6 echo $$ > /sys/fs/cgroup/cpu/l1/C/tasks ; stress 6 Check top: each dbench instance got around 20%, around 120% in total, sometimes dropped under 100%, and dbench throughput dropped.
By sampling /proc/sched_debug, we frequently see 4 or 5 dbench instances on the same rq. So it is just one level deeper, from l1 to l2, and yet such a big difference: groups with the same shares do not equally share the resources... BTW, by binding each dbench instance to a different CPU, dbench in the l2 group regains all its CPU%, which is 300%. I'll keep investigating and try to figure out why the l2 group's tasks start to gather, please let me know if there are any suggestions ;-) Regards, Michael Wang > > -Mike >
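The "how many dbench instances on one rq" check can also be approximated without parsing /proc/sched_debug, e.g. by reading each task's last-run CPU from /proc/<pid>/stat (the `processor` field, field 39). This is an illustrative sketch, not the script used in the report; the process name and sampling method are assumptions:

```python
# Count how many tasks named `comm` were last run on each CPU by
# reading /proc/<pid>/stat; field 39 ("processor") is the CPU number.
import collections
import glob
import re

def cpus_of(comm):
    placement = collections.Counter()
    for stat in glob.glob("/proc/[0-9]*/stat"):
        try:
            data = open(stat).read()
        except OSError:
            continue  # task exited while we were sampling
        name = re.search(r"\((.*)\)", data).group(1)
        if name == comm:
            # Fields after the closing paren start at field 3 ("state"),
            # so field 39 ("processor") is at index 36.
            cpu = int(data.rsplit(")", 1)[1].split()[36])
            placement[cpu] += 1
    return placement
```

Sampling `cpus_of("dbench")` in a loop during the l2 run would show several instances sharing one CPU, matching the /proc/sched_debug observation above.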
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote: > But we found that one difference when group get deeper is the tasks of > that group become to gathered on CPU more often, some time all the > dbench instances was running on the same CPU, this won't happen for l1 > group, may could explain why dbench could not get CPU more than 100% any > more. Right. I played a little (sane groups), saw load balancing as well. -Mike
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/15/2014 07:57 PM, Peter Zijlstra wrote: [snip] >> >> It's like: >> >> /cgroup/cpu/l1/l2/l3/l4/l5/l6/A >> >> about level 7, the issue can not be solved any more. > > That's pretty retarded and yeah, that's way past the point where things > make sense. You might be lucky and have l1-5 as empty/pointless > hierarchy so the effective depth is less and then things will work, but > *shees*.. Exactly, that's a simulation of the cgroup topology set up by libvirt; it really doesn't make sense... more torture than deployment, but they do make things like that... > [snip] >> I'm not sure which account will turns to be huge when group get deeper, >> the load accumulation will suffer discount when passing up, isn't it? >> > > It'll use 20 bits for precision instead of 10, so it gives a little more > 'room' for deeper hierarchies/big cpu-count. Got it :) > > All assuming you're running 64bit kernels of course. Yes, it's 64bit. I tried the testing with this feature on; it seems like it hasn't addressed the issue... But we found that one difference when the group gets deeper is that the tasks of that group gather on a CPU more often; sometimes all the dbench instances were running on the same CPU. This doesn't happen for the l1 group, which may explain why dbench could not get more than 100% CPU any more. But why the gathering happens when the group gets deeper is unclear... will try to figure it out :) Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Thu, May 15, 2014 at 05:35:25PM +0800, Michael wang wrote: > On 05/15/2014 05:06 PM, Peter Zijlstra wrote: > [snip] > >> However, when the group level is too deep, that doesn't works any more... > >> > >> I'm not sure but seems like 'deep group level' and 'vruntime bonus for > >> sleeper' is the keep points here, will try to list the root cause after > >> more investigation, thanks for the hints and suggestions, really helpful > >> ;-) > > > > How deep is deep? You run into numerical problems quite quickly, esp. > > when you've got lots of CPUs. We've only got 64bit to play with, that > > said there were some patches... > > It's like: > > /cgroup/cpu/l1/l2/l3/l4/l5/l6/A > > about level 7, the issue can not be solved any more. That's pretty retarded and yeah, that's way past the point where things make sense. You might be lucky and have l1-5 as empty/pointless hierarchy so the effective depth is less and then things will work, but *shees*.. > > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage > > under light load */ > > +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage > > under light load */ > > That is trying to solve the load overflow issue, correct? > > I'm not sure which account will turns to be huge when group get deeper, > the load accumulation will suffer discount when passing up, isn't it? > It'll use 20 bits for precision instead of 10, so it gives a little more 'room' for deeper hierarchies/big cpu-count. All assuming you're running 64bit kernels of course.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/15/2014 05:06 PM, Peter Zijlstra wrote: [snip] >> However, when the group level is too deep, that doesn't works any more... >> >> I'm not sure but seems like 'deep group level' and 'vruntime bonus for >> sleeper' is the keep points here, will try to list the root cause after >> more investigation, thanks for the hints and suggestions, really helpful ;-) > > How deep is deep? You run into numerical problems quite quickly, esp. > when you've got lots of CPUs. We've only got 64bit to play with, that > said there were some patches... It's like: /cgroup/cpu/l1/l2/l3/l4/l5/l6/A about level 7, the issue can not be solved any more. > > What happens if you do the below, Google has been running with that, and > nobody was ever able to reproduce the report that got it disabled. > > > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h > index b2cbe81308af..e40819d39c69 100644 > --- a/kernel/sched/sched.h > +++ b/kernel/sched/sched.h > @@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq); > * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify > the > * increased costs. > */ > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage > under light load */ > +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage > under light load */ That is trying to solve the load overflow issue, correct? I'm not sure which accounting will turn out to be huge when the group gets deeper; the load accumulation suffers a discount when passing up, doesn't it?
Anyway, will give it a try and see what happens :) Regards, Michael Wang > # define SCHED_LOAD_RESOLUTION 10 > # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) > # define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
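Peter's point about 20 vs 10 bits of precision can be illustrated with a toy calculation. The numbers are illustrative only, not the kernel's exact load math: assume each nesting level divides the parent group's weight among 3 equal sibling groups, as in the dbench/stress x3 test. With plain 10-bit weights a 7-level-deep hierarchy rounds the effective weight down to zero, while the extra SCHED_LOAD_RESOLUTION bits keep it representable:

```python
# Toy model of fixed-point weight attenuation in a deep cgroup
# hierarchy: each level keeps 1/3 of the parent's weight (3 equal
# sibling groups), using integer division as fixed-point math does.

NICE_0 = 1024
LEVELS = 7      # /l1/l2/.../A at depth 7, where the issue appeared
SIBLINGS = 3

def effective_weight(resolution_bits):
    """Integer fixed-point weight of a task as seen from the root."""
    w = NICE_0 << resolution_bits   # scale_load()
    for _ in range(LEVELS):
        w //= SIBLINGS              # each level divides among siblings
    return w

lo = effective_weight(0)    # no extra resolution (32-bit case)
hi = effective_weight(10)   # SCHED_LOAD_RESOLUTION = 10 on 64-bit

print(lo, hi)   # lo collapses to 0, hi keeps a usable non-zero value
```

With no extra resolution the weight underflows to zero at exactly this depth, so the load balancer effectively stops seeing the group; with 10 extra bits there is still a usable value, which matches "a little more 'room' for deeper hierarchies".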
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Thu, May 15, 2014 at 04:46:28PM +0800, Michael wang wrote: > On 05/15/2014 04:35 PM, Peter Zijlstra wrote: > > On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote: > >> But for the dbench, stress combination, that's not spin-wasted, dbench > >> throughput do dropped, how could we explain that one? > > > > I've no clue what dbench does.. At this point you'll have to > > expose/trace the per-task runtime accounting for these tasks and ideally > > also the things the cgroup code does with them to see if it still makes > > sense. > > I see :) > > BTW, some interesting thing we found during the dbench/stress testing > is, by doing: > > echo 24000 > /proc/sys/kernel/sched_latency_ns > echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features > > that is sched_latency_ns increased around 10 times and > GENTLE_FAIR_SLEEPERS was disabled, the dbench got it's CPU back. > > However, when the group level is too deep, that doesn't works any more... > > I'm not sure but seems like 'deep group level' and 'vruntime bonus for > sleeper' is the keep points here, will try to list the root cause after > more investigation, thanks for the hints and suggestions, really helpful ;-) How deep is deep? You run into numerical problems quite quickly, esp. when you've got lots of CPUs. We've only got 64bit to play with, that said there were some patches... What happens if you do the below, Google has been running with that, and nobody was ever able to reproduce the report that got it disabled. diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h index b2cbe81308af..e40819d39c69 100644 --- a/kernel/sched/sched.h +++ b/kernel/sched/sched.h @@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq); * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the * increased costs. 
*/ -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */ +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load */ # define SCHED_LOAD_RESOLUTION 10 # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION) # define scale_load_down(w) ((w) >> SCHED_LOAD_RESOLUTION)
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/15/2014 04:35 PM, Peter Zijlstra wrote: > On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote: >> But for the dbench, stress combination, that's not spin-wasted, dbench >> throughput do dropped, how could we explain that one? > > I've no clue what dbench does.. At this point you'll have to > expose/trace the per-task runtime accounting for these tasks and ideally > also the things the cgroup code does with them to see if it still makes > sense. I see :) BTW, an interesting thing we found during the dbench/stress testing is that by doing: echo 24000 > /proc/sys/kernel/sched_latency_ns echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features that is, sched_latency_ns increased around 10 times and GENTLE_FAIR_SLEEPERS disabled, the dbench got its CPU back. However, when the group level is too deep, that doesn't work any more... I'm not sure, but it seems like 'deep group level' and 'vruntime bonus for sleeper' are the key points here; will try to list the root cause after more investigation, thanks for the hints and suggestions, really helpful ;-) Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote: > But for the dbench, stress combination, that's not spin-wasted, dbench > throughput do dropped, how could we explain that one? I've no clue what dbench does.. At this point you'll have to expose/trace the per-task runtime accounting for these tasks and ideally also the things the cgroup code does with them to see if it still makes sense.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/14/2014 05:44 PM, Peter Zijlstra wrote: [snip] >> and then: >> echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l >> echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l >> echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50 >> >> the results in top is around: >> >> A B C >> CPU%550 550 100 > > top doesn't do per-cgroup accounting, so how do you get these numbers, > per the above all instances of the prog are also called the same, > further making it error prone and difficult to get sane numbers. Oh, my bad to make it confusing; I was checking the PID of each my_tool instance inside top, like: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 24968 root 20 0 55600 720 648 S 558.1 0.0 2:08.76 my_tool 24984 root 20 0 55600 720 648 S 536.2 0.0 1:10.29 my_tool 25001 root 20 0 55600 720 648 S 88.6 0.0 0:04.39 my_tool By 'cat /sys/fs/cgroup/cpu/C/tasks' I got that the PID of './my_tool 50' is 25001, and all its pthreads' %CPU was counted in; could we check like that? > > [snip] >> void consume(int spin, int total) >> { >> unsigned long long begin, now; >> begin = stamp(); >> >> for (;;) { >> pthread_mutex_lock(&my_mutex); >> now = stamp(); >> if ((long long)(now - begin) > spin) { >> pthread_mutex_unlock(&my_mutex); >> usleep(total - spin); >> pthread_mutex_lock(&my_mutex); >> begin += total; >> } >> pthread_mutex_unlock(&my_mutex); >> } >> } > > Uh,.. that's just insane.. what's the point of having a multi-threaded > program do busy-wait loops if you then serialize the lot on a global > mutex such that only 1 thread can run at any one time? > > How can one such prog ever consume more than 100% cpu. That's a good point...
however, top shows that when only './my_tool 50' (25001) is running, it uses around 300%, like below: PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 25001 root 20 0 55600 720 648 S 284.3 0.0 5:18.00 my_tool 2376 root 20 0 950m 85m 29m S 4.4 0.2 163:47.94 python 1658 root 20 0 1013m 19m 11m S 3.0 0.1 97:06.11 libvirtd IMHO, if the pthread mutex behaves similarly to the kernel one, then it may not go to sleep when it's the only thing running on the CPU. Oh, I think we got the reason here: when there are other tasks running, the mutex goes to sleep and the %CPU drops to the serialized case, which is around 100%. But for the dbench, stress combination, that's not spin-wasted, dbench throughput do dropped, how could we explain that one? Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Wed, May 14, 2014 at 03:36:50PM +0800, Michael wang wrote: > distro mount cpu-subsys under '/sys/fs/cgroup/cpu', create group like: > mkdir /sys/fs/cgroup/cpu/A > mkdir /sys/fs/cgroup/cpu/B > mkdir /sys/fs/cgroup/cpu/C Yeah, distro is on crack, nobody sane mounts anything there. > and then: > echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l > echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l > echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50 > > the results in top is around: > > A B C > CPU%550 550 100 top doesn't do per-cgroup accounting, so how do you get these numbers, per the above all instances of the prog are also called the same, further making it error prone and difficult to get sane numbers. > #include > #include > #include > #include > > pthread_mutex_t my_mutex; > > unsigned long long stamp(void) > { > struct timeval tv; > gettimeofday(&tv, NULL); > > return (unsigned long long)tv.tv_sec * 100 + tv.tv_usec; > } > void consume(int spin, int total) > { > unsigned long long begin, now; > begin = stamp(); > > for (;;) { > pthread_mutex_lock(&my_mutex); > now = stamp(); > if ((long long)(now - begin) > spin) { > pthread_mutex_unlock(&my_mutex); > usleep(total - spin); > pthread_mutex_lock(&my_mutex); > begin += total; > } > pthread_mutex_unlock(&my_mutex); > } > } Uh,.. that's just insane.. what's the point of having a multi-threaded program do busy-wait loops if you then serialize the lot on a global mutex such that only 1 thread can run at any one time? How can one such prog ever consume more than 100% cpu.
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
Hi, Peter On 05/13/2014 10:23 PM, Peter Zijlstra wrote: [snip] > > I you want to investigate !spinners, replace the ABC with slightly more > complex loads like: https://lkml.org/lkml/2012/6/18/212 I've done a little reform, enabled multi-threads and added a mutex; please check the code below for details. I built it by: gcc -o my_tool cgroup_tool.c -lpthread distro mount cpu-subsys under '/sys/fs/cgroup/cpu', create group like: mkdir /sys/fs/cgroup/cpu/A mkdir /sys/fs/cgroup/cpu/B mkdir /sys/fs/cgroup/cpu/C and then: echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50 the results in top are around: A B C CPU% 550 550 100 While only './my_tool 50' was running, it required around 300%. And this can also be reproduced by a dbench, stress combination like: echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6 echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress -c 6 echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress -c 6 Now it seems more like a generic problem...
will keep investigating, please let me know if there are any suggestions :) Regards, Michael Wang

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

pthread_mutex_t my_mutex;

unsigned long long stamp(void)
{
        struct timeval tv;
        gettimeofday(&tv, NULL);

        return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

void consume(int spin, int total)
{
        unsigned long long begin, now;
        begin = stamp();

        for (;;) {
                pthread_mutex_lock(&my_mutex);
                now = stamp();
                if ((long long)(now - begin) > spin) {
                        pthread_mutex_unlock(&my_mutex);
                        usleep(total - spin);
                        pthread_mutex_lock(&my_mutex);
                        begin += total;
                }
                pthread_mutex_unlock(&my_mutex);
        }
}

struct my_data {
        int spin;
        int total;
};

void *my_fn_sleepy(void *arg)
{
        struct my_data *data = (struct my_data *)arg;
        consume(data->spin, data->total);
        return NULL;
}

void *my_fn_loop(void *arg)
{
        while (1) {};
        return NULL;
}

int main(int argc, char **argv)
{
        int period = 100000; /* 100ms */
        int frac;
        struct my_data data;
        pthread_t last_thread;
        int thread_num = sysconf(_SC_NPROCESSORS_ONLN) / 2;
        void *(*my_fn)(void *arg) = &my_fn_sleepy;

        if (thread_num <= 0 || thread_num > 1024) {
                fprintf(stderr, "insane processor(half) size %d\n", thread_num);
                return -1;
        }

        if (argc == 2 && !strcmp(argv[1], "-l")) {
                my_fn = &my_fn_loop;
                printf("loop mode enabled\n");
                goto loop_mode;
        }

        if (argc < 2) {
                fprintf(stderr, "%s <frac> [period]\n"
                        "  frac   -- [1-100] %% of time to burn\n"
                        "  period -- [usec] period of burn/sleep cycle\n",
                        argv[0]);
                return -1;
        }

        frac = atoi(argv[1]);
        if (argc > 2)
                period = atoi(argv[2]);

        if (frac > 100)
                frac = 100;
        if (frac < 1)
                frac = 1;

        data.spin = (period * frac) / 100;
        data.total = period;

loop_mode:
        pthread_mutex_init(&my_mutex, NULL);

        while (thread_num--) {
                if (pthread_create(&last_thread, NULL, my_fn, &data)) {
                        fprintf(stderr, "Create thread failed\n");
                        return -1;
                }
        }

        printf("Threads never stop, CTRL + C to terminate\n");
        pthread_join(last_thread, NULL);
        pthread_mutex_destroy(&my_mutex); /* won't happen */

        return 0;
}
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/13/2014 10:23 PM, Peter Zijlstra wrote: [snip] > > The point remains though, don't use massive and awkward software stacks > that are impossible to operate. > > I you want to investigate !spinners, replace the ABC with slightly more > complex loads like: https://lkml.org/lkml/2012/6/18/212 That's what we need; maybe a little reform to enable multi-threads, or maybe add some locks... anyway, will redo the test and see what we can find :) Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/13/2014 09:36 PM, Rik van Riel wrote: [snip] >> >> echo 2048 > /cgroup/c/cpu.shares >> >> Where [ABC].sh are spinners: > I suspect the "are spinners" is key. > > Infinite loops can run all the time, while dbench spends a lot of > its time waiting for locks. That waiting may interfere with getting > as much CPU as it wants. That's what we are thinking; we also assume that by introducing the load decay mechanism, it becomes harder for the sleepy tasks to gain enough slice. Well, that's currently just imagination, more investigation is needed ;-) Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/13/2014 05:47 PM, Peter Zijlstra wrote: > On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote: >> During our testing, we found that the cpu.shares doesn't work as >> expected, the testing is: >> > > /me zaps all the kvm nonsense as that's non reproducable and only serves > to annoy. > > Pro-tip: never use kvm to report cpu-cgroup issues. Makes sense. > [snip] > for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END > {print t}' ; done Enjoyable :) > 639.7 > 629.8 > 1127.4 > > That is of course not perfect, but it's close enough. Yeah, for a CPU-intensive workload the shares do work very well; the issue only appears when the workload starts to become some kind of... sleepy. I will use the tool you mentioned for the following investigation, thanks for the suggestion. > > Now you again.. :-) And here I am ;-) Regards, Michael Wang
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Tue, May 13, 2014 at 09:36:20AM -0400, Rik van Riel wrote: > On 05/13/2014 05:47 AM, Peter Zijlstra wrote: > > On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote: > >> During our testing, we found that the cpu.shares doesn't work as > >> expected, the testing is: > >> > > > > /me zaps all the kvm nonsense as that's non reproducable and only serves > > to annoy. > > > > Pro-tip: never use kvm to report cpu-cgroup issues. > > > >> So is this results expected (I really do not think so...)? > >> > >> Or that imply the cpu-cgroup got some issue to be fixed? > > > > So what I did (WSM-EP 2x6x2): > > > > mount none /cgroup -t cgroup -o cpu > > mkdir -p /cgroup/a > > mkdir -p /cgroup/b > > mkdir -p /cgroup/c > > > > echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done > > echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done > > echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done > > > > echo 2048 > /cgroup/c/cpu.shares > > > > Where [ABC].sh are spinners: > > I suspect the "are spinners" is key. > > Infinite loops can run all the time, while dbench spends a lot of > its time waiting for locks. That waiting may interfere with getting > as much CPU as it wants. At which point it becomes an entirely different problem and the weight things become far more 'interesting'. The point remains though, don't use massive and awkward software stacks that are impossible to operate. If you want to investigate !spinners, replace the ABC with slightly more complex loads like: https://lkml.org/lkml/2012/6/18/212
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On 05/13/2014 05:47 AM, Peter Zijlstra wrote: > On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote: >> During our testing, we found that the cpu.shares doesn't work as >> expected, the testing is: >> > > /me zaps all the kvm nonsense as that's non reproducable and only serves > to annoy. > > Pro-tip: never use kvm to report cpu-cgroup issues. > >> So is this results expected (I really do not think so...)? >> >> Or that imply the cpu-cgroup got some issue to be fixed? > > So what I did (WSM-EP 2x6x2): > > mount none /cgroup -t cgroup -o cpu > mkdir -p /cgroup/a > mkdir -p /cgroup/b > mkdir -p /cgroup/c > > echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done > echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done > echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done > > echo 2048 > /cgroup/c/cpu.shares > > Where [ABC].sh are spinners: I suspect the "are spinners" is key. Infinite loops can run all the time, while dbench spends a lot of its time waiting for locks. That waiting may interfere with getting as much CPU as it wants. -- All rights reversed
Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote: > During our testing, we found that the cpu.shares doesn't work as > expected, the testing is: > /me zaps all the kvm nonsense as that's non reproducable and only serves to annoy. Pro-tip: never use kvm to report cpu-cgroup issues. > So is this results expected (I really do not think so...)? > > Or that imply the cpu-cgroup got some issue to be fixed? So what I did (WSM-EP 2x6x2): mount none /cgroup -t cgroup -o cpu mkdir -p /cgroup/a mkdir -p /cgroup/b mkdir -p /cgroup/c echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done echo 2048 > /cgroup/c/cpu.shares Where [ABC].sh are spinners: --- #!/bin/bash while :; do :; done --- for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END {print t}' ; done 639.7 629.8 1127.4 That is of course not perfect, but it's close enough. Now you again.. :-)
[ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?
During our testing, we found that the cpu.shares doesn't work as expected, the testing is: X86 HOST: 12 CPU GUEST(KVM): 6 VCPU We create 3 GUESTs, each with 1024 shares, the workload inside them is: GUEST_1: dbench 6 GUEST_2: stress -c 6 GUEST_3: stress -c 6 So in theory, each GUEST will get (1024 / (3 * 1024)) * 1200% == 400% according to the group shares (the 3 groups are created by the virtual manager on the same level, and they are the only groups heavily running in the system). Now if only GUEST_1 is running, it gets 300% CPU, which is 1/4 of the whole CPU resource. So when all 3 GUESTs run concurrently, we expect: GUEST_1 GUEST_2 GUEST_3 CPU% 300% 450% 450% That is, GUEST_1 gets the 300% it requires, and the unused 100% is shared by the remaining groups. But the result is: GUEST_1 GUEST_2 GUEST_3 CPU% 40% 580% 580% GUEST_1 failed to gain the CPU it required, and the dbench inside it dropped a lot in performance. So is this results expected (I really do not think so...)? Or that imply the cpu-cgroup got some issue to be fixed? Any comments are welcomed :) Regards, Michael Wang
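The expectation above can be written out as a small sketch of the proportional-share arithmetic (top-style percentages; the numbers come straight from the report):

```python
# Fair-share arithmetic from the report: 3 sibling groups with equal
# cpu.shares (1024) on a 12-CPU host, i.e. 1200% in top's units.

TOTAL_CPU = 1200.0
shares = {"GUEST_1": 1024, "GUEST_2": 1024, "GUEST_3": 1024}

# Proportional entitlement of each group.
total_shares = sum(shares.values())
entitled = {g: TOTAL_CPU * s / total_shares for g, s in shares.items()}

# GUEST_1 (dbench) only wants ~300%; a work-conserving scheduler
# should hand its unused entitlement to the two busy stress groups.
demand_guest1 = 300.0
used_guest1 = min(demand_guest1, entitled["GUEST_1"])
leftover_each = (TOTAL_CPU - used_guest1) / 2

print(entitled["GUEST_1"], used_guest1, leftover_each)  # 400.0 300.0 450.0
```

The measured 40% / 580% / 580% is far below GUEST_1's entitlement of 400%, let alone its modest 300% demand, which is what makes the result look like a bug rather than fair sharing.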