Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-23 Thread Michael wang
Hi, Peter

Thanks for the reply :)

On 06/23/2014 05:42 PM, Peter Zijlstra wrote:
[snip]
>>
>>  cpu 0   cpu 1
>>
>>  dbench  task_sys
>>  dbench  task_sys
>>  dbench
>>  dbench
>>  dbench
>>  dbench
>>  task_sys
>>  task_sys
> 
> It might help if you prefix each task with the cgroup they're in;

My bad...

> but I
> think I get it, its like:
> 
>   cpu0
> 
>   A/dbench
>   A/dbench
>   A/dbench
>   A/dbench
>   A/dbench
>   A/dbench
>   /task_sys
>   /task_sys

Yeah, it's like that.

> 
[snip]
> 
>   cpu0
> 
>   A/B/dbench
>   A/B/dbench
>   A/B/dbench
>   A/B/dbench
>   A/B/dbench
>   A/B/dbench
>   /task_sys
>   /task_sys
> 
> Right?

My bad, I missed the group prefix here... it's actually like:

cpu0

/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/l1/A/dbench
/task_sys
/task_sys

And we also have six:

/l1/B/stress

and six:

/l1/C/stress

running in the system.

A, B and C are the child groups of l1.

> 
>>  cpu 0   cpu 1
>>  load1024/3 + 1024*2 1024*2
>>
>>  2389 : 2048 imbalance %116
> 
> Which should still end up with 3072, because A is still 1024 in total,
> and all its member tasks run on the one CPU.

l1 has 3 child groups, each with 6 nice-0 tasks, so ideally each task gets 1024/18 of the root weight, and the 6 dbench tasks together account for (1024/18)*6 == 1024/3.

Previously each of the 3 groups had 1024 shares directly at the root level; now they have to split l1's 1024 shares, so each of them ends up with less.
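
To make the arithmetic above concrete, here is a small standalone C calculation (not kernel code) of the two cases; the 125% threshold is only an assumption for illustration, the real value lives in sd->imbalance_pct and differs per domain:

#include <stdio.h>

/* Standalone illustration of the figures above (not kernel code).
 * Assumes nice-0 weight 1024 and, purely for illustration, a 125%
 * imbalance threshold.
 */
int main(void)
{
	int task_sys = 1024;			/* one nice-0 root task */
	int thresh   = 125;			/* hypothetical imbalance threshold */
	int cpu1     = 2 * task_sys;		/* 2048: two task_sys on cpu 1 */

	/* l1 case: the dbench group sits directly under root, so the six
	 * gathered dbench tasks contribute the group's full 1024 to cpu 0 */
	int cpu0_l1 = 1024 + 2 * task_sys;	/* 3072 */

	/* l2 case: l1 is worth 1024 at root and has three busy children,
	 * so the dbench child contributes only ~1024/3 to cpu 0 */
	int cpu0_l2 = 1024 / 3 + 2 * task_sys;	/* 2389 */

	printf("l1: %d : %d -> %d%% (%s)\n", cpu0_l1, cpu1,
	       cpu0_l1 * 100 / cpu1,
	       cpu0_l1 * 100 / cpu1 > thresh ? "imbalanced" : "balanced");
	printf("l2: %d : %d -> %d%% (%s)\n", cpu0_l2, cpu1,
	       cpu0_l2 * 100 / cpu1,
	       cpu0_l2 * 100 / cpu1 > thresh ? "imbalanced" : "balanced");
	return 0;
}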

> 
>> And it could be even less during my testing...
> 
> Well, yes, up to 1024/nr_cpus I imagine.
> 
>> This is just try to explain that when 'group_load : rq_load' become
>> lower, it's influence to 'rq_load' become lower too, and if the system
>> is balanced with only 'rq_load' there, it will be considered still
>> balanced even 'group_load' gathered on one cpu.
>>
>> Please let me know if I missed something here...
> 
> Yeah, what other tasks are these task_sys things? workqueue crap?

There are some other tasks, but the ones that show up most are the kworkers, yes, the workqueue stuff.

They pop up briefly on each CPU; in periods when they show up a lot they eat some CPU% too, but not very much.

> 
[snip]
>>
>> These are dbench and stress with less root-load when put into l2-groups,
>> that make it harder to trigger root-group imbalance like in the case above.
> 
> You're still not making sense here.. without the task_sys thingies in
> you get something like:
> 
>  cpu0 cpu1
> 
>  A/dbench A/dbench
>  B/stress B/stress
> 
> And the total loads are: 512+512 vs 512+512.

Without other tasks' influence I believe the balance would be fine, but in our case at least these kworkers join the battle anyway...

> 
>>> Same with l2, total weight of 1024, giving a per task weight of ~56 and
>>> a per-cpu weight of ~85, which is again significant.
>>
>> We have other tasks which has to running in the system, in order to
>> serve dbench and others, and that also the case in real world, dbench
>> and stress are not the only tasks on rq time to time.
>>
>> May be we could focus on the case above and see if it could make things
>> more clear firstly?
> 
> Well, this all smells like you need some cgroup affinity for whatever
> system tasks are running. Not fuck up the scheduler for no sane reason.

These kworkers are bound to their CPUs already; I don't know how to handle them to prevent the issue. They just keep working on their CPU, and whenever they show up, dbench fails to spread out...

We just want some way to help a workload like dbench work normally with cpu-cgroup when stress-like workloads are running in the system.

We want dbench to gain more CPU%, but cpu.shares doesn't work as expected... dbench can get no more than 100% no matter how big its group's shares are, and we consider cpu-cgroup broken in this case...

I agree that this is not a generic requirement and that the scheduler should only be responsible for the general situation, but since this is really too big a regression, could we at least provide some way to stop the damage? After all, most of the cpu-cgroup logic lives inside the scheduler...

I'd like to list some real numbers in the patch thread; we really need some way to make cpu-cgroup perform normally on workloads like dbench. We have also found some transaction workloads suffering from this issue; in such cases cpu-cgroup simply fails at managing the CPU resources...

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-23 Thread Peter Zijlstra
On Wed, Jun 11, 2014 at 05:18:29PM +0800, Michael wang wrote:
> On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> IMHO, when we put tasks one group deeper, in other word the totally
> >> weight of these tasks is 1024 (prev is 3072), the load become more
> >> balancing in root, which make bl-routine consider the system is
> >> balanced, which make we migrate less in lb-routine.
> > 
> > But how? The absolute value (1024 vs 3072) is of no effect to the
> > imbalance, the imbalance is computed from relative differences between
> > cpus.
> 
> Ok, forgive me for the confusion, please allow me to explain things
> again, for gathered cases like:
> 
>   cpu 0   cpu 1
> 
>   dbench  task_sys
>   dbench  task_sys
>   dbench
>   dbench
>   dbench
>   dbench
>   task_sys
>   task_sys

It might help if you prefix each task with the cgroup they're in; but I
think I get it, its like:

cpu0

A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
A/dbench
/task_sys
/task_sys

> task_sys is other tasks belong to root which is nice 0, so when dbench
> in l1:
> 
>   cpu 0   cpu 1
>   load1024 + 1024*2   1024*2
> 
>   3072: 2048  imbalance %150
> 
> now when they belong to l2:

That would be:

cpu0

A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
A/B/dbench
/task_sys
/task_sys

Right?

>   cpu 0   cpu 1
>   load1024/3 + 1024*2 1024*2
> 
>   2389 : 2048 imbalance %116

Which should still end up with 3072, because A is still 1024 in total,
and all its member tasks run on the one CPU.

> And it could be even less during my testing...

Well, yes, up to 1024/nr_cpus I imagine.

> This is just try to explain that when 'group_load : rq_load' become
> lower, it's influence to 'rq_load' become lower too, and if the system
> is balanced with only 'rq_load' there, it will be considered still
> balanced even 'group_load' gathered on one cpu.
> 
> Please let me know if I missed something here...

Yeah, what other tasks are these task_sys things? workqueue crap?

> >> Exactly, however, when group is deep, the chance of it to make root
> >> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> >> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> >> imbalance and gain help from the routine, please note that although
> >> dbench and stress are the only workload in system, there are still other
> >> tasks serve for the system need to be wakeup (some very actively since
> >> the dbench...), compared to them, deep group load means nothing...
> > 
> > What tasks are these? And is it their interference that disturbs
> > load-balancing?
> 
> These are dbench and stress with less root-load when put into l2-groups,
> that make it harder to trigger root-group imbalance like in the case above.

You're still not making sense here.. without the task_sys thingies in
you get something like:

 cpu0   cpu1

 A/dbench   A/dbench
 B/stress   B/stress

And the total loads are: 512+512 vs 512+512.

> > Same with l2, total weight of 1024, giving a per task weight of ~56 and
> > a per-cpu weight of ~85, which is again significant.
> 
> We have other tasks which has to running in the system, in order to
> serve dbench and others, and that also the case in real world, dbench
> and stress are not the only tasks on rq time to time.
> 
> May be we could focus on the case above and see if it could make things
> more clear firstly?

Well, this all smells like you need some cgroup affinity for whatever
system tasks are running. Not fuck up the scheduler for no sane reason.


Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-11 Thread Michael wang
On 06/11/2014 04:24 PM, Peter Zijlstra wrote:
[snip]
>>
>> IMHO, when we put tasks one group deeper, in other word the totally
>> weight of these tasks is 1024 (prev is 3072), the load become more
>> balancing in root, which make bl-routine consider the system is
>> balanced, which make we migrate less in lb-routine.
> 
> But how? The absolute value (1024 vs 3072) is of no effect to the
> imbalance, the imbalance is computed from relative differences between
> cpus.

Ok, forgive me for the confusion, please allow me to explain things
again, for gathered cases like:

cpu 0   cpu 1

dbench  task_sys
dbench  task_sys
dbench
dbench
dbench
dbench
task_sys
task_sys

task_sys stands for other tasks that belong to root and are nice 0, so when dbench is in l1:

        cpu 0           cpu 1
load    1024 + 1024*2   1024*2

        3072 : 2048     imbalance 150%

now when they belong to l2:

        cpu 0           cpu 1
load    1024/3 + 1024*2 1024*2

        2389 : 2048     imbalance 116%

And it could be even less during my testing...

This is just trying to explain that when the 'group_load : rq_load' ratio becomes lower, the group's influence on 'rq_load' becomes lower too, and if the system is balanced by looking only at 'rq_load', it will still be considered balanced even though 'group_load' is gathered on one cpu.

Please let me know if I missed something here...

> 
[snip]
>>
>> Although the l1-group gain the same resources (1200%), it doesn't assign
>> to l2-ABC correctly like the root-group did.
> 
> But in this case select_idle_sibling() should function identially, so
> that cannot be the problem.

Yes, that part is clear; select_idle_sibling() just returns the curr or prev cpu in this case.

> 
[snip]
>>
>> Exactly, however, when group is deep, the chance of it to make root
>> imbalance reduced, in good case, gathered on cpu means 1024 load, while
>> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
>> imbalance and gain help from the routine, please note that although
>> dbench and stress are the only workload in system, there are still other
>> tasks serve for the system need to be wakeup (some very actively since
>> the dbench...), compared to them, deep group load means nothing...
> 
> What tasks are these? And is it their interference that disturbs
> load-balancing?

These are the dbench and stress tasks, which carry less root load once put into l2 groups, and that makes it harder to trigger a root-group imbalance like in the case above.

> 
 By which means even tasks in deep group all gathered on one CPU, the load
 could still balanced from the view of root group, and the tasks lost the
 only chances (balance) to spread when they already on the same CPU...
>>>
>>> Sure, but see above.
>>
>> The lb-routine could not provide enough help for deep group, since the
>> imbalance happened inside the group could not cause imbalance in root,
>> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
>> easily ignored, but inside the l2-group, the gathered case could already
>> means imbalance like (1024 * 5) : 1024.
> 
> your explanation is not making sense, we have 3 cgroups, so the total
> root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

I meant the l2-groups case here... since l1's share is 1024, the total load of the l2 groups will be 1024 in theory.

> 
> And again, the absolute value doesn't matter, with (istr) 12 cpus the
> avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
> scale.
> 
> Same with l2, total weight of 1024, giving a per task weight of ~56 and
> a per-cpu weight of ~85, which is again significant.

We have other tasks which have to run in the system in order to serve dbench and the rest, and that is also the case in the real world; dbench and stress are not the only tasks on the rq from time to time.

Maybe we could focus on the case above first and see if that makes things clearer?

Regards,
Michael Wang

> 
> Also, you said load-balance doesn't usually participate much because
> dbench is too fast, so please make up your mind, does it or doesn't it
> matter?
> 
>>> So I think that approach is wrong, select_idle_siblings() works because
>>> we want to keep CPUs from being idle, but if they're not actually idle,
>>> pretending like they are (in a cgroup) is actively wrong and can skew
>>> load pretty bad.
>>
>> We only choose the timing when no idle cpu located, and flips is
>> somewhat high, also the group is deep.
> 
> -enotmakingsense
> 
>> In such cases, select_idle_siblings() doesn't works anyway, it return
>> the target even it is very busy, we just check twice to prevent it from
>> making some obviously bad decision ;-)
> 
> -emakinglesssense
> 
>>> Furthermore, if as I expect, dbench sucks on a busy system, then the
>>> proposed cgroup thing 

Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-11 Thread Peter Zijlstra
On Wed, Jun 11, 2014 at 02:13:42PM +0800, Michael wang wrote:
> Hi, Peter
> 
> Thanks for the reply :)
> 
> On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
> [snip]
> >> Wake-affine for sure pull tasks together for workload like dbench, what 
> >> make
> >> it difference when put dbench into a group one level deeper is the
> >> load-balance, which happened less.
> > 
> > We load-balance less (frequently) or we migrate less tasks due to
> > load-balancing ?
> 
> IMHO, when we put tasks one group deeper, in other word the totally
> weight of these tasks is 1024 (prev is 3072), the load become more
> balancing in root, which make bl-routine consider the system is
> balanced, which make we migrate less in lb-routine.

But how? The absolute value (1024 vs 3072) is of no effect to the
imbalance, the imbalance is computed from relative differences between
cpus.

> Our comparison is based on the same busy-system, all the two cases have
> the same workload running, the only difference is that we put the same
> workload (dbench + stress) one group deeper, it's like:
> 
> Good case:
>   root
>   l1-Al1-Bl1-C
>   dbench  stress  stress
> 
>   results:
>   dbench got around 300%
>   each stress got around 450%
> 
> Bad case:
>   root
>   l1
>   l2-Al2-Bl2-C
>   dbench  stress  stress
> 
>   results:
>   dbench got around 100% (throughout dropped too)
>   each stress got around 550%
> 
> Although the l1-group gain the same resources (1200%), it doesn't assign
> to l2-ABC correctly like the root-group did.

But in this case select_idle_sibling() should function identically, so
that cannot be the problem.

> > The second is adding the cgroup crap on.
> > 
> >> However, in our cases the load balance could not help on that, since deeper
> >> the group is, less the load effect it means to root group.
> > 
> > But since all actual load is on the same depth, the relative threshold
> > (imbalance pct) should work the same, the size of the values don't
> > matter, the relative ratios do.
> 
> Exactly, however, when group is deep, the chance of it to make root
> imbalance reduced, in good case, gathered on cpu means 1024 load, while
> in bad case it dropped to 1024/3 ideally, that make it harder to trigger
> imbalance and gain help from the routine, please note that although
> dbench and stress are the only workload in system, there are still other
> tasks serve for the system need to be wakeup (some very actively since
> the dbench...), compared to them, deep group load means nothing...

What tasks are these? And is it their interference that disturbs
load-balancing?

> >> By which means even tasks in deep group all gathered on one CPU, the load
> >> could still balanced from the view of root group, and the tasks lost the
> >> only chances (balance) to spread when they already on the same CPU...
> > 
> > Sure, but see above.
> 
> The lb-routine could not provide enough help for deep group, since the
> imbalance happened inside the group could not cause imbalance in root,
> ideally each l2-task will gain 1024/18 ~= 56 root-load, which could be
> easily ignored, but inside the l2-group, the gathered case could already
> means imbalance like (1024 * 5) : 1024.

your explanation is not making sense, we have 3 cgroups, so the total
root weight is at least 3072, with 18 tasks you would get 3072/18 ~ 170.

And again, the absolute value doesn't matter, with (istr) 12 cpus the
avg cpu load would be 3072/12 ~ 256, and 170 is significant on that
scale.

Same with l2, total weight of 1024, giving a per task weight of ~56 and
a per-cpu weight of ~85, which is again significant.

Also, you said load-balance doesn't usually participate much because
dbench is too fast, so please make up your mind, does it or doesn't it
matter?

> > So I think that approach is wrong, select_idle_siblings() works because
> > we want to keep CPUs from being idle, but if they're not actually idle,
> > pretending like they are (in a cgroup) is actively wrong and can skew
> > load pretty bad.
> 
> We only choose the timing when no idle cpu located, and flips is
> somewhat high, also the group is deep.

-enotmakingsense

> In such cases, select_idle_siblings() doesn't works anyway, it return
> the target even it is very busy, we just check twice to prevent it from
> making some obviously bad decision ;-)

-emakinglesssense

> > Furthermore, if as I expect, dbench sucks on a busy system, then the
> > proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> > alter behaviour like that.
> 
> That's true and that's why we currently still need to shut down the
> GENTLE_FAIR_SLEEPERS feature, but that's another problem we need to
> solve later...

more confusion..

> What we currently expect is that the cgroup assign the resource
> according to the share, it works well in l1-groups, so we expect it to
> wor

Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-10 Thread Michael wang
Hi, Peter

Thanks for the reply :)

On 06/10/2014 08:12 PM, Peter Zijlstra wrote:
[snip]
>> Wake-affine for sure pull tasks together for workload like dbench, what make
>> it difference when put dbench into a group one level deeper is the
>> load-balance, which happened less.
> 
> We load-balance less (frequently) or we migrate less tasks due to
> load-balancing ?

IMHO, when we put the tasks one group deeper, in other words the total weight of these tasks becomes 1024 (previously 3072), the load looks more balanced at the root, which makes the balance routine consider the system balanced, which in turn makes us migrate less in the lb-routine.

> 
>> Usually, when system is busy, during the wakeup when we could not locate
>> idle cpu, we pick the search point instead, whatever how busy it is since
>> we count on the balance routine later to help balance the load.
> 
> But above you said that dbench usually triggers the wake-affine logic,
> but now you say it doesn't and we rely on select_idle_sibling?

During wakeup it triggers wake-affine; after that we go into select_idle_sibling(), find no idle cpu, and then pick the search point instead (the curr cpu if wake-affine won, or the prev cpu if not).
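
As an aside for readers following the thread, here is a toy, compilable sketch of the decision just described; the helpers are hypothetical stand-ins, the real logic lives in select_task_rq_fair() / select_idle_sibling():

#include <stdio.h>
#include <stdbool.h>

/* Hypothetical stand-ins: dbench-style wakeups usually pass the
 * wake-affine test, and on a fully busy box no idle CPU is found. */
static bool wake_affine_ok(void)    { return true; }
static int  find_idle_llc_cpu(void) { return -1; }

/* Toy model of the wakeup placement described above. */
static int select_wake_cpu(int prev_cpu, int waker_cpu)
{
	/* wake-affine picks the starting point of the search */
	int target = wake_affine_ok() ? waker_cpu : prev_cpu;

	/* scan for an idle CPU in the LLC domain ... */
	int idle = find_idle_llc_cpu();
	if (idle >= 0)
		return idle;

	/* ... none found: fall back to the (possibly very busy) search
	 * point and count on the periodic load balancer later on */
	return target;
}

int main(void)
{
	printf("woken task placed on cpu %d\n", select_wake_cpu(3, 0));
	return 0;
}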

> 
> Note that the comparison isn't fair, running dbench on an idle system vs
> running dbench on a busy system is the first step.

Our comparison is based on the same busy system: both cases have the same workload running, and the only difference is that we put the workload (dbench + stress) one group deeper. It's like:

Good case:
        root
        l1-A    l1-B    l1-C
        dbench  stress  stress

        results:
        dbench got around 300%
        each stress got around 450%

Bad case:
        root
          l1
        l2-A    l2-B    l2-C
        dbench  stress  stress

        results:
        dbench got around 100% (throughput dropped too)
        each stress got around 550%

Although the l1 group gains the same resources (1200%), it does not distribute them to l2-A/B/C correctly the way the root group did.

> 
> The second is adding the cgroup crap on.
> 
>> However, in our cases the load balance could not help on that, since deeper
>> the group is, less the load effect it means to root group.
> 
> But since all actual load is on the same depth, the relative threshold
> (imbalance pct) should work the same, the size of the values don't
> matter, the relative ratios do.

Exactly; however, when the group is deep, its chance of creating a root-level imbalance is reduced. In the good case the tasks gathered on one cpu mean 1024 load there, while in the bad case it drops to 1024/3 ideally, which makes it harder to trigger an imbalance and get help from the routine. Please note that although dbench and stress are the only workload in the system, there are still other tasks serving the system that need to be woken up (some very actively, because of dbench...); compared to them, a deep group's load means nothing...

> 
>> By which means even tasks in deep group all gathered on one CPU, the load
>> could still balanced from the view of root group, and the tasks lost the
>> only chances (balance) to spread when they already on the same CPU...
> 
> Sure, but see above.

The lb-routine cannot provide enough help for a deep group, since an imbalance inside the group does not cause an imbalance at the root; ideally each l2 task contributes 1024/18 ~= 56 of root load, which is easily ignored, while inside the l2 group the gathered case may already mean an imbalance like (1024 * 5) : 1024.

> 
>> Furthermore, for tasks flip frequently like dbench, it'll become far more
>> harder for load balance to help, it could even rarely catch them on rq.
> 
> And I suspect that is the main problem; so see what it does on a busy
> system: !cgroup: nr_cpus busy loops + dbench, because that's your
> benchmark for adding cgroups, the cgroup can only shift that behaviour
> around.

There are busy loops in the good case too, and dbench's behaviour in the l1 group should not change after we put it into an l2 group; what makes things worse is that the chance for the tasks to spread out once gathered becomes smaller.

> 
[snip]
>> Below patch has solved the problem during the testing, I'd like to do more
>> testing on other benchmarks before send out the formal patch, any comments
>> are welcomed ;-)
> 
> So I think that approach is wrong, select_idle_siblings() works because
> we want to keep CPUs from being idle, but if they're not actually idle,
> pretending like they are (in a cgroup) is actively wrong and can skew
> load pretty bad.

We only pick that timing when no idle cpu can be located, the wakeup flips are somewhat high, and the group is deep.

In such cases select_idle_sibling() doesn't really work anyway; it returns the target even if it is very busy. We just check twice to prevent it from making an obviously bad decision ;-)

> 
> Furthermore, if as I expect, dbench sucks on a busy system, then the
> proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
> alter behaviour like that.

Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-10 Thread Peter Zijlstra
On Tue, Jun 10, 2014 at 04:56:12PM +0800, Michael wang wrote:
> On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
> [snip]
> > 
> > Hmm, that _should_ more or less work and does indeed suggest there's
> > something iffy.
> > 
> 
> I think we locate the reason why cpu-cgroup doesn't works well on dbench
> now... finally.
> 
> I'd like to link the reproduce way of the issue here since long time
> passed...
> 
>   https://lkml.org/lkml/2014/5/16/4
> 
> Now here is the analysis:
> 
> So our problem is when put tasks like dbench which sleep and wakeup each other
> frequently into a deep-group, they will gathered on same CPU when workload 
> like
> stress are running, which lead to that the whole group could gain no more than
> one CPU.
> 
> Basically there are two key points here, load-balance and wake-affine.
> 
> Wake-affine for sure pull tasks together for workload like dbench, what make
> it difference when put dbench into a group one level deeper is the
> load-balance, which happened less.

We load-balance less (frequently) or we migrate less tasks due to
load-balancing ?

> Usually, when system is busy, during the wakeup when we could not locate
> idle cpu, we pick the search point instead, whatever how busy it is since
> we count on the balance routine later to help balance the load.

But above you said that dbench usually triggers the wake-affine logic,
but now you say it doesn't and we rely on select_idle_sibling?

Note that the comparison isn't fair, running dbench on an idle system vs
running dbench on a busy system is the first step.

The second is adding the cgroup crap on.

> However, in our cases the load balance could not help on that, since deeper
> the group is, less the load effect it means to root group.

But since all actual load is on the same depth, the relative threshold
(imbalance pct) should work the same, the size of the values don't
matter, the relative ratios do.

> By which means even tasks in deep group all gathered on one CPU, the load
> could still balanced from the view of root group, and the tasks lost the
> only chances (balance) to spread when they already on the same CPU...

Sure, but see above.

> Furthermore, for tasks flip frequently like dbench, it'll become far more
> harder for load balance to help, it could even rarely catch them on rq.

And I suspect that is the main problem; so see what it does on a busy
system: !cgroup: nr_cpus busy loops + dbench, because that's your
benchmark for adding cgroups, the cgroup can only shift that behaviour
around.

> So in such cases, the only chance to do balance for these tasks is during
> the wakeup, however it will be expensive...
> 
> Thus the cheaper way is something just like select_idle_sibling(), the only
> difference is now we balance tasks inside the group to prevent them from
> gathered.
> 
> Below patch has solved the problem during the testing, I'd like to do more
> testing on other benchmarks before send out the formal patch, any comments
> are welcomed ;-)

So I think that approach is wrong, select_idle_siblings() works because
we want to keep CPUs from being idle, but if they're not actually idle,
pretending like they are (in a cgroup) is actively wrong and can skew
load pretty bad.

Furthermore, if as I expect, dbench sucks on a busy system, then the
proposed cgroup thing is wrong, as a cgroup isn't supposed to radically
alter behaviour like that.

More so, I suspect that patch will tend to overload cpu0 (and lower cpu
numbers in general -- because its scanning in the same direction for
each cgroup) for other workloads. You can't just go pile more and more
work on cpu0 just because there's nothing running in this particular
cgroup.

So dbench is very sensitive to queueing, and select_idle_siblings()
avoids a lot of queueing on an idle system. I don't think that's
something we should fix with cgroups.





Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-06-10 Thread Michael wang
On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
> 
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.
> 

I think we have located the reason why cpu-cgroup doesn't work well with dbench now... finally.

I'd like to link to the way to reproduce the issue here, since a long time has passed...

https://lkml.org/lkml/2014/5/16/4

Now here is the analysis:

So our problem is that when we put tasks like dbench, which sleep and wake each other up frequently, into a deep group, they end up gathered on the same CPU while a workload like stress is running, which means the whole group can gain no more than one CPU.

Basically there are two key points here, load-balance and wake-affine.

Wake-affine certainly pulls tasks together for a workload like dbench; what makes the difference when dbench is put into a group one level deeper is load balancing, which happens less.

Usually, when the system is busy and we cannot locate an idle cpu during wakeup, we pick the search point instead, however busy it is, since we count on the balance routine to help spread the load later.

However, in our case load balancing cannot help with that, since the deeper the group is, the less its load matters to the root group.
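
A toy illustration of that point, assuming equal shares and three busy sibling groups per level (as in the dbench/stress setup); it is not kernel code, just the division the hierarchy implies:

#include <stdio.h>

/* Toy model (not kernel code): a group directly under root is worth its
 * full 1024 shares there, but every extra nesting level splits the
 * parent's weight among its runnable children.  Assumes 3 busy siblings
 * per level, purely for illustration.
 */
int main(void)
{
	double contrib = 1024.0;	/* gathered group's worth at the root */

	for (int depth = 1; depth <= 7; depth++) {
		printf("depth %d: gathered group adds ~%.0f to the root rq load\n",
		       depth, contrib);
		contrib /= 3.0;		/* one more level of 3 siblings */
	}
	return 0;
}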

Which means that even when the tasks of a deep group are all gathered on one CPU, the load can still look balanced from the root group's point of view, and the tasks lose their only chance (the balancer) to spread out once they are already on the same CPU...

Furthermore, for tasks that flip on and off the runqueue frequently, like dbench, it becomes far harder for load balancing to help; it may only rarely even catch them on the rq.

So in such cases the only chance to balance these tasks is during wakeup, however doing that fully would be expensive...

Thus the cheaper way is something just like select_idle_sibling(); the only difference is that now we balance tasks inside the group to prevent them from gathering.

The patch below solved the problem during testing; I'd like to do more testing on other benchmarks before sending out a formal patch. Any comments are welcome ;-)

Regards,
Michael Wang



diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fea7d33..e1381cd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4409,6 +4409,62 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
return idlest;
 }
 
+static inline int tg_idle_cpu(struct task_group *tg, int cpu)
+{
+   return !tg->cfs_rq[cpu]->nr_running;
+}
+
+/*
+ * Try to locate an idle CPU in the sched_domain from tg's point of view.
+ *
+ * Although being gathered on one CPU and being spread across CPUs can
+ * make no difference from the highest group's view, gathering starves
+ * the tasks: even if they have enough shares to fight for CPU, they
+ * only get one battlefield, which means that no matter how big their
+ * weight is, they get at most one CPU in total.
+ *
+ * Thus, when the system is busy, we filter out those tasks which could
+ * not get help from the balance routine and try to balance them
+ * internally with this function, so they stand a chance to show their
+ * power.
+ */
+static int tg_idle_sibling(struct task_struct *p, int target)
+{
+   struct sched_domain *sd;
+   struct sched_group *sg;
+   int i = task_cpu(p);
+   struct task_group *tg = task_group(p);
+
+   if (tg_idle_cpu(tg, target))
+   goto done;
+
+   sd = rcu_dereference(per_cpu(sd_llc, target));
+   for_each_lower_domain(sd) {
+   sg = sd->groups;
+   do {
+   if (!cpumask_intersects(sched_group_cpus(sg),
+   tsk_cpus_allowed(p)))
+   goto next;
+
+   for_each_cpu(i, sched_group_cpus(sg)) {
+   if (i == target || !tg_idle_cpu(tg, i))
+   goto next;
+   }
+
+   target = cpumask_first_and(sched_group_cpus(sg),
+   tsk_cpus_allowed(p));
+
+   goto done;
+next:
+   sg = sg->next;
+   } while (sg != sd->groups);
+   }
+
+done:
+
+   return target;
+}
+
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
@@ -4417,6 +4473,7 @@ static int select_idle_sibling(struct task_struct *p, int target)
struct sched_domain *sd;
struct sched_group *sg;
int i = task_cpu(p);
+   struct sched_entity *se = task_group(p)->se[i];
 
if (idle_cpu(target))
return target;
@@ -4451,6 +4508,30 @@ next:
} while (sg != sd->groups);
}
 done:
+
+   if (!idle_cpu(target)) {
+   /*
+* No idle cpu located imply the system is somewhat busy,
+* usually we count on load balance routine's help and
+* just pick the target whatever how busy it is.
+*
+* However, when task belong to a 

Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-16 Thread Michael wang
On 05/16/2014 03:54 PM, Peter Zijlstra wrote:
[snip]
>>> Right.  I played a little (sane groups), saw load balancing as well.
>>
>> Yeah, now we found that even l2 groups will face the same issue, allow
>> me to re-list the details here:
> 
> Hmm, that _should_ more or less work and does indeed suggest there's
> something iffy.

Yeah, a sane group topology has the issue too... besides the sleeper bonus, it seems the root cause is that the tasks start to gather. I plan to check the difference in task load between the two cases and see if there is a good way to solve this problem :)

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-16 Thread Peter Zijlstra
On Fri, May 16, 2014 at 12:24:35PM +0800, Michael wang wrote:
> Hey, Mike :)
> 
> On 05/16/2014 10:51 AM, Mike Galbraith wrote:
> > On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:
> > 
> >> But we found that one difference when group get deeper is the tasks of
> >> that group become to gathered on CPU more often, some time all the
> >> dbench instances was running on the same CPU, this won't happen for l1
> >> group, may could explain why dbench could not get CPU more than 100% any
> >> more.
> > 
> > Right.  I played a little (sane groups), saw load balancing as well.
> 
> Yeah, now we found that even l2 groups will face the same issue, allow
> me to re-list the details here:

Hmm, that _should_ more or less work and does indeed suggest there's
something iffy.




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-16 Thread Peter Zijlstra
On Fri, May 16, 2014 at 10:23:11AM +0800, Michael wang wrote:
> On 05/15/2014 07:57 PM, Peter Zijlstra wrote:
> [snip]
> >>
> >> It's like:
> >>
> >>/cgroup/cpu/l1/l2/l3/l4/l5/l6/A
> >>
> >> about level 7, the issue can not be solved any more.
> > 
> > That's pretty retarded and yeah, that's way past the point where things
> > make sense. You might be lucky and have l1-5 as empty/pointless
> > hierarchy so the effective depth is less and then things will work, but
> > *shees*..
> 
> Exactly, that's the simulation of cgroup topology setup by libvirt,
> really doesn't make sense... rather torture than deployment, but they do
> make things like that...

I'm calling it broken and unfit for purpose if it does crazy shit like
that.

There's really not much we can do to fix it either, barring softfloat in
the load-balancer and I'm sure everybody but virt wankers will complain
about _that_.




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Michael wang
Hey, Mike :)

On 05/16/2014 10:51 AM, Mike Galbraith wrote:
> On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:
> 
>> But we found that one difference when group get deeper is the tasks of
>> that group become to gathered on CPU more often, some time all the
>> dbench instances was running on the same CPU, this won't happen for l1
>> group, may could explain why dbench could not get CPU more than 100% any
>> more.
> 
> Right.  I played a little (sane groups), saw load balancing as well.

Yeah, now we found that even l2 groups will face the same issue, allow
me to re-list the details here:

First, apply the workaround (10x latency):
echo 24000 > /proc/sys/kernel/sched_latency_ns
echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

This workaround may be related to another issue, the vruntime bonus for sleepers, but let's put that aside for now and focus on the gathering issue.

Create groups like:
mkdir /sys/fs/cgroup/cpu/A
mkdir /sys/fs/cgroup/cpu/B
mkdir /sys/fs/cgroup/cpu/C

mkdir /sys/fs/cgroup/cpu/l1
mkdir /sys/fs/cgroup/cpu/l1/A
mkdir /sys/fs/cgroup/cpu/l1/B
mkdir /sys/fs/cgroup/cpu/l1/C

Run workload like (6 is half of the CPUS on my box):
echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6
echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress 6
echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress 6

Checking top, each dbench instance gets around 45%, about 270% in total; this is close to the dbench-only case (300%) since we use the workaround, otherwise we would see it at around 100%, but that's another issue...

Sampling /proc/sched_debug, we rarely see more than 2 dbench instances on the same rq.
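
One possible way to automate that sampling is sketched below; the exact layout of /proc/sched_debug is not a stable ABI and varies between kernel versions, so treat it only as an approximation of the manual check described here:

#include <stdio.h>
#include <string.h>

/* Rough sampler: count how many dbench tasks /proc/sched_debug lists
 * per runqueue by scanning for "dbench" entries between "cpu#" headers.
 */
int main(void)
{
	FILE *f = fopen("/proc/sched_debug", "r");
	char line[512];
	int cpu = -1, count = 0;

	if (!f) {
		perror("fopen");
		return 1;
	}

	while (fgets(line, sizeof(line), f)) {
		if (!strncmp(line, "cpu#", 4)) {
			if (cpu >= 0 && count)
				printf("cpu%d: %d dbench on rq\n", cpu, count);
			sscanf(line, "cpu#%d", &cpu);
			count = 0;
		} else if (strstr(line, "dbench")) {
			count++;
		}
	}
	if (cpu >= 0 && count)
		printf("cpu%d: %d dbench on rq\n", cpu, count);

	fclose(f);
	return 0;
}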

Now re-run workload like:
echo $$ > /sys/fs/cgroup/cpu/l1/A/tasks ; dbench 6
echo $$ > /sys/fs/cgroup/cpu/l1/B/tasks ; stress 6
echo $$ > /sys/fs/cgroup/cpu/l1/C/tasks ; stress 6

Checking top, each dbench instance gets around 20%, about 120% in total, sometimes dropping under 100%, and dbench throughput drops.

Sampling /proc/sched_debug, we frequently see 4 or 5 dbench instances on the same rq.

So just one level deeper, from l1 to l2, makes such a big difference, and groups with the same shares do not share the resources equally...

BTW, by binding each dbench instance to a different CPU (see the sketch below), dbench in the l2 groups regains all of its CPU%, i.e. 300%.
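
A minimal sketch of what that binding amounts to at the syscall level, using sched_setaffinity(2); in the test above the existing dbench processes were pinned from outside (e.g. by PID), so this shows only the mechanism, not the exact procedure used:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Pin the calling process to one CPU; running one instance per CPU
 * gives the "each dbench on a different CPU" layout described above.
 */
int main(int argc, char **argv)
{
	int cpu = argc > 1 ? atoi(argv[1]) : 0;
	cpu_set_t set;

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);

	/* pid 0 means "the calling process" */
	if (sched_setaffinity(0, sizeof(set), &set)) {
		perror("sched_setaffinity");
		return 1;
	}

	printf("pinned to cpu %d\n", cpu);
	pause();	/* stay around so the effect can be observed */
	return 0;
}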

I'll keep investigating and try to figure out why the l2 group's tasks start to gather; please let me know if there are any suggestions ;-)

Regards,
Michael Wang

> 
> -Mike
> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Mike Galbraith
On Fri, 2014-05-16 at 10:23 +0800, Michael wang wrote:

> But we found that one difference when group get deeper is the tasks of
> that group become to gathered on CPU more often, some time all the
> dbench instances was running on the same CPU, this won't happen for l1
> group, may could explain why dbench could not get CPU more than 100% any
> more.

Right.  I played a little (sane groups), saw load balancing as well.

-Mike



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Michael wang
On 05/15/2014 07:57 PM, Peter Zijlstra wrote:
[snip]
>>
>> It's like:
>>
>>  /cgroup/cpu/l1/l2/l3/l4/l5/l6/A
>>
>> about level 7, the issue can not be solved any more.
> 
> That's pretty retarded and yeah, that's way past the point where things
> make sense. You might be lucky and have l1-5 as empty/pointless
> hierarchy so the effective depth is less and then things will work, but
> *shees*..

Exactly, that's a simulation of the cgroup topology set up by libvirt; it really doesn't make sense... more torture than deployment, but they do make things like that...

> 
[snip]
>> I'm not sure which account will turns to be huge when group get deeper,
>> the load accumulation will suffer discount when passing up, isn't it?
>>
> 
> It'll use 20 bits for precision instead of 10, so it gives a little more
> 'room' for deeper hierarchies/big cpu-count.

Got it :)

> 
> All assuming you're running 64bit kernels of course.

Yes, it's 64-bit. I tried the test with this feature on, but it doesn't seem to address the issue...

But we found one difference when the group gets deeper: the tasks of that group gather on one CPU more often, and sometimes all the dbench instances were running on the same CPU. This doesn't happen for the l1 group, which may explain why dbench could not get more than 100% CPU any more.

But why the gathering happens when the group gets deeper is unclear... will try to figure it out :)

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Peter Zijlstra
On Thu, May 15, 2014 at 05:35:25PM +0800, Michael wang wrote:
> On 05/15/2014 05:06 PM, Peter Zijlstra wrote:
> [snip]
> >> However, when the group level is too deep, that doesn't works any more...
> >>
> >> I'm not sure but seems like 'deep group level' and 'vruntime bonus for
> >> sleeper' is the keep points here, will try to list the root cause after
> >> more investigation, thanks for the hints and suggestions, really helpful 
> >> ;-)
> > 
> > How deep is deep? You run into numerical problems quite quickly, esp.
> > when you've got lots of CPUs. We've only got 64bit to play with, that
> > said there were some patches...
> 
> It's like:
> 
>   /cgroup/cpu/l1/l2/l3/l4/l5/l6/A
> 
> about level 7, the issue can not be solved any more.

That's pretty retarded and yeah, that's way past the point where things
make sense. You might be lucky and have l1-5 as empty/pointless
hierarchy so the effective depth is less and then things will work, but
*shees*..

> > -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
> > under light load  */
> > +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage 
> > under light load  */
> 
> That is trying to solve the load overflow issue, correct?
> 
> I'm not sure which account will turns to be huge when group get deeper,
> the load accumulation will suffer discount when passing up, isn't it?
> 

It'll use 20 bits for precision instead of 10, so it gives a little more
'room' for deeper hierarchies/big cpu-count.

All assuming you're running 64bit kernels of course.
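
A toy user-space illustration (not kernel code) of what the extra resolution buys; the numbers (3 siblings per level, 7 levels, 12 CPUs) just mirror this thread and are otherwise arbitrary:

#include <stdio.h>

/* Scale a nice-0 weight with 10 and then 20 bits of fixed-point
 * resolution, divide it down a deep hierarchy and across CPUs, and
 * compare how much precision is left for further arithmetic.
 */
int main(void)
{
	const int resolutions[] = { 10, 20 };

	for (int r = 0; r < 2; r++) {
		int shift = resolutions[r];
		unsigned long long w = 1024ULL << shift;	/* scale_load(1024) */

		for (int level = 0; level < 7; level++)
			w /= 3;		/* split among 3 sibling groups per level */
		w /= 12;		/* spread across 12 CPUs */

		printf("%2d-bit resolution: remaining per-cpu share value %llu\n",
		       shift, w);
	}
	return 0;
}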




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Michael wang
On 05/15/2014 05:06 PM, Peter Zijlstra wrote:
[snip]
>> However, when the group level is too deep, that doesn't works any more...
>>
>> I'm not sure but seems like 'deep group level' and 'vruntime bonus for
>> sleeper' is the keep points here, will try to list the root cause after
>> more investigation, thanks for the hints and suggestions, really helpful ;-)
> 
> How deep is deep? You run into numerical problems quite quickly, esp.
> when you've got lots of CPUs. We've only got 64bit to play with, that
> said there were some patches...

It's like:

/cgroup/cpu/l1/l2/l3/l4/l5/l6/A

about level 7, the issue can not be solved any more.

> 
> What happens if you do the below, Google has been running with that, and
> nobody was ever able to reproduce the report that got it disabled.
> 
> 
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b2cbe81308af..e40819d39c69 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq);
>   * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
>   * increased costs.
>   */
> -#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> +#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */

That is trying to solve the load overflow issue, correct?

I'm not sure which accounting value turns huge when the group gets deeper; the load accumulation gets discounted as it is passed up, doesn't it?

Anyway, I will give it a try and see what happens :)

Regards,
Michael Wang

>  # define SCHED_LOAD_RESOLUTION   10
>  # define scale_load(w)   ((w) << SCHED_LOAD_RESOLUTION)
>  # define scale_load_down(w)  ((w) >> SCHED_LOAD_RESOLUTION)
> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Peter Zijlstra
On Thu, May 15, 2014 at 04:46:28PM +0800, Michael wang wrote:
> On 05/15/2014 04:35 PM, Peter Zijlstra wrote:
> > On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
> >> But for the dbench, stress combination, that's not spin-wasted, dbench
> >> throughput do dropped, how could we explain that one?
> > 
> > I've no clue what dbench does.. At this point you'll have to
> > expose/trace the per-task runtime accounting for these tasks and ideally
> > also the things the cgroup code does with them to see if it still makes
> > sense.
> 
> I see :)
> 
> BTW, some interesting thing we found during the dbench/stress testing
> is, by doing:
> 
>   echo 24000 > /proc/sys/kernel/sched_latency_ns
> echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features
> 
> that is sched_latency_ns increased around 10 times and
> GENTLE_FAIR_SLEEPERS was disabled, the dbench got it's CPU back.
> 
> However, when the group level is too deep, that doesn't works any more...
> 
> I'm not sure but seems like 'deep group level' and 'vruntime bonus for
> sleeper' is the keep points here, will try to list the root cause after
> more investigation, thanks for the hints and suggestions, really helpful ;-)

How deep is deep? You run into numerical problems quite quickly, esp.
when you've got lots of CPUs. We've only got 64bit to play with, that
said there were some patches...

What happens if you do the below, Google has been running with that, and
nobody was ever able to reproduce the report that got it disabled.



diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b2cbe81308af..e40819d39c69 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -40,7 +40,7 @@ extern void update_cpu_load_active(struct rq *this_rq);
  * when BITS_PER_LONG <= 32 are pretty high and the returns do not justify the
  * increased costs.
  */
-#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
+#if 1 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
 # define SCHED_LOAD_RESOLUTION 10
 # define scale_load(w) ((w) << SCHED_LOAD_RESOLUTION)
 # define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Michael wang
On 05/15/2014 04:35 PM, Peter Zijlstra wrote:
> On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
>> But for the dbench, stress combination, that's not spin-wasted, dbench
>> throughput do dropped, how could we explain that one?
> 
> I've no clue what dbench does.. At this point you'll have to
> expose/trace the per-task runtime accounting for these tasks and ideally
> also the things the cgroup code does with them to see if it still makes
> sense.

I see :)

BTW, something interesting we found during the dbench/stress testing is that by doing:

echo 24000 > /proc/sys/kernel/sched_latency_ns
echo NO_GENTLE_FAIR_SLEEPERS > /sys/kernel/debug/sched_features

that is, with sched_latency_ns increased around 10 times and GENTLE_FAIR_SLEEPERS disabled, dbench got its CPU back.

However, when the group level is too deep, that doesn't work any more...

I'm not sure, but it seems like 'deep group level' and 'vruntime bonus for sleepers' are the key points here; I will try to lay out the root cause after more investigation. Thanks for the hints and suggestions, really helpful ;-)

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-15 Thread Peter Zijlstra
On Thu, May 15, 2014 at 11:46:06AM +0800, Michael wang wrote:
> But for the dbench, stress combination, that's not spin-wasted, dbench
> throughput do dropped, how could we explain that one?

I've no clue what dbench does.. At this point you'll have to
expose/trace the per-task runtime accounting for these tasks and ideally
also the things the cgroup code does with them to see if it still makes
sense.




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-14 Thread Michael wang
On 05/14/2014 05:44 PM, Peter Zijlstra wrote:
[snip]
>> and then:
>>  echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
>>  echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
>>  echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50
>>
>> the results in top is around:
>>
>>  A   B   C
>>  CPU%550 550 100
> 
> top doesn't do per-cgroup accounting, so how do you get these numbers,
> per the above all instances of the prog are also called the same,
> further making it error prone and difficult to get sane numbers.

Oh, my bad for making it confusing; I was checking the PIDs of the my_tool instances inside top, like:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
24968 root  20   0 55600  720  648 S 558.1  0.0   2:08.76 my_tool   
24984 root  20   0 55600  720  648 S 536.2  0.0   1:10.29 my_tool   
25001 root  20   0 55600  720  648 S 88.6  0.0   0:04.39 my_tool

From 'cat /sys/fs/cgroup/cpu/C/tasks' I got that the PID of './my_tool 50' is 25001, and all of its pthreads' %CPU is counted in; can we check it like that?
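
As an alternative to eyeballing per-PID numbers in top, one could read the group's cpuacct.usage (total CPU time in nanoseconds) directly; the sketch below assumes the cpuacct controller is available and co-mounted with cpu so that the path shown exists, which is an assumption about this particular setup:

#include <stdio.h>
#include <unistd.h>

/* Sample a cgroup's cpuacct.usage twice and turn the delta into a CPU%.
 * The path is an assumed layout for this setup; adjust it to wherever
 * the cpuacct controller is actually mounted.
 */
static unsigned long long read_usage(const char *path)
{
	unsigned long long ns = 0;
	FILE *f = fopen(path, "r");

	if (f) {
		if (fscanf(f, "%llu", &ns) != 1)
			ns = 0;
		fclose(f);
	}
	return ns;
}

int main(void)
{
	const char *path = "/sys/fs/cgroup/cpu/C/cpuacct.usage"; /* assumed */
	unsigned long long before = read_usage(path);

	sleep(1);

	/* 1e9 ns of CPU time per wall-clock second == 100% of one CPU */
	printf("group C: ~%.1f%% CPU\n",
	       (read_usage(path) - before) / 1e9 * 100.0);
	return 0;
}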

> 
> 
[snip]
>> void consume(int spin, int total)
>> {
>>  unsigned long long begin, now;
>>  begin = stamp();
>>
>>  for (;;) {
>>  pthread_mutex_lock(&my_mutex);
>>  now = stamp();
>>  if ((long long)(now - begin) > spin) {
>>  pthread_mutex_unlock(&my_mutex);
>>  usleep(total - spin);
>>  pthread_mutex_lock(&my_mutex);
>>  begin += total;
>>  }
>>  pthread_mutex_unlock(&my_mutex);
>>  }
>> }
> 
> Uh,.. that's just insane.. what's the point of having a multi-threaded
> program do busy-wait loops if you then serialize the lot on a global
> mutex such that only 1 thread can run at any one time?
> 
> How can one such prog ever consume more than 100% cpu.

That's a good point... however top shows that when only './my_tool 50' (PID 25001) is running, it uses around 300%, like below:

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
25001 root  20   0 55600  720  648 S 284.3  0.0   5:18.00 my_tool   
 2376 root  20   0  950m  85m  29m S  4.4  0.2 163:47.94 python 
 1658 root  20   0 1013m  19m  11m S  3.0  0.1  97:06.11 libvirtd

IMHO, if the pthread mutex behaved similarly to the kernel one, it might not go to sleep when its owner is the only thing running on the CPU.

Oh, I think we have the reason here: when there are other tasks running, the mutex does go to sleep, and the %CPU drops to the serialized case, which is around 100%.

But the dbench + stress combination is not spin-wasted; dbench throughput really did drop, so how do we explain that one?

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-14 Thread Peter Zijlstra
On Wed, May 14, 2014 at 03:36:50PM +0800, Michael wang wrote:
> distro mount cpu-subsys under '/sys/fs/cgroup/cpu', create group like:
>   mkdir /sys/fs/cgroup/cpu/A
>   mkdir /sys/fs/cgroup/cpu/B
>   mkdir /sys/fs/cgroup/cpu/C

Yeah, distro is on crack, nobody sane mounts anything there.

> and then:
>   echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
>   echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
>   echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50
> 
> the results in top is around:
> 
>   A   B   C
>   CPU%550 550 100

top doesn't do per-cgroup accounting, so how do you get these numbers,
per the above all instances of the prog are also called the same,
further making it error prone and difficult to get sane numbers.


> #include <stdio.h>
> #include <stdlib.h>
> #include <string.h>
> #include <unistd.h>
> #include <pthread.h>
> #include <sys/time.h>
> 
> pthread_mutex_t my_mutex;
> 
> unsigned long long stamp(void)
> {
>   struct timeval tv;
>   gettimeofday(&tv, NULL);
> 
>   return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
> }
> void consume(int spin, int total)
> {
>   unsigned long long begin, now;
>   begin = stamp();
> 
>   for (;;) {
>   pthread_mutex_lock(&my_mutex);
>   now = stamp();
>   if ((long long)(now - begin) > spin) {
>   pthread_mutex_unlock(&my_mutex);
>   usleep(total - spin);
>   pthread_mutex_lock(&my_mutex);
>   begin += total;
>   }
>   pthread_mutex_unlock(&my_mutex);
>   }
> }

Uh,.. that's just insane.. what's the point of having a multi-threaded
program do busy-wait loops if you then serialize the lot on a global
mutex such that only 1 thread can run at any one time?

How can one such prog ever consume more than 100% cpu.




Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-14 Thread Michael wang
Hi, Peter

On 05/13/2014 10:23 PM, Peter Zijlstra wrote:
[snip]
> 
> I you want to investigate !spinners, replace the ABC with slightly more
> complex loads like: https://lkml.org/lkml/2012/6/18/212

I've reworked it a little, enabled multiple threads and added a mutex; please check the code below for details.

I built it by:
gcc -o my_tool cgroup_tool.c -lpthread

The distro mounts the cpu subsystem under '/sys/fs/cgroup/cpu'; create groups like:
mkdir /sys/fs/cgroup/cpu/A
mkdir /sys/fs/cgroup/cpu/B
mkdir /sys/fs/cgroup/cpu/C

and then:
echo $$ > /sys/fs/cgroup/cpu/A/tasks ; ./my_tool -l
echo $$ > /sys/fs/cgroup/cpu/B/tasks ; ./my_tool -l
echo $$ > /sys/fs/cgroup/cpu/C/tasks ; ./my_tool 50

the results in top is around:

A   B   C
CPU%550 550 100

Whereas when only './my_tool 50' is running, it takes around 300%.

And this can also be reproduced with a dbench + stress combination like:
echo $$ > /sys/fs/cgroup/cpu/A/tasks ; dbench 6
echo $$ > /sys/fs/cgroup/cpu/B/tasks ; stress -c 6
echo $$ > /sys/fs/cgroup/cpu/C/tasks ; stress -c 6

Now it seems more like a generic problem... I will keep investigating; please let me know if there are any suggestions :)

Regards,
Michael Wang



#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/time.h>

pthread_mutex_t my_mutex;

unsigned long long stamp(void)
{
struct timeval tv;
gettimeofday(&tv, NULL);

return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
}
void consume(int spin, int total)
{
unsigned long long begin, now;
begin = stamp();

for (;;) {
pthread_mutex_lock(&my_mutex);
now = stamp();
if ((long long)(now - begin) > spin) {
pthread_mutex_unlock(&my_mutex);
usleep(total - spin);
pthread_mutex_lock(&my_mutex);
begin += total;
}
pthread_mutex_unlock(&my_mutex);
}
}

struct my_data {
int spin;
int total;
};

void *my_fn_sleepy(void *arg)
{
struct my_data *data = (struct my_data *)arg;
consume(data->spin, data->total);
return NULL;
}

void *my_fn_loop(void *arg)
{
while (1) {};
return NULL;
}

int main(int argc, char **argv)
{
int period = 10; /* 100ms */
int frac;
struct my_data data;
pthread_t last_thread;
int thread_num = sysconf(_SC_NPROCESSORS_ONLN) / 2;
void *(*my_fn)(void *arg) = &my_fn_sleepy;

if (thread_num <= 0 || thread_num > 1024) {
fprintf(stderr, "insane processor(half) size %d\n", thread_num);
return -1;
}

if (argc == 2 && !strcmp(argv[1], "-l")) {
my_fn = &my_fn_loop;
printf("loop mode enabled\n");
goto loop_mode;
}

if (argc < 2) {
fprintf(stderr, "%s <frac> [<period>]\n"
"  frac   -- [1-100] %% of time to burn\n"
"  period -- [usec] period of burn/sleep cycle\n",
argv[0]);
return -1;
}

frac = atoi(argv[1]);
if (argc > 2)
period = atoi(argv[2]);
if (frac > 100)
frac = 100;
if (frac < 1)
frac = 1;

data.spin = (period * frac) / 100;
data.total = period;

loop_mode:
pthread_mutex_init(&my_mutex, NULL);
while (thread_num--) {
if (pthread_create(&last_thread, NULL, my_fn, &data)) {
fprintf(stderr, "Create thread failed\n");
return -1;
}
}

printf("Threads never stop, CTRL + C to terminate\n");

pthread_join(last_thread, NULL);
pthread_mutex_destroy(&my_mutex);   //won't happen
return 0;
}



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Michael wang
On 05/13/2014 10:23 PM, Peter Zijlstra wrote:
[snip]
> 
> The point remains though, don't use massive and awkward software stacks
> that are impossible to operate.
> 
> If you want to investigate !spinners, replace the ABC with slightly more
> complex loads like: https://lkml.org/lkml/2012/6/18/212

That's what we need; maybe with a little rework to enable multi-threading, or
maybe by adding some locks... anyway, I will redo the test and see what we
can find :)

Regards,
Michael Wang



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Michael wang
On 05/13/2014 09:36 PM, Rik van Riel wrote:
[snip]
>>
>> echo 2048 > /cgroup/c/cpu.shares
>>
>> Where [ABC].sh are spinners:
> 
> I suspect the "are spinners" is key.
> 
> Infinite loops can run all the time, while dbench spends a lot of
> its time waiting for locks. That waiting may interfere with getting
> as much CPU as it wants.

That's what we are thinking too; we also suspect that with the load-decay
mechanism it becomes harder for sleepy tasks to gain enough slice. Well, that
is currently just speculation, more investigation is needed ;-)
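
To make that hypothesis a bit more concrete, below is a minimal user-space
sketch (a toy model, not the kernel's code) of how a PELT-like geometric
decay makes a 50%-duty task look much "lighter" than a spinner at the moment
it wakes up; the 1ms period and y^32 == 0.5 are assumptions, and it needs
'gcc ... -lm' to build:

#include <stdio.h>
#include <math.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* per-1ms decay factor, y^32 == 0.5 */
	double spinner = 0.0, sleepy = 0.0, at_wakeup = 0.0;
	int ms;

	for (ms = 0; ms < 2000; ms++) {
		spinner = spinner * y + 1024.0;		/* runs every period */
		/* 50ms run / 50ms sleep cycle */
		sleepy = sleepy * y + ((ms % 100) < 50 ? 1024.0 : 0.0);
		if (ms % 100 == 99)			/* sampled right before it wakes up */
			at_wakeup = sleepy;
	}
	/* prints roughly 47800 vs 12100: far below half of the spinner's
	 * value, even though the task runs 50% of the time on average */
	printf("spinner ~%.0f, 50%%-duty task at wakeup ~%.0f\n",
	       spinner, at_wakeup);
	return 0;
}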

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Michael wang
On 05/13/2014 05:47 PM, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
>> During our testing, we found that the cpu.shares doesn't work as
>> expected, the testing is:
>>
> 
> /me zaps all the kvm nonsense as that's non reproducable and only serves
> to annoy.
> 
> Pro-tip: never use kvm to report cpu-cgroup issues.

Makes sense.

> 
[snip]
> for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END {print t}' ; done

Enjoyable :)

> 639.7
> 629.8
> 1127.4
> 
> That is of course not perfect, but it's close enough.

Yeah, for CPU-intensive workloads the shares do work very well; the
issue only appears when the workload starts to become some kind of... sleepy.

I will use the tool you mentioned for the following investigation,
thanks for the suggestion.

> 
> Now you again.. :-)

And here I am ;-)

Regards,
Michael Wang

> 



Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Peter Zijlstra
On Tue, May 13, 2014 at 09:36:20AM -0400, Rik van Riel wrote:
> On 05/13/2014 05:47 AM, Peter Zijlstra wrote:
> > On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
> >> During our testing, we found that the cpu.shares doesn't work as
> >> expected, the testing is:
> >>
> > 
> > /me zaps all the kvm nonsense as that's non reproducable and only serves
> > to annoy.
> > 
> > Pro-tip: never use kvm to report cpu-cgroup issues.
> > 
> >> So is this results expected (I really do not think so...)?
> >>
> >> Or that imply the cpu-cgroup got some issue to be fixed?
> > 
> > So what I did (WSM-EP 2x6x2):
> > 
> > mount none /cgroup -t cgroup -o cpu
> > mkdir -p /cgroup/a
> > mkdir -p /cgroup/b
> > mkdir -p /cgroup/c
> > 
> > echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
> > echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
> > echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done
> > 
> > echo 2048 > /cgroup/c/cpu.shares
> > 
> > Where [ABC].sh are spinners:
> 
> I suspect the "are spinners" is key.
> 
> Infinite loops can run all the time, while dbench spends a lot of
> its time waiting for locks. That waiting may interfere with getting
> as much CPU as it wants.

At which point it becomes an entirely different problem and the weight
things become far more 'interesting'.

The point remains though, don't use massive and awkward software stacks
that are impossible to operate.

If you want to investigate !spinners, replace the ABC with slightly more
complex loads like: https://lkml.org/lkml/2012/6/18/212


Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Rik van Riel
On 05/13/2014 05:47 AM, Peter Zijlstra wrote:
> On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
>> During our testing, we found that the cpu.shares doesn't work as
>> expected, the testing is:
>>
> 
> /me zaps all the kvm nonsense as that's non reproducable and only serves
> to annoy.
> 
> Pro-tip: never use kvm to report cpu-cgroup issues.
> 
>> So is this results expected (I really do not think so...)?
>>
>> Or that imply the cpu-cgroup got some issue to be fixed?
> 
> So what I did (WSM-EP 2x6x2):
> 
> mount none /cgroup -t cgroup -o cpu
> mkdir -p /cgroup/a
> mkdir -p /cgroup/b
> mkdir -p /cgroup/c
> 
> echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
> echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
> echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done
> 
> echo 2048 > /cgroup/c/cpu.shares
> 
> Where [ABC].sh are spinners:

I suspect the "are spinners" is key.

Infinite loops can run all the time, while dbench spends a lot of
its time waiting for locks. That waiting may interfere with getting
as much CPU as it wants.

-- 
All rights reversed


Re: [ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-13 Thread Peter Zijlstra
On Tue, May 13, 2014 at 11:34:43AM +0800, Michael wang wrote:
> During our testing, we found that the cpu.shares doesn't work as
> expected, the testing is:
> 

/me zaps all the kvm nonsense as that's non-reproducible and only serves
to annoy.

Pro-tip: never use kvm to report cpu-cgroup issues.

> So is this results expected (I really do not think so...)?
> 
> Or that imply the cpu-cgroup got some issue to be fixed?

So what I did (WSM-EP 2x6x2):

mount none /cgroup -t cgroup -o cpu
mkdir -p /cgroup/a
mkdir -p /cgroup/b
mkdir -p /cgroup/c

echo $$ > /cgroup/a/tasks ; for ((i=0; i<12; i++)) ; do A.sh & done
echo $$ > /cgroup/b/tasks ; for ((i=0; i<12; i++)) ; do B.sh & done
echo $$ > /cgroup/c/tasks ; for ((i=0; i<12; i++)) ; do C.sh & done

echo 2048 > /cgroup/c/cpu.shares

Where [ABC].sh are spinners:

---
#!/bin/bash

while :; do :; done
---

for i in A B C ; do ps -deo pcpu,cmd | grep "${i}\.sh" | awk '{t += $1} END {print t}' ; done
639.7
629.8
1127.4

That is of course not perfect, but it's close enough.
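
(For reference, 'perfect' would be the box's 2400% split 1:1:2 by the shares,
i.e. 600 / 600 / 1200, assuming the WSM-EP's 24 logical CPUs and that each
group's 12 spinners can soak up whatever they are given.)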

Now you again.. :-)




[ISSUE] sched/cgroup: Does cpu-cgroup still works fine nowadays?

2014-05-12 Thread Michael wang
During our testing, we found that cpu.shares doesn't work as
expected; the test setup is:

X86 HOST:
12 CPU
GUEST(KVM):
6 VCPU

We create 3 GUESTs, each with 1024 shares; the workload inside them is:

GUEST_1:
dbench 6
GUEST_2:
stress -c 6
GUEST_3:
stress -c 6

So in theory, each GUEST should get (1024 / (3 * 1024)) * 1200% == 400%
according to the group shares (the 3 groups are created by the virt manager
at the same level, and they are the only groups running heavily in the system).

Now if only GUEST_1 is running, it gets 300% CPU, which is 1/4 of the whole
CPU resource.

So when all 3 GUESTs run concurrently, we expect:

        GUEST_1 GUEST_2 GUEST_3
CPU%    300%    450%    450%

That is, GUEST_1 gets the 300% it requires, and the unused 100% is
shared by the other groups.

But the result is:

        GUEST_1 GUEST_2 GUEST_3
CPU%    40%     580%    580%

GUEST_1 failed to get the CPU it required, and the dbench inside it
dropped a lot in performance.

So are these results expected (I really do not think so...)?

Or does that imply the cpu-cgroup has some issue to be fixed?

Any comments are welcomed :)

Regards,
Michael Wang
