On 2020/5/29 2:34 AM, Andrii Nakryiko wrote:
[snip]
>>>
>>> With CO-RE, it also will allow to compile this tool once and run it on
>>> many different kernels without recompilation. Please do take a look
>>> and submit a PR there, it will be a good addition to the toolkit (and
>>> will force you w
Hi, Andrii
Thanks for your comments :-)
On 2020/5/28 2:36 PM, Andrii Nakryiko wrote:
[snip]
>> ---
>
> I haven't looked through implementation thoroughly yet. But I have few
> general remarks.
>
> This looks like a useful and generic tool. I think it will get most
> attention and be most useful
This is a tool to trace the scheduling events of a
specified task, e.g. migration, sched in/out, wakeup and
sleep/block.
Each event is translated into a sentence to be more readable;
by executing the command 'task_detector -p 49870' we continually
trace the scheduling events related to 'top', like:
Hi, Folks
Please feel free to comment if you got any concerns :-)
Hi, Peter
What do you think about this version?
Please let us know if it's still not good enough to be accepted :-)
Regards,
Michael Wang
On 2019/7/16 11:38 AM, 王贇 wrote:
> During our torturing on numa stuff, we found
rsion :-)
Regards,
Michael Wang
On 2019/7/16 11:38 AM, 王贇 wrote:
> During our torturing on numa stuff, we found problems like:
>
> * missing per-cgroup information about the per-node execution status
> * missing per-cgroup information about the numa locality
>
> That is wh
On 2019/7/12 4:58 PM, 王贇 wrote:
[snip]
>
> I see, we should not override the decision of select_idle_sibling().
>
> Actually the original design we try to achieve is:
>
> let wake affine select the target
> try find idle sibling of target
> if got one
>
Although we have paid a lot of effort to settle a task down on a particular
node, there are still chances for the task to leave its preferred
node, that is, by wakeup, numa swap migrations or load balance.
When we are using the cpu cgroup in a shared way, since all the workloads
see all the cpus, it could be reall
On 2019/7/20 12:39 AM, Michal Koutný wrote:
> On Tue, Jul 16, 2019 at 11:40:35AM +0800, 王贇
> wrote:
>> By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see new
>> output line heading with 'exectime', like:
>>
>> exectime 311
Although we have paid a lot of effort to settle a task down on a particular
node, there are still chances for the task to leave its preferred
node, that is, by wakeup, numa swap migrations or load balance.
When we are using the cpu cgroup in a shared way, since all the workloads
see all the cpus, it could be reall
This patch introduces numa execution time information, to indicate the numa
efficiency.
By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we see a new
output line heading with 'exectime', like:
exectime 311900 407166
which means the tasks of this cgroup executed 311900 microseconds on
n
By tracing numa page faults, we recognize tasks sharing the same page,
and try to pack them together into a single numa group.
However, when two tasks share lots of cache pages but not many
anonymous pages, since numa balancing does not trace cache pages, they
have no chance to join into the same gro
This patch introduces a numa locality statistic, which tries to indicate
the numa balancing efficiency per memory cgroup.
On numa balancing, we trace the local page accessing ratio of tasks,
which we call the locality.
By doing 'cat /sys/fs/cgroup/cpu/CGROUP_PATH/cpu.numa_stat', we
see an output line headi
During our torture testing of numa stuff, we found problems like:
* missing per-cgroup information about the per-node execution status
* missing per-cgroup information about the numa locality
That is, when we have a cpu cgroup running with a bunch of tasks, there is no
good way to tell how its tasks are deali
Hi Michal,
Thx for the comments :-)
On 2019/7/15 8:10 PM, Michal Koutný wrote:
> Hello Yun.
>
> On Fri, Jul 12, 2019 at 06:10:24PM +0800, 王贇
> wrote:
>> Forgive me but I have no idea on how to combine this
>> with memory cgroup's locality hierarchical update...
&
On 2019/7/12 6:10 PM, 王贇 wrote:
[snip]
>>
>> Documentation/cgroup-v1/cpusets.txt
>>
>> Look for mems_allowed.
>
> This is the attribute belong to cpuset cgroup isn't it?
>
> Forgive me but I have no idea on how to combine this
> with memory cgrou
On 2019/7/12 5:42 PM, Peter Zijlstra wrote:
> On Fri, Jul 12, 2019 at 05:11:25PM +0800, 王贇 wrote:
>>
>>
>> On 2019/7/12 3:58 PM, Peter Zijlstra wrote:
>> [snip]
>>>>>
>>>>> Then our task t1 should be accounted to B (as you do), but also
On 2019/7/12 3:58 PM, Peter Zijlstra wrote:
[snip]
>>>
>>> Then our task t1 should be accounted to B (as you do), but also to A and
>>> R.
>>
>> I get the point but not quite sure about this...
>>
>> Unlike pages, there is no hierarchical limitation on locality; also tasks
>
> You can use cpus
On 2019/7/12 3:53 PM, Peter Zijlstra wrote:
[snip]
return target;
}
>>>
>>> Select idle sibling should never cross node boundaries and is thus the
>>> entirely wrong place to fix anything.
>>
>> Hmm.. in our early testing the printk showed both select_task_rq_fair() and
>> task_numa_f
On 2019/7/11 10:10 PM, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:32:32AM +0800, 王贇 wrote:
>> By tracing numa page faults, we recognize tasks sharing the same page,
>> and try pack them together into a single numa group.
>>
>> However when two task share lot
On 2019/7/11 9:47 PM, Peter Zijlstra wrote:
[snip]
>> +	rcu_read_lock();
>> +	memcg = mem_cgroup_from_task(p);
>> +	if (idx != -1)
>> +		this_cpu_inc(memcg->stat_numa->locality[idx]);
>
> I thought cgroups were supposed to be hierarchical. That is, if we have:
>
> R
On 2019/7/11 9:45 PM, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:29:15AM +0800, 王贇 wrote:
>
>> +++ b/include/linux/memcontrol.h
>> @@ -190,6 +190,7 @@ enum memcg_numa_locality_interval {
>>
>> struct memcg_stat_numa {
>> u64 locality[N
On 2019/7/11 9:43 PM, Peter Zijlstra wrote:
> On Wed, Jul 03, 2019 at 11:28:10AM +0800, 王贇 wrote:
>> +#ifdef CONFIG_NUMA_BALANCING
>> +
>> +enum memcg_numa_locality_interval {
>> +	PERCENT_0_29,
>> +	PERCENT_30_39,
>> +	PERCENT_40_49,
>&
On 2019/7/11 10:27 PM, Peter Zijlstra wrote:
[snip]
>> Thus we introduce numa cling, which tries to prevent tasks from leaving
>> the preferred node on the wakeup fast path.
>
>
>> @@ -6195,6 +6447,13 @@ static int select_idle_sibling(struct task_struct *p,
>> int prev, int target)
>> if ((unsig
Hi folks,
What do you think about these patches?
During most of our tests the results show stable improvements, thus
we consider this a generic problem and proposed this solution,
hoping to help address the issue.
Comments are sincerely welcome :-)
Regards,
Michael Wang
On 2019/7/3 11:26 AM, 王
Although we have paid a lot of effort to settle a task down on a particular
node, there are still chances for the task to leave its preferred
node, that is, by wakeup, numa swap migrations or load balance.
When we are using the cpu cgroup in a shared way, since all the workloads
see all the cpus, it could be reall
On 2019/7/8 4:07 PM, Hillf Danton wrote:
>
> On Mon, 8 Jul 2019 10:25:27 +0800 Michael Wang wrote:
>> /* Attempt to migrate a task to a CPU on the preferred node. */
>> static void numa_migrate_preferred(struct task_struct *p)
>> {
>> +	bool failed, target;
>> 	unsigned long interval = HZ;
>
Although we have paid a lot of effort to settle a task down on a particular
node, there are still chances for the task to leave its preferred
node, that is, by wakeup, numa swap migrations or load balance.
When we are using the cpu cgroup in a shared way, since all the workloads
see all the cpus, it could be reall
By tracing numa page faults, we recognize tasks sharing the same page,
and try to pack them together into a single numa group.
However, when two tasks share lots of cache pages but not many
anonymous pages, since numa balancing does not trace cache pages, they
have no chance to join into the same gro
This patch introduces numa execution information, to indicate the numa
efficiency.
By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line heading with 'exectime', like:
exectime 311900 407166
which means the tasks of this cgroup executed 311900 microseconds on
This patch introduces a numa locality statistic, which tries to indicate
the numa balancing efficiency per memory cgroup.
By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line heading with 'locality', the format is:
locality 0%~29% 30%~39% 40%~49% 50%~59% 60%~69% 70
During our torture testing of numa stuff, we found problems like:
* missing per-cgroup information about the per-node execution status
* missing per-cgroup information about the numa locality
That is, when we have a cpu cgroup running with a bunch of tasks, there is no
good way to tell how its tasks are deali
On 2019/4/23 5:46 PM, Peter Zijlstra wrote:
> On Tue, Apr 23, 2019 at 05:36:25PM +0800, 王贇 wrote:
>>
>>
>> On 2019/4/23 4:52 PM, Peter Zijlstra wrote:
>>> On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
>>>> This patch introduced numa execution info
On 2019/4/23 5:05 PM, Peter Zijlstra wrote:
[snip]
>>
>> TODO:
>> * improve the logic to address the regression cases
>> * Find a way, maybe, to handle the page cache left on remote
>> * find more scenery which could gain benefit
>>
>> Signed-off-by: Michael Wang
>> ---
>> drivers/Makef
On 2019/4/23 4:55 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:13:36AM +0800, 王贇 wrote:
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index af171ccb56a2..6513504373b4 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -2031,6 +2031,10
On 2019/4/23 4:52 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:12:20AM +0800, 王贇 wrote:
>> This patch introduced numa execution information, to imply the numa
>> efficiency.
>>
>> By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', w
On 2019/4/23 4:47 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> +p->numa_faults_locality[mem_node == numa_node_id() ? 4 : 3] += pages;
>
> Possibly: 3 + !!(mem_node == numa_node_id()), generates better code.
Sounds good~ will app
On 2019/4/23 4:46 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> + * 0 -- remote faults
>> + * 1 -- local faults
>> + * 2 -- page migration failure
>> + * 3 -- remote page accessing after page migration
>> +
On 2019/4/23 4:44 PM, Peter Zijlstra wrote:
> On Mon, Apr 22, 2019 at 10:11:24AM +0800, 王贇 wrote:
>> +#ifdef CONFIG_NUMA_BALANCING
>> +
>> +enum memcg_numa_locality_interval {
>> +	PERCENT_0_9,
>> +	PERCENT_10_19,
>> +	PERCENT_20_29,
>&
w what's the problem is, we'll try to address them :-)
Regards,
Michael Wang
>
> Thanks
> Wind
>
>
> 王贇 mailto:yun.w...@linux.alibaba.com>>
> wrote on Mon, 2019/4/22 at 10:13 AM:
>
> We have NUMA Balancing feature which always trying to move pages
>
The numa balancer is a module which tries to automatically adjust numa
balancing to gain as much numa bonus as possible.
For each memory cgroup, we process the work in two steps:
In stage 1 we check the cgroup's exectime and memory topology to see
if there could be a candidate for settling down,
Now we have a way to estimate and adjust the numa preferred node for each
memcg; the next problem is how to use them.
Usually one will bind workloads with cpuset.cpus, combined with cpuset.mems
or, maybe better, the memory policy to achieve the numa bonus, however in
complicated scenarios like combined type of
This patch adds a new entry 'numa_preferred' for each memory cgroup,
by which we can now override the memory policy of the tasks inside
a particular cgroup; combined with numa balancing, we are now able to
migrate the workloads of a cgroup to the specified numa node, in a gentle
way.
The load balancin
This patch introduces numa execution information, to indicate the numa
efficiency.
By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line heading with 'exectime', like:
exectime 24399843 27865444
which means the tasks of this cgroup executed 24399843 ticks on no
This patch introduces a numa locality statistic, which tries to indicate
the numa balancing efficiency per memory cgroup.
By doing 'cat /sys/fs/cgroup/memory/CGROUP_PATH/memory.numa_stat', we
see a new output line heading with 'locality', the format is:
locality 0~9% 10%~19% 20%~29% 30%~39% 40%~49% 50%~
We have the NUMA Balancing feature, which always tries to move pages
of a task to the node on which it executes more, while still having issues:
* page cache can't be handled
* no cgroup level balancing
Suppose we have a box with 4 cpus and two cgroups A & B, each running 4 tasks;
the below scenario could be easily observe
On 2019/1/28 3:21 PM, 禹舟键 wrote:
[snip]
No offense, but I'm afraid you misunderstand the problem we are trying to
solve with wait_sum. If your purpose is to have a way to tell whether there
is sufficient CPU inside a container, please try lxcfs + top; if there is
almost no idle and the load is high, then t
ld be
in kernel.
Regards,
Michael Wang
The extra overhead of calculating the hierarchical wait_sum is traversing
the cfs_rq's se from the target task's se to the root_task_group children's se.
Regards,
Yuzhoujian
王贇 mailto:yun.w...@linux.alibaba.com>>
wrote on Fri, 2019/1/25 at 11:12 AM:
On 2019/1/23 5:46 PM, ufo19890...@gmail.com wrote:
From: yuzhoujian
We can monitor the sum wait time of a task group since 'commit 3d6c50c27bd6
("sched/debug: Show the sum wait time of a task group")'. However this
wait_sum just represents the conflict between different task groups, since
it
Although we can rely on cpuacct to present the cpu usage of task
groups, it is hard to tell how intense the competition is between
these groups on cpu resources.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good way to accurately represent the
confli
On 2018/7/23 5:31 PM, Peter Zijlstra wrote:
On Wed, Jul 04, 2018 at 11:27:27AM +0800, 王贇 wrote:
@@ -6788,6 +6790,12 @@ static int cpu_cfs_stat_show(struct seq_file *sf, void *v)
	seq_printf(sf, "nr_throttled %d\n", cfs_b->nr_throttled);
	seq_printf(sf, "thr
Hi, folks
On 2018/7/4 11:27 AM, 王贇 wrote:
Although we can rely on cpuacct to present the cpu usage of task
groups, it is hard to tell how intense the competition is between
these groups on cpu resources.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no
On 2018/7/4 11:27 AM, 王贇 wrote:
Although we can rely on cpuacct to present the cpu usage of task
groups, it is hard to tell how intense the competition is between
these groups on cpu resources.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good
Although we can rely on cpuacct to present the cpu usage of task
groups, it is hard to tell how intense the competition is between
these groups on cpu resources.
Monitoring the wait time of each process or sched_debug could cost
too much, and there is no good way to accurately represent the
confli
Hi, Peter
On 2018/7/2 8:03 PM, Peter Zijlstra wrote:
On Mon, Jul 02, 2018 at 03:29:39PM +0800, 王贇 wrote:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1866e64..ef82ceb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -862,6 +862,7 @@ static void update_curr_fair
Although we can rely on cpuacct to present the cpu usage of task
groups, it is hard to tell how intense the competition is between
these groups on cpu resources.
Monitoring the wait time of each process could cost too much, and
there is no good way to accurately represent the conflict with
these i