Re: [RFC] perf/core: what is exclude_idle supposed to do
On Fri, Apr 20, 2018 at 9:51 AM Vince Weaver wrote: > On Fri, 20 Apr 2018, Vince Weaver wrote: > > > AFAICT it works on Power and possibly ARM. > > > > at least some ARMs are a bit more honest about it than x86 > > > > ivybridge: > > Performance counter stats for '/bin/ls': > > 1,368,162 instructions > > 1,368,162 instructions:I > > > > pi2/ARM cortex-A7 > > Performance counter stats for '/bin/ls': > > 1,910,083 instructions > > instructions:I > > > > I'd fire up my Power8 machine to see but not sure it's worth the hassle > > and/or having to get out the ear protection. > I did power up the Power8 machine in the end: > power8: > perf stat -e cycles,cycles:I sleep 5 > Performance counter stats for 'sleep 5': > 14,271,273 cycles > 14,271,273 cycles:I > ??? > But then if I try again on power8 > perf stat -a -e cycles,cycles:I sleep 5 > Performance counter stats for 'system wide': > 1,238,772,322,327 cycles > 1,238,674,771,713 cycles:I > there is a difference. > But then on ivybridge > perf stat -a -e cycles,cycles:I sleep 5 > Performance counter stats for 'system wide': > 589,598,104 cycles > 589,537,190 cycles:I This may be noise. The way the flag is named leads me to believe its goal is to not count when executing in the context of the idle task (pid 0 on each CPU). However, it does not seem to be implemented that way. If you were to implement this, then in system wide mode you'd have to check the incoming task on ctxsw, very much like we do in cgroup monitoring. So it would not be totally free. One can argue, in sampling mode you can eliminate the samples coming from PID=0 in the tool. But there would be nothing to cover counting mode. Interestingly, there is also code in perf tool to exclude known idle routines from reporting. But this is targeted to only some routines that the idle task may end up executing. so it is not quite the same. > So maybe exclude_idle does do something on x86? Or am I completely > misunderstanding what the flag is supposed to be indicating? > Vince
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Fri, 20 Apr 2018, Vince Weaver wrote: > > AFAICT it works on Power and possibly ARM. > > at least some ARMs are a bit more honest about it than x86 > > ivybridge: > Performance counter stats for '/bin/ls': > 1,368,162 instructions > 1,368,162 instructions:I > > pi2/ARM cortex-A7 > Performance counter stats for '/bin/ls': > 1,910,083 instructions > instructions:I > > I'd fire up my Power8 machine to see but not sure it's worth the hassle > and/or having to get out the ear protection. I did power up the Power8 machine in the end: power8: perf stat -e cycles,cycles:I sleep 5 Performance counter stats for 'sleep 5': 14,271,273 cycles 14,271,273 cycles:I ??? But then if I try again on power8 perf stat -a -e cycles,cycles:I sleep 5 Performance counter stats for 'system wide': 1,238,772,322,327 cycles 1,238,674,771,713 cycles:I there is a difference. But then on ivybridge perf stat -a -e cycles,cycles:I sleep 5 Performance counter stats for 'system wide': 589,598,104 cycles 589,537,190 cycles:I there is also a different in system wide mode. So maybe exclude_idle does do something on x86? Or am I completely misunderstanding what the flag is supposed to be indicating? Vince
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Fri, 20 Apr 2018, Peter Zijlstra wrote: > On Wed, Apr 18, 2018 at 11:10:20AM -0400, Vince Weaver wrote: > > On Tue, 17 Apr 2018, Jiri Olsa wrote: > > > > > On Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian wrote: > > > > Hi, > > > > > > > > I am trying to understand what the exclude_idle event attribute is > > > > supposed > > > > to accomplish. > > > > As per the definition in the header file: > > > > > > > > exclude_idle : 1, /* don't count when idle */ > > > > > > AFAICS it's not implemented > > > > so just to be completely clear hear, we're saying that the "exclude_idle" > > modifier has never done anything useful and still doesn't? > > AFAICT it works on Power and possibly ARM. at least some ARMs are a bit more honest about it than x86 ivybridge: Performance counter stats for '/bin/ls': 1,368,162 instructions 1,368,162 instructions:I pi2/ARM cortex-A7 Performance counter stats for '/bin/ls': 1,910,083 instructions instructions:I I'd fire up my Power8 machine to see but not sure it's worth the hassle and/or having to get out the ear protection. Vince
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Wed, Apr 18, 2018 at 11:10:20AM -0400, Vince Weaver wrote: > On Tue, 17 Apr 2018, Jiri Olsa wrote: > > > On Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian wrote: > > > Hi, > > > > > > I am trying to understand what the exclude_idle event attribute is > > > supposed > > > to accomplish. > > > As per the definition in the header file: > > > > > > exclude_idle : 1, /* don't count when idle */ > > > > AFAICS it's not implemented > > so just to be completely clear hear, we're saying that the "exclude_idle" > modifier has never done anything useful and still doesn't? AFAICT it works on Power and possibly ARM.
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian wrote: > Hi, > > I am trying to understand what the exclude_idle event attribute is supposed > to accomplish. > As per the definition in the header file: > > exclude_idle : 1, /* don't count when idle */ > > Naively, I thought it would simply stop the event when running in the > context of the idle task (swapper, pid 0) on any CPU. That would seem to > match the succinct description. > > However, running a simple: > > $ perf record -a -e cycles:I sleep 5 > perf_event_attr: > sample_type IP|TID|TIME|CPU|PERIOD >read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID >exclude_idle 1 > > on an idle machine, showed me that this is not what is actually happening: > $ perf script > swapper 0 [000] 1132634.287442: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > swapper 0 [001] 1132634.287498: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > > > After looking at the code, it all made sense, it seems to current > implementation is only relevant for sw events. I don't understand why. > > I am left wondering what is the goal of exclude_idle? A "git grep exclude_idle" seems to suggest powerpc/arm have it immplemented for their PMU. If we then look at commit: 2743a5b0fa6f ("perfcounters: provide expansion room in the ABI") It was Paul who introduced the bit. So I'm thinking that if x86 doesn't implement it, we should at least error out on it. Of course, so far we've allowed it, so who knows what all will break if we start asserting that :/
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Tue, 17 Apr 2018, Jiri Olsa wrote: > On Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian wrote: > > Hi, > > > > I am trying to understand what the exclude_idle event attribute is supposed > > to accomplish. > > As per the definition in the header file: > > > > exclude_idle : 1, /* don't count when idle */ > > AFAICS it's not implemented so just to be completely clear hear, we're saying that the "exclude_idle" modifier has never done anything useful and still doesn't? If so I should update the perf_event_open manpage to spell this out. Vince
Re: [RFC] perf/core: what is exclude_idle supposed to do
Em Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian escreveu: > Hi, > > I am trying to understand what the exclude_idle event attribute is supposed > to accomplish. > As per the definition in the header file: > > exclude_idle : 1, /* don't count when idle */ > > Naively, I thought it would simply stop the event when running in the > context of the idle task (swapper, pid 0) on any CPU. That would seem to > match the succinct description. > > However, running a simple: > > $ perf record -a -e cycles:I sleep 5 > perf_event_attr: > sample_type IP|TID|TIME|CPU|PERIOD >read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID >exclude_idle 1 > > on an idle machine, showed me that this is not what is actually happening: > $ perf script > swapper 0 [000] 1132634.287442: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > swapper 0 [001] 1132634.287498: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > > > After looking at the code, it all made sense, it seems to current > implementation is only relevant for sw events. I don't understand why. > > I am left wondering what is the goal of exclude_idle? To do something it is not doing, i.e. to do what you expected it to do. There were messages exchanged recently where PeterZ said that this is just that needs fixing. - Arnaldo
Re: [RFC] perf/core: what is exclude_idle supposed to do
On Mon, Apr 16, 2018 at 10:04:53PM +, Stephane Eranian wrote: > Hi, > > I am trying to understand what the exclude_idle event attribute is supposed > to accomplish. > As per the definition in the header file: > > exclude_idle : 1, /* don't count when idle */ > > Naively, I thought it would simply stop the event when running in the > context of the idle task (swapper, pid 0) on any CPU. That would seem to > match the succinct description. > > However, running a simple: > > $ perf record -a -e cycles:I sleep 5 > perf_event_attr: > sample_type IP|TID|TIME|CPU|PERIOD >read_format TOTAL_TIME_ENABLED|TOTAL_TIME_RUNNING|ID >exclude_idle 1 > > on an idle machine, showed me that this is not what is actually happening: > $ perf script > swapper 0 [000] 1132634.287442: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > swapper 0 [001] 1132634.287498: 1 cycles:I: > 8100b1fb __intel_pmu_enable_all.isra.17 ([kernel.kallsyms]) > > > After looking at the code, it all made sense, it seems to current > implementation is only relevant for sw events. I don't understand why. AFAICS it's not implemented Peter suggested some time ago change for cpu-clock (attached) I still have it in my queue, because it gives funny numbers sometimes.. and I haven't figured it out so far the idea so far was to use cpu-clock,cpu-clock:I events to count and display idle % in perf top/record and stat metrics jirka --- >From 7f20b047cf56f8086d79007175592633798656ce Mon Sep 17 00:00:00 2001 From: Peter Zijlstra Date: Thu, 23 Nov 2017 16:15:36 +0100 Subject: [PATCH] perf: Add support to monitor idle time on cpu-clock Adding support to use perf_event_attr::exclude_idle (the :I modifierr in perf tools)to monitor idle time on cpu-clock event. Link: http://lkml.kernel.org/n/tip-jw45kl6ydrhzqx4uxws0q...@git.kernel.org Signed-off-by: Jiri Olsa --- kernel/events/core.c | 22 ++ kernel/sched/idle_task.c | 17 + 2 files changed, 35 insertions(+), 4 deletions(-) diff --git a/kernel/events/core.c b/kernel/events/core.c index 2989e3b0fe8b..62c0dc75ed11 100644 --- a/kernel/events/core.c +++ b/kernel/events/core.c @@ -9391,19 +9391,33 @@ static void perf_swevent_init_hrtimer(struct perf_event *event) * Software event: cpu wall time clock */ +static u64 cpu_clock_count(struct perf_event *event) +{ + u64 now = local_clock(); + + if (event->attr.exclude_idle) + now -= idle_task(event->oncpu)->se.sum_exec_runtime; + + return now; +} + static void cpu_clock_event_update(struct perf_event *event) { - s64 prev; + s64 prev, delta; u64 now; - now = local_clock(); + now = cpu_clock_count(event); prev = local64_xchg(&event->hw.prev_count, now); - local64_add(now - prev, &event->count); + delta = now - prev; + if (delta > 0) + local64_add(delta, &event->count); } static void cpu_clock_event_start(struct perf_event *event, int flags) { - local64_set(&event->hw.prev_count, local_clock()); + u64 now = cpu_clock_count(event); + + local64_set(&event->hw.prev_count, now); perf_swevent_start_hrtimer(event); } diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c index d518664cce4f..e360ce5dd9a2 100644 --- a/kernel/sched/idle_task.c +++ b/kernel/sched/idle_task.c @@ -27,9 +27,14 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl static struct task_struct * pick_next_task_idle(struct rq *rq, struct task_struct *prev, struct rq_flags *rf) { + struct task_struct *idle = rq->idle; + put_prev_task(rq, prev); update_idle_core(rq); schedstat_inc(rq->sched_goidle); + + idle->se.exec_start = rq_clock_task(rq); + return rq->idle; } @@ -48,6 +53,15 @@ dequeue_task_idle(struct rq *rq, struct task_struct *p, int flags) static void put_prev_task_idle(struct rq *rq, struct task_struct *prev) { + struct task_struct *idle = rq->idle; + u64 delta, now; + + now = rq_clock_task(rq); + delta = now - idle->se.exec_start; + if (likely((s64)delta > 0)) + idle->se.sum_exec_runtime += delta; + idle->se.exec_start = now; + rq_last_tick_reset(rq); } @@ -57,6 +71,9 @@ static void task_tick_idle(struct rq *rq, struct task_struct *curr, int queued) static void set_curr_task_idle(struct rq *rq) { + struct task_struct *idle = rq->idle; + + idle->se.exec_start = rq_clock_task(rq); } static void switched_to_idle(struct rq *rq, struct task_struct *p) -- 2.13.6