Re: [RFC] perf: a different approach to perf_rotate_context()
> On Mar 3, 2018, at 7:26 AM, Peter Zijlstra wrote:
>
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>
>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure
>> whether this will change the behavior in some use cases.
>>
>> Please kindly let me know whether this approach makes sense.
>
> What you've not said, and what is not at all clear, is whether your
> scheme preserves fairness.
>
> In any case, there's a ton of conflict against the patches here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
>
> And with those the idea was to move to a virtual time based scheduler
> (basically schedule those flexible events that have the biggest lag --
> that also solves 1).

While looking at these patches, I found they might not solve issue #1
(starvation of cpuctx->task_ctx->flexible_groups). Here is an example on an
Intel CPU (where ref-cycles can only use one hardware counter).

First, in one console, start:

    perf stat -e ref-cycles -I 1

Second, in another console, run:

    perf stat -e ref-cycles -- benchmark

The second event will not run, because the first event occupies the counter
all the time.

Maybe we can solve this by combining the two flexible_groups (cpuctx->ctx
and cpuctx->task_ctx) and rotating them together? If this sounds reasonable,
I will draft an RFC for it.

Thanks,
Song
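The starvation above can be reproduced in miniature. Here is a toy Python simulation (not kernel code; all names are illustrative) of the current scheme, in which cpuctx->ctx flexible events are always tried before cpuctx->task_ctx flexible events: with a single counter, the per-task event never gets on the PMU, no matter how many times we rotate.

```python
# Toy model of perf_rotate_context() under the current scheme (issue #1):
# rotate both flexible lists, then sched_in cpu-wide events first and
# per-task events only if counters remain.
NUM_COUNTERS = 1

def rotate_and_schedule(cpu_flexible, task_flexible):
    """One rotation tick: rotate_left both lists, cpu ctx wins ties."""
    cpu_flexible.append(cpu_flexible.pop(0))
    task_flexible.append(task_flexible.pop(0))
    scheduled = []
    for ev in cpu_flexible + task_flexible:   # cpu ctx is always tried first
        if len(scheduled) < NUM_COUNTERS:
            scheduled.append(ev)
    return scheduled

cpu_events = ["cpu/ref-cycles"]    # e.g. from: perf stat -e ref-cycles -I 1
task_events = ["task/ref-cycles"]  # e.g. from: perf stat -e ref-cycles -- benchmark

history = [rotate_and_schedule(cpu_events, task_events) for _ in range(10)]
ran = {ev for tick in history for ev in tick}
print(ran)  # only the cpu-wide event ever runs: {'cpu/ref-cycles'}
```

Rotating the two lists as one combined list, as suggested above, would let both events alternate on the single counter instead.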
Re: [RFC] perf: a different approach to perf_rotate_context()
On 03.03.2018 20:48, Peter Zijlstra wrote:
> On Sat, Mar 03, 2018 at 04:43:16PM +, Song Liu wrote:
>>> In any case, there's a ton of conflict against the patches here:
>>>
>>>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
>>>
>>> And with those the idea was to move to a virtual time based scheduler
>>> (basically schedule those flexible events that have the biggest lag --
>>> that also solves 1).
>>
>> Thanks for this information. I will study this approach. Maybe that is
>> our path to PMU sharing.
>
> So I'm really not convinced on that whole PMU sharing.
>
>> What is the status of this work? Would it land in 4.17?
>
> These patches might make 4.17; they got held up because of the whole
> meltdown/spectre crap and I need to get back to them.

That work has long been desired and would bring a performance boost,
specifically on server systems in per-process profiling mode, along with a
good speedup on context switches. The meltdown/spectre work understandably
displaced it at some point, but those improvements would still bring
significant value and are still awaited.

BR,
Alexey
Re: [RFC] perf: a different approach to perf_rotate_context()
On Sat, Mar 03, 2018 at 04:43:16PM +, Song Liu wrote:
>> In any case, there's a ton of conflict against the patches here:
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
>>
>> And with those the idea was to move to a virtual time based scheduler
>> (basically schedule those flexible events that have the biggest lag --
>> that also solves 1).
>
> Thanks for this information. I will study this approach. Maybe that is
> our path to PMU sharing.

So I'm really not convinced on that whole PMU sharing.

> What is the status of this work? Would it land in 4.17?

These patches might make 4.17; they got held up because of the whole
meltdown/spectre crap and I need to get back to them.
Re: [RFC] perf: a different approach to perf_rotate_context()
> On Mar 3, 2018, at 7:26 AM, Peter Zijlstra wrote:
>
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>> When there are more perf_events than hardware PMCs, perf rotates events
>> so that all events get a chance to run. Currently, the rotation works as:
>>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>   try sched_in flexible_groups in cpuctx->ctx;
>>   try sched_in flexible_groups in cpuctx->task_ctx.
>>
>> This approach has some potential issues:
>>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>>      all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>>      at all.
>>   2. if pinned_groups occupy all hardware PMCs, the rotation still
>>      triggers every perf_event_mux_interval_ms, but it cannot schedule
>>      any events.
>>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>>      rotated separately, there are N x M possible combinations. It is
>>      difficult to remember and reuse all the rotation combinations. As
>>      a result, it is necessary to try sched_in the flexible_groups on
>>      each rotation.
>>
>> This patch does the rotation differently. Each perf_event in the cpuctx
>> (ctx and task_ctx) is assigned a rotation_id. The rotation_ids are
>> assigned during the first few rotations after any change in the
>> perf_events attached to the cpuctx. Once rotation_ids are assigned for
>> all events in the cpuctx, perf_rotate_context() simply picks the next
>> rotation to use, so there is no more "try to sched_in" for future
>> rotations.
>
> I'm not following.

Let me try to explain it in different words. The main goal of this approach
is to pre-compute all the rotations, so that perf_event scheduling is less
expensive. It does change the current scheduling mechanism, by introducing
a rotation_id for each event.

With rotation_ids, all events in the flexible_groups have exactly the same
chance to run: if num_rotations == 2, every flexible event runs 50% of the
time; if num_rotations == 3, every flexible event runs 33% of the time; etc.

>> Special rotation_ids are introduced to handle the issues above.
>> flexible_groups that conflict with pinned_groups are marked ALWAYS_OFF,
>> so they are not rotated (fixes issue 2). flexible_groups in cpuctx->ctx
>> and cpuctx->task_ctx are rotated together, so they all get an equal
>> chance to run (improves issue 1).
>
> I can't say I particularly care about 2, that's a really daft situation
> to get into. And changing 1 needs careful consideration.
>
>> With this approach, we only do the complex scheduling of flexible_groups
>> once. This enables us to do more complex scheduling, for example,
>> sharing PMU counters across compatible events:
>>   https://lkml.org/lkml/2017/12/1/410
>>
>> There are also some potential downsides of this approach.
>>
>> First, it gives all flexible_groups exactly the same chance to run, so
>> it may waste some PMC cycles. For example, if 5 groups, ABCDE, are
>> assigned to two rotations, rotation-0: ABCD and rotation-1: E, this
>> approach will NOT try any of ABCD in rotation-1.
>
> Yeah, that doesn't look acceptable. In fact, people already complained
> about the current scheme not being optimal; what you propose is _far_
> worse.

I agree this is an issue. Our initial goal is to get more events running
with PMU sharing (Tejun's RFC). However, I found the sharing is difficult
to implement with the current scheduling scheme. This RFC tries to pave the
road for PMU sharing. If there are other ways (virtual time based
scheduler, etc.) that make PMU sharing possible, we will be happy to start
from those.

>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure
>> whether this will change the behavior in some use cases.
>>
>> Please kindly let me know whether this approach makes sense.
>
> What you've not said, and what is not at all clear, is whether your
> scheme preserves fairness.

This approach gives all events in ctx->flexible_groups and
task_ctx->flexible_groups exactly the same chance to run. I am not sure
whether that is acceptable in terms of fairness.

> In any case, there's a ton of conflict against the patches here:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
>
> And with those the idea was to move to a virtual time based scheduler
> (basically schedule those flexible events that have the biggest lag --
> that also solves 1).

Thanks for this information. I will study this approach. Maybe that is our
path to PMU sharing. What is the status of this work? Would it land in
4.17?

Thanks again,
Song
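The fairness and waste properties being debated can be sketched together. The following is a toy Python model of the proposed rotation_id scheme as described above (my sketch, not the patch itself): five single-counter groups packed onto four PMCs give num_rotations == 2, every event runs exactly half the time, and rotation-1 leaves three counters idle because ABCD are never retried there.

```python
# Toy sketch of the rotation_id scheme: greedily pack flexible groups into
# rotations of at most NUM_COUNTERS events, then cycle through rotations.
NUM_COUNTERS = 4

def assign_rotation_ids(events):
    """Pre-compute rotations; each event lands in exactly one rotation."""
    rotations = []
    for i, ev in enumerate(events):
        rid = i // NUM_COUNTERS
        if rid == len(rotations):
            rotations.append([])
        rotations[rid].append(ev)
    return rotations

rotations = assign_rotation_ids(list("ABCDE"))
print(rotations)                      # [['A', 'B', 'C', 'D'], ['E']]

num_rotations = len(rotations)
for ev in "ABCDE":
    share = sum(ev in r for r in rotations) / num_rotations
    print(ev, share)                  # every event runs 50% of the time

idle = NUM_COUNTERS - len(rotations[1])
print("idle counters in rotation-1:", idle)   # 3 PMCs wasted
```

This makes Peter's objection concrete: fairness here means "equal shares", but it is bought by leaving counters idle that the current scheme would fill.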
Re: [RFC] perf: a different approach to perf_rotate_context()
> On Mar 3, 2018, at 5:39 AM, Jiri Olsa wrote:
>
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>> When there are more perf_events than hardware PMCs, perf rotates events
>> so that all events get a chance to run. Currently, the rotation works as:
>>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>   try sched_in flexible_groups in cpuctx->ctx;
>>   try sched_in flexible_groups in cpuctx->task_ctx.
>>
>> This approach has some potential issues:
>>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>>      all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>>      at all.
>>   2. if pinned_groups occupy all hardware PMCs, the rotation still
>>      triggers every perf_event_mux_interval_ms, but it cannot schedule
>>      any events.
>>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>>      rotated separately, there are N x M possible combinations. It is
>>      difficult to remember and reuse all the rotation combinations. As
>>      a result, it is necessary to try sched_in the flexible_groups on
>>      each rotation.
>>
>> This patch does the rotation differently. Each perf_event in the cpuctx
>> (ctx and task_ctx) is assigned a rotation_id. The rotation_ids are
>> assigned during the first few rotations after any change in the
>> perf_events attached to the cpuctx. Once rotation_ids are assigned for
>> all events in the cpuctx, perf_rotate_context() simply picks the next
>> rotation to use, so there is no more "try to sched_in" for future
>> rotations.
>>
>> Special rotation_ids are introduced to handle the issues above.
>> flexible_groups that conflict with pinned_groups are marked ALWAYS_OFF,
>> so they are not rotated (fixes issue 2). flexible_groups in cpuctx->ctx
>> and cpuctx->task_ctx are rotated together, so they all get an equal
>> chance to run (improves issue 1).
>
> hum, so the improvement is that cpuctx could eventually give
> up some space for task_ctx events, but both ctxs still rotate
> separately no? you keep rotation ID per single context..

With this approach, both ctxs are rotated together. It is possible that
cpuctx->ctx only has events for rotations 0 and 1, while cpuctx->task_ctx
only has events for rotations 2 and 3; but both of them will rotate among
0, 1, 2, 3. num_rotations and curr_rotation could be made part of cpuctx,
as they are eventually shared between the two contexts.

>> With this approach, we only do the complex scheduling of flexible_groups
>> once. This enables us to do more complex scheduling, for example,
>> sharing PMU counters across compatible events:
>>   https://lkml.org/lkml/2017/12/1/410
>
> how could this code do that?

We still need a lot more work to get PMU sharing to work. I think one of
the problems with Tejun's RFC is more expensive scheduling. This RFC tries
to pre-compute all rotations, so we only need to do this expensive
scheduling once.

>> There are also some potential downsides of this approach.
>>
>> First, it gives all flexible_groups exactly the same chance to run, so
>> it may waste some PMC cycles. For example, if 5 groups, ABCDE, are
>> assigned to two rotations, rotation-0: ABCD and rotation-1: E, this
>> approach will NOT try any of ABCD in rotation-1.
>>
>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure
>> whether this will change the behavior in some use cases.
>>
>> Please kindly let me know whether this approach makes sense.
>
> SNIP
>
>> +/*
>> + * identify always_on and always_off groups in flexible_groups, call
>> + * group_sched_in() for always_on groups
>> + */
>> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
>> +					  struct perf_cpu_context *cpuctx)
>> +{
>> +	struct perf_event *event;
>> +
>> +	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> +		if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
>> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
>> +			ctx->nr_sched++;
>> +			WARN_ON(group_sched_in(event, cpuctx, ctx));
>> +			continue;
>> +		}
>> +		if (group_sched_in(event, cpuctx, ctx)) {
>> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
>> +			ctx->nr_sched++;
>
> should ctx->nr_sched be incremented in the 'else' leg?

The else leg means the event does not conflict with the pinned groups, so
it will be scheduled later in ctx_add_rotation(). This function only
handles always_on and always_off events.

>> +		}
>> +		group_sched_out(event, cpuctx, ctx);
>> +	}
>> +}
>> +
>> +/* add unassigned flexible_groups to new rotation_id */
>> +static void ctx_add_rotation(struct perf_event_context *ctx,
>> +			     struct perf_cpu_context *cpuctx)
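The "both ctxs rotate together" point can be illustrated with a toy Python model (illustrative names, not the patch): cpuctx->ctx owns rotation ids 0 and 1, cpuctx->task_ctx owns 2 and 3, and a single curr_rotation shared by both contexts cycles through all four, so each context simply sits idle during the other's rotations.

```python
# Toy model of combined rotation across both contexts: rotation ids are
# assigned from one shared id space, and one shared counter picks the
# currently running rotation for both contexts at once.
cpu_ctx  = {"C0": 0, "C1": 1}   # cpuctx->ctx events -> rotation_id
task_ctx = {"T2": 2, "T3": 3}   # cpuctx->task_ctx events -> rotation_id
num_rotations = 4               # shared between the two contexts

def events_for_rotation(curr_rotation):
    """Events from BOTH contexts whose rotation_id matches curr_rotation."""
    both = {**cpu_ctx, **task_ctx}
    return sorted(ev for ev, rid in both.items() if rid == curr_rotation)

schedule = [events_for_rotation(r % num_rotations) for r in range(num_rotations)]
print(schedule)   # [['C0'], ['C1'], ['T2'], ['T3']]
```

Each event gets 1/num_rotations of the time regardless of which context it lives in, which is exactly why the cpu/task priority question raised earlier in the thread arises.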
Re: [RFC] perf: a different approach to perf_rotate_context()
On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> When there are more perf_events than hardware PMCs, perf rotates events
> so that all events get a chance to run. Currently, the rotation works as:
>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   try sched_in flexible_groups in cpuctx->ctx;
>   try sched_in flexible_groups in cpuctx->task_ctx.
>
> This approach has some potential issues:
>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>      all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>      at all.
>   2. if pinned_groups occupy all hardware PMCs, the rotation still
>      triggers every perf_event_mux_interval_ms, but it cannot schedule
>      any events.
>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>      rotated separately, there are N x M possible combinations. It is
>      difficult to remember and reuse all the rotation combinations. As
>      a result, it is necessary to try sched_in the flexible_groups on
>      each rotation.
>
> This patch does the rotation differently. Each perf_event in the cpuctx
> (ctx and task_ctx) is assigned a rotation_id. The rotation_ids are
> assigned during the first few rotations after any change in the
> perf_events attached to the cpuctx. Once rotation_ids are assigned for
> all events in the cpuctx, perf_rotate_context() simply picks the next
> rotation to use, so there is no more "try to sched_in" for future
> rotations.

I'm not following.

> Special rotation_ids are introduced to handle the issues above.
> flexible_groups that conflict with pinned_groups are marked ALWAYS_OFF,
> so they are not rotated (fixes issue 2). flexible_groups in cpuctx->ctx
> and cpuctx->task_ctx are rotated together, so they all get an equal
> chance to run (improves issue 1).

I can't say I particularly care about 2, that's a really daft situation
to get into. And changing 1 needs careful consideration.

> With this approach, we only do the complex scheduling of flexible_groups
> once. This enables us to do more complex scheduling, for example,
> sharing PMU counters across compatible events:
>   https://lkml.org/lkml/2017/12/1/410
>
> There are also some potential downsides of this approach.
>
> First, it gives all flexible_groups exactly the same chance to run, so
> it may waste some PMC cycles. For example, if 5 groups, ABCDE, are
> assigned to two rotations, rotation-0: ABCD and rotation-1: E, this
> approach will NOT try any of ABCD in rotation-1.

Yeah, that doesn't look acceptable. In fact, people already complained
about the current scheme not being optimal; what you propose is _far_
worse.

> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
> exactly the same priority and an equal chance to run. I am not sure
> whether this will change the behavior in some use cases.
>
> Please kindly let me know whether this approach makes sense.

What you've not said, and what is not at all clear, is whether your
scheme preserves fairness.

In any case, there's a ton of conflict against the patches here:

  https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing

And with those the idea was to move to a virtual time based scheduler
(basically schedule those flexible events that have the biggest lag --
that also solves 1).
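For readers following along, here is a rough Python sketch of what a lag-based ("virtual time") flexible scheduler could look like. This is only my reading of the idea, not the actual patches: every event accrues owed service time while it is off the PMU, each rotation runs the events with the biggest lag, and over many ticks every event converges to an equal share, which is the fairness property the rotation_id scheme gives up.

```python
# Toy lag-based flexible scheduler: 5 events competing for 2 counters.
# Each tick, every event is owed one unit of virtual time; the events with
# the biggest accumulated lag run and pay back their owed time.
NUM_COUNTERS = 2
events = list("ABCDE")
lag = {ev: 0.0 for ev in events}     # virtual time each event is owed
runtime = {ev: 0 for ev in events}   # ticks each event actually ran

for _ in range(1000):
    for ev in events:
        lag[ev] += 1.0               # waiting events accumulate lag
    # schedule the NUM_COUNTERS events with the biggest lag
    running = sorted(events, key=lambda ev: -lag[ev])[:NUM_COUNTERS]
    for ev in running:
        lag[ev] -= len(events) / NUM_COUNTERS  # running pays back owed time
        runtime[ev] += 1

print(runtime)  # {'A': 400, 'B': 400, 'C': 400, 'D': 400, 'E': 400}
```

Unlike the fixed-rotation scheme, no counter is ever left idle while a runnable event waits, and events that fall behind automatically catch up.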
Re: [RFC] perf: a different approach to perf_rotate_context()
On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> When there are more perf_events than hardware PMCs, perf rotates events
> so that all events get a chance to run. Currently, the rotation works as:
>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   try sched_in flexible_groups in cpuctx->ctx;
>   try sched_in flexible_groups in cpuctx->task_ctx.
>
> This approach has some potential issues:
>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>      all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>      at all.
>   2. if pinned_groups occupy all hardware PMCs, the rotation still
>      triggers every perf_event_mux_interval_ms, but it cannot schedule
>      any events.
>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>      rotated separately, there are N x M possible combinations. It is
>      difficult to remember and reuse all the rotation combinations. As
>      a result, it is necessary to try sched_in the flexible_groups on
>      each rotation.
>
> This patch does the rotation differently. Each perf_event in the cpuctx
> (ctx and task_ctx) is assigned a rotation_id. The rotation_ids are
> assigned during the first few rotations after any change in the
> perf_events attached to the cpuctx. Once rotation_ids are assigned for
> all events in the cpuctx, perf_rotate_context() simply picks the next
> rotation to use, so there is no more "try to sched_in" for future
> rotations.
>
> Special rotation_ids are introduced to handle the issues above.
> flexible_groups that conflict with pinned_groups are marked ALWAYS_OFF,
> so they are not rotated (fixes issue 2). flexible_groups in cpuctx->ctx
> and cpuctx->task_ctx are rotated together, so they all get an equal
> chance to run (improves issue 1).

hum, so the improvement is that cpuctx could eventually give
up some space for task_ctx events, but both ctxs still rotate
separately no? you keep rotation ID per single context..

> With this approach, we only do the complex scheduling of flexible_groups
> once. This enables us to do more complex scheduling, for example,
> sharing PMU counters across compatible events:
>   https://lkml.org/lkml/2017/12/1/410

how could this code do that?

> There are also some potential downsides of this approach.
>
> First, it gives all flexible_groups exactly the same chance to run, so
> it may waste some PMC cycles. For example, if 5 groups, ABCDE, are
> assigned to two rotations, rotation-0: ABCD and rotation-1: E, this
> approach will NOT try any of ABCD in rotation-1.
>
> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
> exactly the same priority and an equal chance to run. I am not sure
> whether this will change the behavior in some use cases.
>
> Please kindly let me know whether this approach makes sense.

SNIP

> +/*
> + * identify always_on and always_off groups in flexible_groups, call
> + * group_sched_in() for always_on groups
> + */
> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
> +					  struct perf_cpu_context *cpuctx)
> +{
> +	struct perf_event *event;
> +
> +	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> +		if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
> +			ctx->nr_sched++;
> +			WARN_ON(group_sched_in(event, cpuctx, ctx));
> +			continue;
> +		}
> +		if (group_sched_in(event, cpuctx, ctx)) {
> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
> +			ctx->nr_sched++;

should ctx->nr_sched be incremented in the 'else' leg?

> +		}
> +		group_sched_out(event, cpuctx, ctx);
> +	}
> +}
> +
> +/* add unassigned flexible_groups to new rotation_id */
> +static void ctx_add_rotation(struct perf_event_context *ctx,
> +			     struct perf_cpu_context *cpuctx)
>  {
>  	struct perf_event *event;
> +	int group_added = 0;
>  	int can_add_hw = 1;
>
> +	ctx->curr_rotation = ctx->num_rotations;
> +	ctx->num_rotations++;
> +
>  	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>  		/* Ignore events in OFF or ERROR state */
>  		if (event->state <= PERF_EVENT_STATE_OFF)
> @@ -3034,13 +3099,77 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>
> +		if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED)
> +			continue;
> +
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> +			else {
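Since the quoted ctx_add_rotation() hunk is cut off, here is a toy Python model of what it appears to do (my interpretation; the names and the counter-fit logic are illustrative, not from the patch): open a new rotation id, then greedily sched_in the still-unassigned flexible groups, stamping each group that fits with the new id and leaving the rest for the next rotation.

```python
# Toy model of ctx_add_rotation(): each call opens a new rotation and
# assigns it to the unassigned flexible groups that still fit on the PMU.
NOT_ASSIGNED = None

def ctx_add_rotation(events, rotation_id, state, num_counters):
    """events: list of (name, counters_needed); state holds num_rotations."""
    state["curr_rotation"] = state["num_rotations"]
    state["num_rotations"] += 1
    free = num_counters
    for name, needed in events:
        if rotation_id[name] is not NOT_ASSIGNED:
            continue                 # already placed in an earlier rotation
        if needed <= free:           # stands in for group_sched_in() succeeding
            rotation_id[name] = state["curr_rotation"]
            free -= needed

state = {"num_rotations": 0}
rid = {"A": NOT_ASSIGNED, "B": NOT_ASSIGNED, "C": NOT_ASSIGNED}
events = [("A", 2), ("B", 2), ("C", 1)]
ctx_add_rotation(events, rid, state, num_counters=4)  # rotation 0 takes A, B
ctx_add_rotation(events, rid, state, num_counters=4)  # rotation 1 takes C
print(rid)  # {'A': 0, 'B': 0, 'C': 1}
```

Once every event has a rotation id, there is nothing left to "try": each later perf_rotate_context() just advances curr_rotation, which is the cheap steady state the RFC is after.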