Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-12 Thread Song Liu


> On Mar 3, 2018, at 7:26 AM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> 
>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure whether
>> this will change the behavior in some use cases.
>> 
>> Please kindly let me know whether this approach makes sense.
> 
> What you've not said, and what is not at all clear, is whether your scheme
> preserves fairness.
> 
> 
> In any case, there's a ton of conflict against the patches here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
> 
> And with those the idea was to move to a virtual time based scheduler
> (basically schedule those flexible events that have the biggest lag --
> that also solves 1).

While looking at these patches, I found they might not solve issue #1
(cpuctx->task_ctx->flexible_groups starvation). Here is an example on an Intel
CPU (where ref-cycles can only use one hardware counter):

First, in one console start:
   perf stat -e ref-cycles -I 1

Second, in another console run: 
   perf stat -e ref-cycles -- benchmark

The second event will not run because the first event occupies the counter
all the time. 

Maybe we can solve this by combining the two flexible_groups (cpuctx->ctx
and cpuctx->task_ctx) and rotating them together? If this sounds reasonable,
I will draft an RFC for it.
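To make "rotating them together" concrete, here is a toy userspace model (plain
C, not kernel code; the two-entry ring and the event names are only for
illustration). With one suitable hardware counter and one ref-cycles event in
each context, a single combined rotation alternates the counter between the two
events instead of letting the cpuctx->ctx event keep it on every rotation:

#include <stdio.h>

struct event { const char *name; };

int main(void)
{
	/* one flexible ref-cycles event per context, one shared counter */
	struct event ring[2] = {
		{ "cpuctx->ctx: ref-cycles (perf stat -I)" },
		{ "cpuctx->task_ctx: ref-cycles (benchmark)" },
	};
	int head = 0;

	for (int tick = 0; tick < 6; tick++) {
		/* the head of the combined ring gets the counter this tick */
		printf("tick %d: %s\n", tick, ring[head].name);
		/* rotate the combined list, not each context on its own */
		head = (head + 1) % 2;
	}
	return 0;
}

Under the current scheme the cpuctx->ctx list is scheduled first on every
rotation, so with a single suitable counter the task-context event never gets a
slot; a combined rotation gives each event half of the ticks.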

Thanks,
Song




Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-05 Thread Alexey Budankov
On 03.03.2018 20:48, Peter Zijlstra wrote:
> On Sat, Mar 03, 2018 at 04:43:16PM +, Song Liu wrote:
>>> In any case, there's a ton of conflict against the patches here:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
>>>
>>> And with those the idea was to move to a virtual time based scheduler
>>> (basically schedule those flexible events that have the biggest lag --
>>> that also solves 1).
>>
>> Thanks for this information. I will study this approach. Maybe that is
>> our path to PMU sharing.
> 
> So I'm really not convinced on that whole PMU sharing.
> 
>> What is the status of this work? Will it
>> land in 4.17?
> 
> These patches might make 4.17; they got held up because of the whole
> meltdown/spectre crap, and I need to get back to them.
> 

That work has long been desired and would bring a performance boost, specifically
on server systems in per-process profiling mode, along with a good speedup on
context switches. Undoubtedly, the meltdown/spectre related activity displaced
it at some point, but those improvements would still bring significant value
and are still awaited.

BR,
Alexey


Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-03 Thread Peter Zijlstra
On Sat, Mar 03, 2018 at 04:43:16PM +, Song Liu wrote:
> > In any case, there's a ton of conflict against the patches here:
> > 
> > https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
> > 
> > And with those the idea was to move to a virtual time based scheduler
> > (basically schedule those flexible events that have the biggest lag --
> > that also solves 1).
> 
> Thanks for this information. I will study this approach. Maybe that is
> our path to PMU sharing.

So I'm really not convinced on that whole PMU sharing.

> What is the status of this work? Will it
> land in 4.17?

These patches might make 4.17; they got held up because of the whole
meltdown/spectre crap, and I need to get back to them.


Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-03 Thread Song Liu


> On Mar 3, 2018, at 7:26 AM, Peter Zijlstra  wrote:
> 
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>> When there are more perf_event's than hardware PMCs, perf rotates events
>> so that all events get a chance to run. Currently, the rotation works as:
>>  sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  try sched_in flexible_groups in cpuctx->ctx;
>>  try sched_in flexible_groups in cpuctx->task_ctx.
>> 
>> This approach has some potential issues:
>>  1. if different rotations of flexible_groups in cpuctx->ctx occupy
>> all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>> at all.
>>  2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
>> every perf_event_mux_interval_ms, but it cannot schedule any events.
>>  3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>> rotated separately, there are N x M possible combinations. It is
>> difficult to remember all the rotation combinations and reuse these
>> combinations. As a result, it is necessary to try sched_in the
>> flexible_groups on each rotation.
>> 
>> This patch tries to do the rotation differently. Each perf_event in the
>> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_id's
>> are assigned during the first few rotations after any changes in
>> perf_events attached to the cpuctx. Once all the rotation_id's are
>> assigned for all events in the cpuctx, perf_rotate_context() simply
>> picks the next rotation to use, so there is no more "try to sched_in"
>> for future rotations.
> 
> I'm not following.

Let me try to explain it in different words. The main goal of this approach
is to pre-compute all the rotations, so perf_event scheduling is less
expensive. It does change the current scheduling mechanism by introducing a
rotation_id for each event. With rotation_id, all events in the
flexible_groups have exactly the same chance to run:
   if num_rotations == 2, every flexible event runs 50% of the time;
   if num_rotations == 3, every flexible event runs 33% of the time;
etc.
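As a toy illustration of the pre-computed scheme (plain userspace C, not the
patch itself; the event names and counts are made up), each mux tick just
advances curr_rotation and runs the events whose rotation_id matches, with no
per-tick "try to sched_in":

#include <stdio.h>

struct event { const char *name; int rotation_id; };

int main(void)
{
	/* five flexible groups, pre-assigned to two rotations */
	struct event ev[] = {
		{ "A", 0 }, { "B", 0 }, { "C", 0 }, { "D", 0 }, { "E", 1 },
	};
	int num_rotations = 2, curr_rotation = 0;

	for (int tick = 0; tick < 4; tick++) {
		printf("tick %d (rotation %d):", tick, curr_rotation);
		for (int i = 0; i < 5; i++)
			if (ev[i].rotation_id == curr_rotation)
				printf(" %s", ev[i].name);
		printf("\n");
		/* no "try to sched_in": just pick the next rotation */
		curr_rotation = (curr_rotation + 1) % num_rotations;
	}
	return 0;
}

With num_rotations == 2, groups A-D run on the even ticks and E on the odd
ticks, so every group gets exactly 50% of the time; it also makes the waste
discussed below visible, since A-D sit idle on the ticks where only E runs.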


>> Special rotation_id's are introduced to handle the issues above.
>> flexible_groups that conflict with pinned_groups are marked as
>> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
>> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
>> an equal chance to run (improves issue 1).
> 
> I can't say I particularly care about 2; that's a really daft situation
> to get into. And changing 1 needs careful consideration.
> 
>> With this approach, we only do complex scheduling of flexible_groups
>> once. This enables us to do more complex scheduling, for example, sharing
>> PMU counters across compatible events:
>>   https://lkml.org/lkml/2017/12/1/410.
>> 
>> There are also some potential downsides of this approach.
>> 
>> First, it gives all flexible_groups exactly the same chance to run, so it
>> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
>> to two rotations: rotation-0: ABCD and rotation-1: E, this approach will
>> NOT try any of ABCD in rotation-1.
> 
> Yeah, that doesn't look acceptable. In fact, people already complained
> about the current scheme not being optimal; what you propose is _far_
> worse.

I agree this is an issue. Our initial goal is to get more events running
with PMU sharing (Tejun's RFC). However, I found that the sharing is difficult
to implement with the current scheduling scheme. This RFC tries to pave the
road for PMU sharing. If there are other ways (a virtual time based
scheduler, etc.) that make PMU sharing possible, we will be happy to start
from those.

>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure whether
>> this will change the behavior in some use cases.
>> 
>> Please kindly let me know whether this approach makes sense.
> 
> What you've not said, and what is not at all clear, is whether your scheme
> preserves fairness.

This approach gives all events in ctx->flexible_groups and
task_ctx->flexible_groups exactly the same chance to run. I am not sure
whether that is acceptable in terms of fairness.

> 
> 
> In any case, there's a ton of conflict against the patches here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing
> 
> And with those the idea was to move to a virtual time based scheduler
> (basically schedule those flexible events that have the biggest lag --
> that also solves 1).

Thanks for this information. I will study this approach. Maybe that is
our path to PMU sharing. What is the status of this work? Will it
land in 4.17?

Thanks again,
Song



Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-03 Thread Song Liu


> On Mar 3, 2018, at 5:39 AM, Jiri Olsa  wrote:
> 
> On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
>> When there are more perf_event's than hardware PMCs, perf rotates events
>> so that all events get a chance to run. Currently, the rotation works as:
>>  sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>>  try sched_in flexible_groups in cpuctx->ctx;
>>  try sched_in flexible_groups in cpuctx->task_ctx.
>> 
>> This approach has some potential issues:
>>  1. if different rotations of flexible_groups in cpuctx->ctx occupy
>> all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>> at all.
>>  2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
>> every perf_event_mux_interval_ms, but it cannot schedule any events.
>>  3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>> rotated separately, there are N x M possible combinations. It is
>> difficult to remember all the rotation combinations and reuse these
>> combinations. As a result, it is necessary to try sched_in the
>> flexible_groups on each rotation.
>> 
>> This patch tries to do the rotation differently. Each perf_event in the
>> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_id's
>> are assigned during the first few rotations after any changes in
>> perf_events attached to the cpuctx. Once all the rotation_id's are
>> assigned for all events in the cpuctx, perf_rotate_context() simply
>> picks the next rotation to use, so there is no more "try to sched_in"
>> for future rotations.
>> 
>> Special rotation_id's are introduced to handle the issues above.
>> flexible_groups that conflict with pinned_groups are marked as
>> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
>> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
>> an equal chance to run (improves issue 1).
> 
> hum, so the improvement is that cpuctx could eventually give
> up some space for task_ctx events, but both ctxs still rotate
> separately no? you keep rotation ID per single context..

With this approach, both ctxs are rotated together. It is possible that
cpuctx->ctx only has events for rotations 0 and 1, while cpuctx->task_ctx
only has events for rotations 2 and 3; but both of them will rotate among
0, 1, 2, 3.

num_rotations and curr_rotation could be part of cpuctx, as they are
eventually shared between the two contexts.
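For illustration only (the posted RFC keeps these fields per
perf_event_context; moving them into cpuctx is hypothetical), the shared state
and the per-tick step could look like:

/*
 * Hypothetical sketch: rotation state shared by cpuctx->ctx and
 * cpuctx->task_ctx, advanced once per mux interval for both contexts.
 */
struct cpuctx_rotation {
	int num_rotations;	/* rotations defined across both contexts */
	int curr_rotation;	/* rotation scheduled on this mux tick */
};

static inline void cpuctx_advance_rotation(struct cpuctx_rotation *rot)
{
	if (rot->num_rotations)
		rot->curr_rotation = (rot->curr_rotation + 1) % rot->num_rotations;
}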

>> 
>> With this approach, we only do complex scheduling of flexible_groups
>> once. This enables us to do more complex scheduling, for example, sharing
>> PMU counters across compatible events:
>>   https://lkml.org/lkml/2017/12/1/410.
> 
> how could this code do that?

We still need a lot more work to get PMU sharing to work. I think one of the
problems with Tejun's RFC is the more expensive scheduling. This RFC tries to
pre-compute all rotations, so we only need to do the expensive scheduling
once.

> 
>> 
>> There are also some potential downsides of this approach.
>> 
>> First, it gives all flexible_groups exactly the same chance to run, so it
>> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
>> to two rotations: rotation-0: ABCD and rotation-1: E, this approach will
>> NOT try any of ABCD in rotation-1.
>> 
>> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
>> exactly the same priority and an equal chance to run. I am not sure whether
>> this will change the behavior in some use cases.
>> 
>> Please kindly let me know whether this approach makes sense.
> 
> SNIP
> 
>> +/*
>> + * identify always_on and always_off groups in flexible_groups, call
>> + * group_sched_in() for always_on groups
>> + */
>> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
>> +					  struct perf_cpu_context *cpuctx)
>> +{
>> +	struct perf_event *event;
>> +
>> +	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>> +		if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
>> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
>> +			ctx->nr_sched++;
>> +			WARN_ON(group_sched_in(event, cpuctx, ctx));
>> +			continue;
>> +		}
>> +		if (group_sched_in(event, cpuctx, ctx)) {
>> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
>> +			ctx->nr_sched++;
> 
> should ctx->nr_sched be incremented in the 'else' leg?

The else leg means the event does not conflict with pinned groups, so it 
will be scheduled later in ctx_add_rotation(). This function only handles
always_on and always_off events. 
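To spell out the three outcomes, this is just the posted hunk restated with
comments (no new behavior added):

	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
		if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
			/* software groups always fit: keep them on permanently */
			event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
			ctx->nr_sched++;
			WARN_ON(group_sched_in(event, cpuctx, ctx));
			continue;
		}
		if (group_sched_in(event, cpuctx, ctx)) {
			/*
			 * Does not fit even with every other flexible group
			 * off: it conflicts with pinned_groups, so never
			 * rotate it in.
			 */
			event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
			ctx->nr_sched++;
		}
		/*
		 * else: rotation_id stays unassigned and nr_sched is not
		 * bumped; ctx_add_rotation() will place the group in a
		 * rotation later.
		 */
		group_sched_out(event, cpuctx, ctx);
	}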

> 
>> +		}
>> +		group_sched_out(event, cpuctx, ctx);
>> +	}
>> +}
>> +
>> +/* add unassigned flexible_groups to new rotation_id */
>> +static void ctx_add_rotation(struct perf_event_context *ctx,
>> +			     struct perf_cpu_context *cpuctx)

Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-03 Thread Peter Zijlstra
On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> When there are more perf_event's than hardware PMCs, perf rotates events
> so that all events get a chance to run. Currently, the rotation works as:
>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   try sched_in flexible_groups in cpuctx->ctx;
>   try sched_in flexible_groups in cpuctx->task_ctx.
> 
> This approach has some potential issues:
>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>  all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>  at all.
>   2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
>  every perf_event_mux_interval_ms, but it cannot schedule any events.
>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>  rotated separately, there are N x M possible combinations. It is
>  difficult to remember all the rotation combinations and reuse these
>  combinations. As a result, it is necessary to try sched_in the
>  flexible_groups on each rotation.
> 
> This patch tries to do the rotation differently. Each perf_event in the
> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_id's
> are assigned during the first few rotations after any changes in
> perf_events attached to the cpuctx. Once all the rotation_id's are
> assigned for all events in the cpuctx, perf_rotate_context() simply
> picks the next rotation to use, so there is no more "try to sched_in"
> for future rotations.

I'm not following.

> Special rotation_id's are introduced to handle the issues above.
> flexible_groups that conflict with pinned_groups are marked as
> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
> an equal chance to run (improves issue 1).

I can't say I particularly care about 2; that's a really daft situation
to get into. And changing 1 needs careful consideration.

> With this approach, we only do complex scheduling of flexible_groups
> once. This enables us to do more complex scheduling, for example, sharing
> PMU counters across compatible events:
>https://lkml.org/lkml/2017/12/1/410.
> 
> There are also some potential downsides of this approach.
> 
> First, it gives all flexible_groups exactly the same chance to run, so it
> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
> to two rotations: rotation-0: ABCD and rotation-1: E, this approach will
> NOT try any of ABCD in rotation-1.

Yeah, that doesn't look acceptable. In fact, people already complained
about the current scheme not being optimal; what you propose is _far_
worse.

> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
> exactly the same priority and an equal chance to run. I am not sure whether
> this will change the behavior in some use cases.
> 
> Please kindly let me know whether this approach makes sense.

What you've not said, and what is not at all clear, is whether your scheme
preserves fairness.


In any case, there's a ton of conflict against the patches here:

https://git.kernel.org/pub/scm/linux/kernel/git/peterz/queue.git/log/?h=perf/testing

And with those the idea was to move to a virtual time based scheduler
(basically schedule those flexible events that have the biggest lag --
that also solves 1).
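To illustrate the "biggest lag" idea, here is a toy userspace model of the
concept only (it is not taken from the perf/testing patches, and the event and
counter counts are made up): every tick, the events that have waited longest
get the counters, and whichever event sat out accumulates lag, so nothing is
starved and no counters are left idle.

#include <stdio.h>

#define NR_EVENTS   5
#define NR_COUNTERS 4

struct event { const char *name; long lag; };

int main(void)
{
	struct event ev[NR_EVENTS] = {
		{ "A" }, { "B" }, { "C" }, { "D" }, { "E" },
	};

	for (int tick = 0; tick < 5; tick++) {
		printf("tick %d:", tick);
		/* fill each counter with the unpicked event of biggest lag */
		for (int slot = 0; slot < NR_COUNTERS; slot++) {
			int best = -1;
			for (int i = 0; i < NR_EVENTS; i++)
				if (ev[i].lag >= 0 &&
				    (best < 0 || ev[i].lag > ev[best].lag))
					best = i;
			printf(" %s", ev[best].name);
			ev[best].lag = -1;	/* scheduled this tick */
		}
		printf("\n");
		/* scheduled events reset; the one left out accumulates lag */
		for (int i = 0; i < NR_EVENTS; i++)
			ev[i].lag = (ev[i].lag < 0) ? 0 : ev[i].lag + 1;
	}
	return 0;
}

Unlike the fixed-rotation example earlier in the thread, all four counters stay
busy on every tick, and the event that sat out is guaranteed to run on the next
one.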






Re: [RFC] perf: a different approach to perf_rotate_context()

2018-03-03 Thread Jiri Olsa
On Thu, Mar 01, 2018 at 11:53:21AM -0800, Song Liu wrote:
> When there are more perf_event's than hardware PMCs, perf rotates events
> so that all events get a chance to run. Currently, the rotation works as:
>   sched_out flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   rotate_left flexible_groups in cpuctx->ctx and cpuctx->task_ctx;
>   try sched_in flexible_groups in cpuctx->ctx;
>   try sched_in flexible_groups in cpuctx->task_ctx.
> 
> This approach has some potential issues:
>   1. if different rotations of flexible_groups in cpuctx->ctx occupy
>  all hardware PMCs, flexible_groups in cpuctx->task_ctx cannot run
>  at all.
>   2. if pinned_groups occupy all hardware PMCs, the rotation still triggers
>  every perf_event_mux_interval_ms, but it cannot schedule any events.
>   3. since flexible_groups in cpuctx->ctx and cpuctx->task_ctx are
>  rotated separately, there are N x M possible combinations. It is
>  difficult to remember all the rotation combinations and reuse these
>  combinations. As a result, it is necessary to try sched_in the
>  flexible_groups on each rotation.
> 
> This patch tries to do the rotation differently. Each perf_event in the
> cpuctx (ctx and task_ctx) is assigned a rotation_id. The rotation_id's
> are assigned during the first few rotations after any changes in
> perf_events attached to the cpuctx. Once all the rotation_id's are
> assigned for all events in the cpuctx, perf_rotate_context() simply
> picks the next rotation to use, so there is no more "try to sched_in"
> for future rotations.
> 
> Special rotation_id's are introduced to handle the issues above.
> flexible_groups that conflict with pinned_groups are marked as
> ALWAYS_OFF, so they are not rotated (fixes issue 2). flexible_groups
> in cpuctx->ctx and cpuctx->task_ctx are rotated together, so they all get
> an equal chance to run (improves issue 1).

hum, so the improvement is that cpuctx could eventually give
up some space for task_ctx events, but both ctxs still rotate
separately no? you keep rotation ID per single context..

> 
> With this approach, we only do complex scheduling of flexible_groups
> once. This enables us to do more complex scheduling, for example, sharing
> PMU counters across compatible events:
>https://lkml.org/lkml/2017/12/1/410.

how could this code do that?

> 
> There are also some potential downsides of this approach.
> 
> First, it gives all flexible_groups exactly the same chance to run, so it
> may waste some PMC cycles. For example, if 5 groups, ABCDE, are assigned
> to two rotations: rotation-0: ABCD and rotation-1: E, this approach will
> NOT try any of ABCD in rotation-1.
> 
> Second, flexible_groups in cpuctx->ctx and cpuctx->task_ctx now have
> exactly the same priority and an equal chance to run. I am not sure whether
> this will change the behavior in some use cases.
> 
> Please kindly let me know whether this approach makes sense.

SNIP

> +/*
> + * identify always_on and always_off groups in flexible_groups, call
> + * group_sched_in() for always_on groups
> + */
> +static void ctx_pick_always_on_off_groups(struct perf_event_context *ctx,
> +					   struct perf_cpu_context *cpuctx)
> +{
> +	struct perf_event *event;
> +
> +	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
> +		if (event->group_caps & PERF_EV_CAP_SOFTWARE) {
> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_ON;
> +			ctx->nr_sched++;
> +			WARN_ON(group_sched_in(event, cpuctx, ctx));
> +			continue;
> +		}
> +		if (group_sched_in(event, cpuctx, ctx)) {
> +			event->rotation_id = PERF_ROTATION_ID_ALWAYS_OFF;
> +			ctx->nr_sched++;

should ctx->nr_sched be incremented in the 'else' leg?

> +		}
> +		group_sched_out(event, cpuctx, ctx);
> +	}
> +}
> +
> +/* add unassigned flexible_groups to new rotation_id */
> +static void ctx_add_rotation(struct perf_event_context *ctx,
> +			     struct perf_cpu_context *cpuctx)
>  {
>  	struct perf_event *event;
> +	int group_added = 0;
>  	int can_add_hw = 1;
>  
> +	ctx->curr_rotation = ctx->num_rotations;
> +	ctx->num_rotations++;
> +
>  	list_for_each_entry(event, &ctx->flexible_groups, group_entry) {
>  		/* Ignore events in OFF or ERROR state */
>  		if (event->state <= PERF_EVENT_STATE_OFF)
> @@ -3034,13 +3099,77 @@ ctx_flexible_sched_in(struct perf_event_context *ctx,
>  		if (!event_filter_match(event))
>  			continue;
>  
> +		if (event->rotation_id != PERF_ROTATION_ID_NOT_ASSGINED)
> +			continue;
> +
>  		if (group_can_go_on(event, cpuctx, can_add_hw)) {
>  			if (group_sched_in(event, cpuctx, ctx))
>  				can_add_hw = 0;
> +			else {
>