Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-31 Thread Alexey Budankov
On 15.08.2017 20:28, Alexey Budankov wrote:
> Hi Peter,
> 
> On 07.08.2017 10:17, Alexey Budankov wrote:
>> On 04.08.2017 17:36, Peter Zijlstra wrote:
>>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
 On 03.08.2017 16:00, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>>
>> +/*
>> + * Find group list by a cpu key and rotate it.
>> + */
>> +static void
>> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
>> +{
>> +        struct rb_node *node;
>> +        struct perf_event *node_event;
>> +
>> +        node = groups->rb_node;
>> +
>> +        while (node) {
>> +                node_event = container_of(node,
>> +                                struct perf_event, group_node);
>> +
>> +                if (cpu < node_event->cpu) {
>> +                        node = node->rb_left;
>> +                } else if (cpu > node_event->cpu) {
>> +                        node = node->rb_right;
>> +                } else {
>> +                        list_rotate_left(&node_event->group_list);
>> +                        break;
>> +                }
>> +        }
>> +}
>
> Ah, you worry about how to rotate inside a tree?

 Exactly.

>
> You can do that by adding (run)time based ordering, and you'll end up
> with a runtime based scheduler.

 Do you mean replacing a CPU indexed rb_tree of lists with 
 a CPU indexed rb_tree of counter indexed rb_trees?
>>>
>>> No, single tree, just more complicated ordering rules.
>>>
> A trivial variant keeps a simple counter per tree that is incremented
> for each rotation. That should end up with the events ordered exactly
> like the list. And if you have that comparator like above, expressing
> that additional ordering becomes simple ;-)
>
> Something like:
>
> struct group {
>   u64 vtime;
>   rb_tree tree;
> };
>
> bool event_less(left, right)
> {
>   if (left->cpu < right->cpu)
> return true;
>
>   if (left->cpu > right_cpu)
> return false;
>
>   if (left->vtime < right->vtime)
> return true;
>
>   return false;
> }
>
> insert_group(group, event, tail)
> {
>   if (tail)
> event->vtime = ++group->vtime;
>
>   tree_insert(&group->root, event);
> }
>
> Then every time you use insert_group(.tail=1) it goes to the end of that
> CPU's 'list'.
>

 Could you elaborate more on how to implement rotation?
>>>
>>> Its almost all there, but let me write a complete replacement for your
>>> perf_event_group_rotate() above.
>>>
>>> /* find the leftmost event matching @cpu */
>>> /* XXX not sure how to best parametrise a subtree search, */
>>> /* again, C sucks... */
>>> struct perf_event *__group_find_cpu(group, cpu)
>>> {
>>>         struct rb_node *node = group->tree.rb_node;
>>>         struct perf_event *event, *match = NULL;
>>>
>>>         while (node) {
>>>                 event = container_of(node, struct perf_event, group_node);
>>>
>>>                 if (cpu > event->cpu) {
>>>                         node = node->rb_right;
>>>                 } else if (cpu < event->cpu) {
>>>                         node = node->rb_left;
>>>                 } else {
>>>                         /*
>>>                          * subtree match, try left subtree for a
>>>                          * 'smaller' match.
>>>                          */
>>>                         match = event;
>>>                         node = node->rb_left;
>>>                 }
>>>         }
>>>
>>>         return match;
>>> }
>>>
>>> void perf_event_group_rotate(group, int cpu)
>>> {
>>>         struct perf_event *event = __group_find_cpu(cpu);
>>>
>>>         if (!event)
>>>                 return;
>>>
>>>         tree_delete(&group->tree, event);
>>>         insert_group(group, event, 1);
>>> }
>>>
>>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>>> then take it out and re-insert it with a vtime number larger than any
>>> other, which places it as the rightmost entry for that cpu.
>>>
>>>
>>> So given:
>>>
>>>         {1,1}
>>>         /   \
>>>     {0,5}   {1,2}
>>>     /   \       \
>>> {0,1}   {0,6}   {1,4}
>>>
>>>
>>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
>>> resulting in something like:
>>>
>>>         {1,2}
>>>         /   \
>>>     {0,5}   {1,4}
>>>     /   \       \
>>> {0,1}   {0,6}   {1,7}
>>>
>>
>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
>> may be several rotations of potentially big tree on the critical path, 
>> instead of updating four pointers in case of the tree of lists.
> 
> I implemented the approach you had suggested (as I understood it),
> tested it and got results that are drastically diffe

Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-31 Thread Alexey Budankov
On 30.08.2017 14:16, Alexey Budankov wrote:
> On 30.08.2017 11:30, Alexey Budankov wrote:
>> On 29.08.2017 16:51, Alexander Shishkin wrote:
>>> Alexey Budankov  writes:
>>>
 Now I figured that not all indexed events are always located under 
 the root with the same cpu, and it depends on the order of insertion
 e.g. with insertion order 01,02,03,14,15,16 we get this:

       02
      /  \
    01    14
         /  \
       03    15
               \
                16

 and it is unclear how to iterate cpu==0 part of tree in this case.
>>>
>>> Using this example, rb_next() should take you through the nodes in this
>>> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
>>> while event->cpu==cpu using rb_next() and you should be fine.
>>
>> Well, indeed we get the leftmost leaf (03) from rb_next() for the case above.
>>
>>>
 Iterating cpu specific subtree like this:

 #define for_each_group_event(event, group, cpu, pmu, field) \
for (event = rb_entry_safe(group_first(group, cpu, pmu), \
   typeof(*event), field);   \
 event && event->cpu == cpu && event->pmu == pmu;\
 event = rb_entry_safe(rb_next(&event->field),   \
   typeof(*event), field))
>>>
>>> Afaict, this assumes that you are also ordering on event->pmu, which
>>> should be reflected in your _less function. And also assuming that
>>> group_first() is doing the right thing. Can we see the code?
>>
>> I didn't do ordering by PMU for this patch set. What's more, I implemented 
>> groups_first() like this:
>>
>> static struct perf_event *
>> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
>> {
>>  struct perf_event *node_event = NULL;
>>  struct rb_node *node = NULL;
>>
>>  node = groups->tree.rb_node;
>>
>>  while (node) {
>>  node_event = container_of(node,
>>  struct perf_event, group_node);
>>
>>  if (cpu < node_event->cpu) {
>>  node = node->rb_left;
>>  } else if (cpu > node_event->cpu) {
>>  node = node->rb_right;
>>  } else {
>>  node = node->rb_left;
>>  }
>>  }
>>
>>  return node_event;
>> }
>>
>> and it doesn't work as expected for case above with cpu == 1.
>>
>> I corrected the code above to this:
>>
>> static struct perf_event *
>> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
>> {
>>  struct perf_event *node_event = NULL, *match = NULL;
>>  struct rb_node *node = NULL;
>>
>>  node = groups->tree.rb_node;
>>
>>  while (node) {
>>  node_event = container_of(node,
>>  struct perf_event, group_node);
>>
>>  if (cpu < node_event->cpu) {
>>  node = node->rb_left;
>>  } else if (cpu > node_event->cpu) {
>>  node = node->rb_right;
>>  } else {
>>  match = node_event;
>>  node = node->rb_left;
>>  }
>>  }
>>
>>  return match;
>> }
>>
>> but now struggling with silent oopses which I guess are not 
>> related to multiplexing at all.
> 
> Added logging into the code and now see this in dmesg output:
> 
> [  175.743879] BUG: unable to handle kernel paging request at 7fe2a90d1a54
> [  175.743899] IP: __task_pid_nr_ns+0x3b/0x90
> [  175.743903] PGD 2f317ca067 
> [  175.743906] P4D 2f317ca067 
> [  175.743910] PUD 0 
> 
> [  175.743926] Oops:  [#1] SMP
> [  175.743931] Modules linked in: fuse xt_CHECKSUM iptable_mangle 
> ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat 
> nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun 
> bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables nfsv3 
> rpcsec_gss_krb5 nfsv4 cmac arc4 md4 nls_utf8 cifs nfs ccm dns_resolver 
> fscache rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi 
> scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp 
> ib_ipoib rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm intel_rapl 
> sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp hfi1 crct10dif_pclmul 
> crc32_pclmul ghash_clmulni_intel intel_cstate rdmavt joydev ipmi_ssif 
> intel_uncore ib_core ipmi_si ipmi_devintf intel_rapl_perf iTCO_wdt 
> iTCO_vendor_support pcspkr tpm_tis tpm_tis_core
> [  175.744088]  mei_me tpm i2c_i801 ipmi_msghandler lpc_ich mei shpchp wmi 
> acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 
> drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit
> [  175.744156] CPU: 12 PID: 8272 Comm: perf Not tainted 4.13.0-rc4-v7.3.3+ #13
> [  175.744160] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS 
> S72C610.86B.01.01.0190.080520162104 08/05/2016
> [  175.744165] task: 90c47d4d task.stack: ae42d8fb

Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-30 Thread Alexey Budankov
On 30.08.2017 11:30, Alexey Budankov wrote:
> On 29.08.2017 16:51, Alexander Shishkin wrote:
>> Alexey Budankov  writes:
>>
>>> Now I figured that not all indexed events are always located under 
>>> the root with the same cpu, and it depends on the order of insertion
>>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>>
>>>       02
>>>      /  \
>>>    01    14
>>>         /  \
>>>       03    15
>>>               \
>>>                16
>>>
>>> and it is unclear how to iterate cpu==0 part of tree in this case.
>>
>> Using this example, rb_next() should take you through the nodes in this
>> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
>> while event->cpu==cpu using rb_next() and you should be fine.
> 
> Well, indeed we get the leftmost leaf (03) from rb_next() for the case above.
> 
>>
>>> Iterating cpu specific subtree like this:
>>>
>>> #define for_each_group_event(event, group, cpu, pmu, field)  \
>>> for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>>typeof(*event), field);   \
>>>  event && event->cpu == cpu && event->pmu == pmu;\
>>>  event = rb_entry_safe(rb_next(&event->field),   \
>>>typeof(*event), field))
>>
>> Afaict, this assumes that you are also ordering on event->pmu, which
>> should be reflected in your _less function. And also assuming that
>> group_first() is doing the right thing. Can we see the code?
> 
> I didn't do ordering by PMU for this patch set. What's more, I implemented 
> groups_first() like this:
> 
> static struct perf_event *
> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
> {
>   struct perf_event *node_event = NULL;
>   struct rb_node *node = NULL;
> 
>   node = groups->tree.rb_node;
> 
>   while (node) {
>   node_event = container_of(node,
>   struct perf_event, group_node);
> 
>   if (cpu < node_event->cpu) {
>   node = node->rb_left;
>   } else if (cpu > node_event->cpu) {
>   node = node->rb_right;
>   } else {
>   node = node->rb_left;
>   }
>   }
> 
>   return node_event;
> }
> 
> and it doesn't work as expected for case above with cpu == 1.
> 
> I corrected the code above to this:
> 
> static struct perf_event *
> perf_event_groups_first(struct perf_event_groups *groups, int cpu)
> {
>   struct perf_event *node_event = NULL, *match = NULL;
>   struct rb_node *node = NULL;
> 
>   node = groups->tree.rb_node;
> 
>   while (node) {
>   node_event = container_of(node,
>   struct perf_event, group_node);
> 
>   if (cpu < node_event->cpu) {
>   node = node->rb_left;
>   } else if (cpu > node_event->cpu) {
>   node = node->rb_right;
>   } else {
>   match = node_event;
>   node = node->rb_left;
>   }
>   }
> 
>   return match;
> }
> 
> but now struggling with silent oopses which I guess are not 
> related to multiplexing at all.

Added logging into the code and now see this in dmesg output:

[  175.743879] BUG: unable to handle kernel paging request at 7fe2a90d1a54
[  175.743899] IP: __task_pid_nr_ns+0x3b/0x90
[  175.743903] PGD 2f317ca067 
[  175.743906] P4D 2f317ca067 
[  175.743910] PUD 0 

[  175.743926] Oops:  [#1] SMP
[  175.743931] Modules linked in: fuse xt_CHECKSUM iptable_mangle 
ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat 
nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack libcrc32c tun bridge 
stp llc ebtable_filter ebtables ip6table_filter ip6_tables nfsv3 
rpcsec_gss_krb5 nfsv4 cmac arc4 md4 nls_utf8 cifs nfs ccm dns_resolver fscache 
rpcrdma ib_isert iscsi_target_mod ib_iser libiscsi scsi_transport_iscsi ib_srpt 
target_core_mod ib_srp scsi_transport_srp ib_ipoib rdma_ucm ib_ucm ib_uverbs 
ib_umad rdma_cm ib_cm iw_cm intel_rapl sb_edac x86_pkg_temp_thermal 
intel_powerclamp coretemp hfi1 crct10dif_pclmul crc32_pclmul 
ghash_clmulni_intel intel_cstate rdmavt joydev ipmi_ssif intel_uncore ib_core 
ipmi_si ipmi_devintf intel_rapl_perf iTCO_wdt iTCO_vendor_support pcspkr 
tpm_tis tpm_tis_core
[  175.744088]  mei_me tpm i2c_i801 ipmi_msghandler lpc_ich mei shpchp wmi 
acpi_power_meter acpi_pad nfsd auth_rpcgss nfs_acl lockd grace sunrpc mgag200 
drm_kms_helper ttm drm igb crc32c_intel ptp pps_core dca i2c_algo_bit
[  175.744156] CPU: 12 PID: 8272 Comm: perf Not tainted 4.13.0-rc4-v7.3.3+ #13
[  175.744160] Hardware name: Intel Corporation S7200AP/S7200AP, BIOS 
S72C610.86B.01.01.0190.080520162104 08/05/2016
[  175.744165] task: 90c47d4d task.stack: ae42d8fb
[  175.744177] RIP: 0010:__task_pid_nr_ns+0x3b/0x90
[  175.744181] RSP: 0018:90c4bbd05ae0 EFLAGS: 00010046
[  175.744190] RAX:  R

Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-30 Thread Alexander Shishkin
Alexey Budankov  writes:

> On 30.08.2017 13:18, Alexander Shishkin wrote:
>> Alexey Budankov  writes:
>> 
> Iterating cpu specific subtree like this:
>
> #define for_each_group_event(event, group, cpu, pmu, field)\
>   for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>  typeof(*event), field);   \
>event && event->cpu == cpu && event->pmu == pmu;\
>event = rb_entry_safe(rb_next(&event->field),   \
>  typeof(*event), field))

 Afaict, this assumes that you are also ordering on event->pmu, which
 should be reflected in your _less function. And also assuming that
 group_first() is doing the right thing. Can we see the code?
>>>
>>> I didn't do ordering by PMU for this patch set. What's more, I implemented 
>>> groups_first() like this:
>> 
>> Your iterator (quoted above) begs to differ.
>
> What do you specifically mean? I am doing iterations like this:

I mean the code that you've shown before, which is quoted above. It's
difficult to tell why something's not working if you don't show the
code.

Regards,
--
Alex


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-30 Thread Alexey Budankov
On 30.08.2017 13:18, Alexander Shishkin wrote:
> Alexey Budankov  writes:
> 
 Iterating cpu specific subtree like this:

 #define for_each_group_event(event, group, cpu, pmu, field) \
for (event = rb_entry_safe(group_first(group, cpu, pmu), \
   typeof(*event), field);   \
 event && event->cpu == cpu && event->pmu == pmu;\
 event = rb_entry_safe(rb_next(&event->field),   \
   typeof(*event), field))
>>>
>>> Afaict, this assumes that you are also ordering on event->pmu, which
>>> should be reflected in your _less function. And also assuming that
>>> group_first() is doing the right thing. Can we see the code?
>>
>> I didn't do ordering by PMU for this patch set. What's more, I implemented 
>> groups_first() like this:
> 
> Your iterator (quoted above) begs to differ.

What do you specifically mean? I am doing iterations like this:

/*
 * Iterate event groups thru the whole tree.
 */
#define perf_event_groups_for_each(event, groups, node) \
for (event = rb_entry_safe(rb_first(&((groups)->tree)), \
typeof(*event), node); event;   \
event = rb_entry_safe(rb_next(&event->node),\
typeof(*event), node))

/*
 * Iterate event groups with cpu == cpu_id.
 */
#define perf_event_groups_for_each_cpu(event, key, groups, node)\
for (event = perf_event_groups_first(groups, key);  \
event && event->cpu == key; \
event = rb_entry_safe(rb_next(&event->node),\
typeof(*event), node))

> 
> Regards,
> --
> Alex
> 

Thanks,
Alexey


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-30 Thread Alexander Shishkin
Alexey Budankov  writes:

>>> Iterating cpu specific subtree like this:
>>>
>>> #define for_each_group_event(event, group, cpu, pmu, field)  \
>>> for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>>>typeof(*event), field);   \
>>>  event && event->cpu == cpu && event->pmu == pmu;\
>>>  event = rb_entry_safe(rb_next(&event->field),   \
>>>typeof(*event), field))
>> 
>> Afaict, this assumes that you are also ordering on event->pmu, which
>> should be reflected in your _less function. And also assuming that
>> group_first() is doing the right thing. Can we see the code?
>
> I didn't do ordering by PMU for this patch set. What's more, I implemented 
> groups_first() like this:

Your iterator (quoted above) begs to differ.

Regards,
--
Alex


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-30 Thread Alexey Budankov
On 29.08.2017 16:51, Alexander Shishkin wrote:
> Alexey Budankov  writes:
> 
>> Now I figured that not all indexed events are always located under 
>> the root with the same cpu, and it depends on the order of insertion
>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>
>>       02
>>      /  \
>>    01    14
>>         /  \
>>       03    15
>>               \
>>                16
>>
>> and it is unclear how to iterate cpu==0 part of tree in this case.
> 
> Using this example, rb_next() should take you through the nodes in this
> order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
> while event->cpu==cpu using rb_next() and you should be fine.

Well, indeed we get the leftmost leaf (03) from rb_next() for the case above.

> 
>> Iterating cpu specific subtree like this:
>>
>> #define for_each_group_event(event, group, cpu, pmu, field)   \
>>  for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>> typeof(*event), field);   \
>>   event && event->cpu == cpu && event->pmu == pmu;\
>>   event = rb_entry_safe(rb_next(&event->field),   \
>> typeof(*event), field))
> 
> Afaict, this assumes that you are also ordering on event->pmu, which
> should be reflected in your _less function. And also assuming that
> group_first() is doing the right thing. Can we see the code?

I didn't do ordering by PMU for this patch set. What's more, I implemented 
groups_first() like this:

static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu)
{
        struct perf_event *node_event = NULL;
        struct rb_node *node = NULL;

        node = groups->tree.rb_node;

        while (node) {
                node_event = container_of(node,
                                struct perf_event, group_node);

                if (cpu < node_event->cpu) {
                        node = node->rb_left;
                } else if (cpu > node_event->cpu) {
                        node = node->rb_right;
                } else {
                        node = node->rb_left;
                }
        }

        return node_event;
}

and it doesn't work as expected for case above with cpu == 1.

I corrected the code above to this:

static struct perf_event *
perf_event_groups_first(struct perf_event_groups *groups, int cpu)
{
        struct perf_event *node_event = NULL, *match = NULL;
        struct rb_node *node = NULL;

        node = groups->tree.rb_node;

        while (node) {
                node_event = container_of(node,
                                struct perf_event, group_node);

                if (cpu < node_event->cpu) {
                        node = node->rb_left;
                } else if (cpu > node_event->cpu) {
                        node = node->rb_right;
                } else {
                        match = node_event;
                        node = node->rb_left;
                }
        }

        return match;
}

but now struggling with silent oopses which I guess are not 
related to multiplexing at all.

Please look at v8 for a while. It addresses your comments for v7.

> 
> Regards,
> --
> Alex
> 

Thanks,
Alexey


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-29 Thread Alexander Shishkin
Alexey Budankov  writes:

> Now I figured that not all indexed events are always located under 
> the root with the same cpu, and it depends on the order of insertion
> e.g. with insertion order 01,02,03,14,15,16 we get this:
>
>       02
>      /  \
>    01    14
>         /  \
>       03    15
>               \
>                16
>
> and it is unclear how to iterate cpu==0 part of tree in this case.

Using this example, rb_next() should take you through the nodes in this
order (assuming you start with 01): 01, 02, 03, 14, etc. So you iterate
while event->cpu==cpu using rb_next() and you should be fine.

> Iterating cpu specific subtree like this:
>
> #define for_each_group_event(event, group, cpu, pmu, field)\
>   for (event = rb_entry_safe(group_first(group, cpu, pmu), \
>  typeof(*event), field);   \
>event && event->cpu == cpu && event->pmu == pmu;\
>event = rb_entry_safe(rb_next(&event->field),   \
>  typeof(*event), field))

Afaict, this assumes that you are also ordering on event->pmu, which
should be reflected in your _less function. And also assuming that
group_first() is doing the right thing. Can we see the code?

Regards,
--
Alex
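
For illustration, a minimal sketch of the two points above: a comparator that
also orders on event->pmu, and an in-order walk that starts from the leftmost
matching node and stops once the {cpu, pmu} key changes. The ex_* names, the
simplified structure and its vtime field are stand-ins for this sketch only,
not code from the patch set:

#include <linux/rbtree.h>
#include <linux/types.h>

struct pmu;

/* Simplified stand-in carrying only the fields relevant to the ordering. */
struct ex_event {
        int                     cpu;
        struct pmu              *pmu;
        u64                     vtime;          /* rotation counter, per Peter's suggestion */
        struct rb_node          group_node;
};

/* Order by {cpu, pmu, vtime}; a tree insert routine would use this. */
static bool ex_event_less(struct ex_event *l, struct ex_event *r)
{
        if (l->cpu != r->cpu)
                return l->cpu < r->cpu;
        if (l->pmu != r->pmu)
                return (unsigned long)l->pmu < (unsigned long)r->pmu;
        return l->vtime < r->vtime;
}

/* Visit all events of one {cpu, pmu}, starting from the leftmost match. */
static void ex_walk(struct ex_event *first, int cpu, struct pmu *pmu)
{
        struct rb_node *node = first ? &first->group_node : NULL;
        struct ex_event *event;

        for (; node; node = rb_next(node)) {
                event = rb_entry(node, struct ex_event, group_node);
                if (event->cpu != cpu || event->pmu != pmu)
                        break;
                /* visit event here */
        }
}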


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-23 Thread Alexey Budankov
On 23.08.2017 16:39, Alexander Shishkin wrote:
> Alexey Budankov  writes:
> 
>> bool event_less(left, right)
>> {
>>   if (left->cpu < right->cpu)
>> return true;
>>
>>   if (left->cpu > right_cpu)
>> return false;
>>
>>   if (left->vtime < right->vtime)
>> return true;
>>
>>   return false;
>> }
>>
>> insert_group(group, event, tail)
>> {
>>   if (tail)
>> event->vtime = ++group->vtime;
>>
>>   tree_insert(&group->root, event);
>> }
> 
> [ ... ]
> 
>> 2. implementing special _less() function and rotation by re-inserting
>>group with incremented index;
>>
> 
> [ ... ]
> 
>> Now I figured that not all indexed events are always located under 
>> the root with the same cpu, and it depends on the order of insertion
>> e.g. with insertion order 01,02,03,14,15,16 we get this:
>>
>>       02
>>      /  \
>>    01    14
>>         /  \
>>       03    15
>>               \
>>                16
> 
> How did you arrive at this? Seeing the actual code would help, because
> this is not the ordering we're looking for. With Peter's earlier example
> (quoted above) it shouldn't look like this.

I implemented the solution Peter suggested. Then I was testing it and noticed
a considerable difference in the amount of collected samples when multiplexing
events, in comparison to the version with the tree of lists.

I then looked for a fast way to emulate the idea with a virtual index as 
a secondary key and found this RB tree emulator:

https://www.cs.usfca.edu/~galles/visualization/RedBlack.html

and it showed me the picture I mentioned above:

      02
     /  \
   01    14
        /  \
      03    15
              \
               16

I understand this is not 100% proof that the index idea doesn't work; 
however, it means that in order to apply the idea to this patch 
some more changes are required in addition to what Peter 
shared earlier.

> 
> Regards,
> --
> Alex
> 
> 
> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-23 Thread Alexander Shishkin
Alexey Budankov  writes:

> bool event_less(left, right)
> {
>   if (left->cpu < right->cpu)
> return true;
>
>   if (left->cpu > right_cpu)
> return false;
>
>   if (left->vtime < right->vtime)
> return true;
>
>   return false;
> }
>
> insert_group(group, event, tail)
> {
>   if (tail)
> event->vtime = ++group->vtime;
>
>   tree_insert(&group->root, event);
> }

[ ... ]

> 2. implementing special _less() function and rotation by re-inserting
>group with incremented index;
>

[ ... ]

> Now I figured that not all indexed events are always located under 
> the root with the same cpu, and it depends on the order of insertion
> e.g. with insertion order 01,02,03,14,15,16 we get this:
>
>       02
>      /  \
>    01    14
>         /  \
>       03    15
>               \
>                16

How did you arrive at this? Seeing the actual code would help, because
this is not the ordering we're looking for. With Peter's earlier example
(quoted above) it shouldn't look like this.

Regards,
--
Alex




Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-15 Thread Alexey Budankov
Hi Peter,

On 07.08.2017 10:17, Alexey Budankov wrote:
> On 04.08.2017 17:36, Peter Zijlstra wrote:
>> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>>> On 03.08.2017 16:00, Peter Zijlstra wrote:
 On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>>
> +/*
> + * Find group list by a cpu key and rotate it.
> + */
> +static void
> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
> +{
> + struct rb_node *node;
> + struct perf_event *node_event;
> +
> + node = groups->rb_node;
> +
> + while (node) {
> + node_event = container_of(node,
> + struct perf_event, group_node);
> +
> + if (cpu < node_event->cpu) {
> + node = node->rb_left;
> + } else if (cpu > node_event->cpu) {
> + node = node->rb_right;
> + } else {
> + list_rotate_left(&node_event->group_list);
> + break;
> + }
> + }
> +}

 Ah, you worry about how to rotate inside a tree?
>>>
>>> Exactly.
>>>

 You can do that by adding (run)time based ordering, and you'll end up
 with a runtime based scheduler.
>>>
>>> Do you mean replacing a CPU indexed rb_tree of lists with 
>>> a CPU indexed rb_tree of counter indexed rb_trees?
>>
>> No, single tree, just more complicated ordering rules.
>>
 A trivial variant keeps a simple counter per tree that is incremented
 for each rotation. That should end up with the events ordered exactly
 like the list. And if you have that comparator like above, expressing
 that additional ordering becomes simple ;-)

 Something like:

 struct group {
   u64 vtime;
   rb_tree tree;
 };

 bool event_less(left, right)
 {
   if (left->cpu < right->cpu)
 return true;

   if (left->cpu > right_cpu)
 return false;

   if (left->vtime < right->vtime)
 return true;

   return false;
 }

 insert_group(group, event, tail)
 {
   if (tail)
 event->vtime = ++group->vtime;

   tree_insert(&group->root, event);
 }

 Then every time you use insert_group(.tail=1) it goes to the end of that
 CPU's 'list'.

>>>
>>> Could you elaborate more on how to implement rotation?
>>
>> Its almost all there, but let me write a complete replacement for your
>> perf_event_group_rotate() above.
>>
>> /* find the leftmost event matching @cpu */
>> /* XXX not sure how to best parametrise a subtree search, */
>> /* again, C sucks... */
>> struct perf_event *__group_find_cpu(group, cpu)
>> {
>>  struct rb_node *node = group->tree.rb_node;
>>  struct perf_event *event, *match = NULL;
>>
>>  while (node) {
>>  event = container_of(node, struct perf_event, group_node);
>>
>>  if (cpu > event->cpu) {
>>  node = node->rb_right;
>>  } else if (cpu < event->cpu) {
>>  node = node->rb_left;
>>  } else {
>>  /*
>>   * subtree match, try left subtree for a
>>   * 'smaller' match.
>>   */
>>  match = event;
>>  node = node->rb_left;
>>  }
>>  }
>>
>>  return match;
>> }
>>
>> void perf_event_group_rotate(group, int cpu)
>> {
>>  struct perf_event *event = __group_find_cpu(cpu);
>>
>>  if (!event)
>>  return;
>>
>>  tree_delete(&group->tree, event);
>>  insert_group(group, event, 1);
>> }
>>
>> So we have a tree ordered by {cpu,vtime} and what we do is find the
>> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
>> then take it out and re-insert it with a vtime number larger than any
>> other, which places it as the rightmost entry for that cpu.
>>
>>
>> So given:
>>
>>         {1,1}
>>         /   \
>>     {0,5}   {1,2}
>>     /   \       \
>> {0,1}   {0,6}   {1,4}
>>
>>
>> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
>> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
>> resulting in something like:
>>
>>         {1,2}
>>         /   \
>>     {0,5}   {1,4}
>>     /   \       \
>> {0,1}   {0,6}   {1,7}
>>
> 
> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> may be several rotations of potentially big tree on the critical path, 
> instead of updating four pointers in case of the tree of lists.

I implemented the approach you had suggested (as I understood it),
tested it and got results that are drastically different from what 
I am getting for the tree of lists. Specifically I did:

1. keeping all groups in the same single tree by employing a 64-bit index
   additionally to CPU key;
   
2. implementing special _less() function and rotation by re-inserting
   group with incremented index;

3. replac

Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Alexey Budankov
On 07.08.2017 19:57, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
>> On 07.08.2017 18:55, Peter Zijlstra wrote:
> 
>>> In the extreme, if you construct your program such that you'll never get
>>> hit by the tick (this used to be a popular measure to hide yourself from
>>> time accounting)
>>
>> Well, some weird thing for me. Never run longer than one tick? 
>> I could imagine some I/O bound code that would quickly serve some short 
>> messages, spending all the other time waiting for incoming requests.
>> Not sure if CPU events monitoring is helpful in this case.
> 
> Like I said, in extreme. Typically its less weird.
> Another example is scheduling a very constrained counter/group along
> with a bunch of simple events such that the group will only succeed to
> schedule when its the first. In this case it will get only 1/nr_events
> time with RR, as opposed to the other/simple events that will get
> nr_counters/nr_events time.
> 
> By making it runtime based, the constrained thing will more often be
> head of list and acquire equal total runtime to the other events.

I see. And what could be the triggering condition for runtime-based scheduling 
of groups, as an alternative to the hrtimer signal?

> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Peter Zijlstra
On Mon, Aug 07, 2017 at 10:39:55AM -0700, Andi Kleen wrote:
> I'm not sure Alexey's patch kit will be able to solve every possible
> problem with the event scheduler. Trying to fix everything at 
> the same time is usually difficult. 

I didn't say he should solve this. Just said that putting everything in
a tree enables solving it.


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Andi Kleen
On Mon, Aug 07, 2017 at 06:57:11PM +0200, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
> > On 07.08.2017 18:55, Peter Zijlstra wrote:
> 
> > > In the extreme, if you construct your program such that you'll never get
> > > hit by the tick (this used to be a popular measure to hide yourself from
> > > time accounting)
> > 
> > Well, some weird thing for me. Never run longer than one tick? 
> > I could imagine some I/O bound code that would quickly serve some short 
> > messages, spending all the other time waiting for incoming requests.
> > Not sure if CPU events monitoring is helpful in this case.
> 
> Like I said, in extreme. Typically its less weird.
> 
> Another example is scheduling a very constrained counter/group along
> with a bunch of simple events such that the group will only succeed to
> schedule when its the first. In this case it will get only 1/nr_events
> time with RR, as opposed to the other/simple events that will get
> nr_counters/nr_events time.
> 
> By making it runtime based, the constrained thing will more often be
> head of list and acquire equal total runtime to the other events.

I'm not sure Alexey's patch kit will be able to solve every possible
problem with the event scheduler. Trying to fix everything at 
the same time is usually difficult. 

It would seem better to mainly focus on the scaling problem for now
(which is essentially a show stopper bug for one platform)
and then tackle other problems later once that is solved.

-Andi


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Peter Zijlstra
On Mon, Aug 07, 2017 at 07:27:30PM +0300, Alexey Budankov wrote:
> On 07.08.2017 18:55, Peter Zijlstra wrote:

> > In the extreme, if you construct your program such that you'll never get
> > hit by the tick (this used to be a popular measure to hide yourself from
> > time accounting)
> 
> Well, some weird thing for me. Never run longer than one tick? 
> I could imagine some I/O bound code that would quickly serve some short 
> messages, spending all the other time waiting for incoming requests.
> Not sure if CPU events monitoring is helpful in this case.

Like I said, in extreme. Typically its less weird.

Another example is scheduling a very constrained counter/group along
with a bunch of simple events such that the group will only succeed to
schedule when its the first. In this case it will get only 1/nr_events
time with RR, as opposed to the other/simple events that will get
nr_counters/nr_events time.

By making it runtime based, the constrained thing will more often be
head of list and acquire equal total runtime to the other events.


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Alexey Budankov
On 07.08.2017 18:55, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 06:32:16PM +0300, Alexey Budankov wrote:
>> On 07.08.2017 12:13, Peter Zijlstra wrote:
>>> On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
 On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> may be several rotations of potentially big tree on the critical path, 
> instead of updating four pointers in case of the tree of lists.

 Yes, but like said, it allows implementing a better scheduler than RR,
 allowing us to fix rotation artifacts where task runtimes are near the
 rotation window.
>>
>> Could you elaborate more on the artifacts or maybe share some link to the 
>> theory?
> 
> In the extreme, if you construct your program such that you'll never get
> hit by the tick (this used to be a popular measure to hide yourself from
> time accounting)

Well, some weird thing for me. Never run longer than one tick? 
I could imagine some I/O bound code that would quickly serve some short 
messages, spending all the other time waiting for incoming requests.
Not sure if CPU events monitoring is helpful in this case.

> , you'll never rotate the counters, even though you can
> rack up quite a lot of runtime.
> By doing a runtime based scheduler, instead of a tick based RR, we'll
> still get rotation, and the tick will only function as a forced
> reprogram point.
> 
 A slightly more complicated, but also interesting scheduling problem is
 the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
 at the same priority based on service, without strictly prioritizing the
 per-cpu events.

 Again, that is something that should be possible once we have a more
 capable event scheduler.


 So yes, cons and pros.. :-)
>>>
>>> Also, I think for AVL tree you could do the erase and (re)insert
>>> combined and then rebalance in one go, not sure RB allows the same
>>> thing, but it might be fun looking into.
>>
>> Not sure if AVL is more practical here. You get better balancing, which gives 
>> you a faster average search at the price of longer modifications, 
>> so yes, need to measure and compare ... :-)
> 
> Oh, I wasn't suggesting using AVL (the last thing we need is another
> balanced tree in the kernel), I was merely wondering if you could do
> compound/bulk updates on RB as you can with AVL.

Aww, I see.

> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Peter Zijlstra
On Mon, Aug 07, 2017 at 06:32:16PM +0300, Alexey Budankov wrote:
> On 07.08.2017 12:13, Peter Zijlstra wrote:
> > On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
> >> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> >>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> >>> may be several rotations of potentially big tree on the critical path, 
> >>> instead of updating four pointers in case of the tree of lists.
> >>
> >> Yes, but like said, it allows implementing a better scheduler than RR,
> >> allowing us to fix rotation artifacts where task runtimes are near the
> >> rotation window.
> 
> Could you elaborate more on the artifacts or maybe share some link to the 
> theory?

In the extreme, if you construct your program such that you'll never get
hit by the tick (this used to be a popular measure to hide yourself from
time accounting), you'll never rotate the counters, even though you can
rack up quite a lot of runtime.

By doing a runtime based scheduler, instead of a tick based RR, we'll
still get rotation, and the tick will only function as a forced
reprogram point.

> >> A slightly more complicated, but also interested scheduling problem is
> >> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
> >> at the same priority based on service, without strictly prioritizing the
> >> per-cpu events.
> >>
> >> Again, that is something that should be possible once we have a more
> >> capable event scheduler.
> >>
> >>
> >> So yes, cons and pros.. :-)
> > 
> > Also, I think for AVL tree you could do the erase and (re)insert
> > combined and then rebalance in one go, not sure RB allows the same
> > thing, but it might be fun looking into.
> 
> Not sure if AVL is more practical here. You get better balancing, which gives 
> you a faster average search at the price of longer modifications, 
> so yes, need to measure and compare ... :-)

Oh, I wasn't suggesting using AVL (the last thing we need is another
balanced tree in the kernel), I was merely wondering if you could do
compound/bulk updates on RB as you can with AVL.
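
To make the runtime-based idea above slightly more concrete, here is a rough
sketch of an ordering keyed by accumulated runtime instead of a rotation
counter; the ex_* names and the total_time field are hypothetical stand-ins,
not anything from the patch set:

#include <linux/rbtree.h>
#include <linux/types.h>

/* Hypothetical stand-in: only the fields needed for the ordering. */
struct ex_event {
        int             cpu;
        u64             total_time;     /* runtime accumulated while scheduled in */
        struct rb_node  group_node;
};

/*
 * Order by {cpu, accumulated runtime}: the least-serviced group of a CPU is
 * the leftmost entry for that CPU, so scheduling naturally starts with it and
 * the tick only acts as a forced reprogram point. On sched-out the group
 * would be rb_erase()d, total_time updated, and the group re-inserted with a
 * plain keyed insert (as in the vtime sketch elsewhere in this thread).
 */
static bool ex_runtime_less(struct ex_event *l, struct ex_event *r)
{
        if (l->cpu != r->cpu)
                return l->cpu < r->cpu;
        return l->total_time < r->total_time;
}

A group that received little service keeps a small key and stays near the
front, which is how the constrained group in Peter's example would catch up
with the simple events.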


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Alexey Budankov
On 07.08.2017 12:13, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
>> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
>>> Makes sense. The implementation becomes a bit simpler. The drawbacks 
>>> may be several rotations of potentially big tree on the critical path, 
>>> instead of updating four pointers in case of the tree of lists.
>>
>> Yes, but like said, it allows implementing a better scheduler than RR,
>> allowing us to fix rotation artifacts where task runtimes are near the
>> rotation window.

Could you elaborate more on the artifacts or maybe share some link to the 
theory?

>>
>> A slightly more complicated, but also interesting scheduling problem is
>> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
>> at the same priority based on service, without strictly prioritizing the
>> per-cpu events.
>>
>> Again, that is something that should be possible once we have a more
>> capable event scheduler.
>>
>>
>> So yes, cons and pros.. :-)
> 
> Also, I think for AVL tree you could do the erase and (re)insert
> combined and then rebalance in one go, not sure RB allows the same
> thing, but it might be fun looking into.

Not sure if AVL is more practical here. You get better balancing, which gives 
you a faster average search at the price of longer modifications, 
so yes, need to measure and compare ... :-)

> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Alexey Budankov
On 04.08.2017 17:53, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> 
 @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct 
 perf_event_context *ctx,
  
perf_pmu_disable(ctx->pmu);
if (is_active & EVENT_PINNED) {
 -  list_for_each_entry(event, &ctx->pinned_groups, group_entry)
 -  group_sched_out(event, cpuctx, ctx);
 +  perf_event_groups_iterate(&ctx->pinned_groups,
 +  group_sched_out_callback, &params);
>>>
>>> So here I would expect to not iterate events where event->cpu !=
>>> smp_processor_id() (and ideally not where event->pmu != ctx->pmu).
>>>
>>
>> We still need to iterate thru all groups on thread context switch in 
>> and out as well as iterate thru cpu == -1 list (software events) 
>> additionally 
>> to smp_processor_id() list from multiplexing timer interrupt handler.
> 
> Well, just doing the @cpu=-1 and @cpu=this_cpu subtrees is less work
> than iterating _everything_, right?

Right. That is actually the aim of this whole patch set - to avoid iterating 
"_everything_".

> 
> The rest will not survive event_filter_match() anyway, so iterating them
> is complete waste of time, and once we have them in a tree, its actually
> easy to find this subset.
> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Peter Zijlstra
On Mon, Aug 07, 2017 at 10:39:13AM +0200, Peter Zijlstra wrote:
> On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> > Makes sense. The implementation becomes a bit simpler. The drawbacks 
> > may be several rotations of potentially big tree on the critical path, 
> > instead of updating four pointers in case of the tree of lists.
> 
> Yes, but like said, it allows implementing a better scheduler than RR,
> allowing us to fix rotation artifacts where task runtimes are near the
> rotation window.
> 
> A slightly more complicated, but also interesting scheduling problem is
> the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
> at the same priority based on service, without strictly prioritizing the
> per-cpu events.
> 
> Again, that is something that should be possible once we have a more
> capable event scheduler.
> 
> 
> So yes, cons and pros.. :-)

Also, I think for AVL tree you could do the erase and (re)insert
combined and then rebalance in one go, not sure RB allows the same
thing, but it might be fun looking into.


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Peter Zijlstra
On Mon, Aug 07, 2017 at 10:17:46AM +0300, Alexey Budankov wrote:
> Makes sense. The implementation becomes a bit simpler. The drawbacks 
> may be several rotations of potentially big tree on the critical path, 
> instead of updating four pointers in case of the tree of lists.

Yes, but like said, it allows implementing a better scheduler than RR,
allowing us to fix rotation artifacts where task runtimes are near the
rotation window.

A slightly more complicated, but also interesting scheduling problem is
the per-cpu flexible vs the per-task flexible. Ideally we'd rotate them
at the same priority based on service, without strictly prioritizing the
per-cpu events.

Again, that is something that should be possible once we have a more
capable event scheduler.


So yes, cons and pros.. :-)


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-07 Thread Alexey Budankov
On 04.08.2017 17:36, Peter Zijlstra wrote:
> On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
>> On 03.08.2017 16:00, Peter Zijlstra wrote:
>>> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> 
 +/*
 + * Find group list by a cpu key and rotate it.
 + */
 +static void
 +perf_event_groups_rotate(struct rb_root *groups, int cpu)
 +{
 +  struct rb_node *node;
 +  struct perf_event *node_event;
 +
 +  node = groups->rb_node;
 +
 +  while (node) {
 +  node_event = container_of(node,
 +  struct perf_event, group_node);
 +
 +  if (cpu < node_event->cpu) {
 +  node = node->rb_left;
 +  } else if (cpu > node_event->cpu) {
 +  node = node->rb_right;
 +  } else {
 +  list_rotate_left(&node_event->group_list);
 +  break;
 +  }
 +  }
 +}
>>>
>>> Ah, you worry about how to rotate inside a tree?
>>
>> Exactly.
>>
>>>
>>> You can do that by adding (run)time based ordering, and you'll end up
>>> with a runtime based scheduler.
>>
>> Do you mean replacing a CPU indexed rb_tree of lists with 
>> a CPU indexed rb_tree of counter indexed rb_trees?
> 
> No, single tree, just more complicated ordering rules.
> 
>>> A trivial variant keeps a simple counter per tree that is incremented
>>> for each rotation. That should end up with the events ordered exactly
>>> like the list. And if you have that comparator like above, expressing
>>> that additional ordering becomes simple ;-)
>>>
>>> Something like:
>>>
>>> struct group {
>>>   u64 vtime;
>>>   rb_tree tree;
>>> };
>>>
>>> bool event_less(left, right)
>>> {
>>>   if (left->cpu < right->cpu)
>>> return true;
>>>
>>>   if (left->cpu > right_cpu)
>>> return false;
>>>
>>>   if (left->vtime < right->vtime)
>>> return true;
>>>
>>>   return false;
>>> }
>>>
>>> insert_group(group, event, tail)
>>> {
>>>   if (tail)
>>> event->vtime = ++group->vtime;
>>>
>>>   tree_insert(&group->root, event);
>>> }
>>>
>>> Then every time you use insert_group(.tail=1) it goes to the end of that
>>> CPU's 'list'.
>>>
>>
>> Could you elaborate more on how to implement rotation?
> 
> Its almost all there, but let me write a complete replacement for your
> perf_event_group_rotate() above.
> 
> /* find the leftmost event matching @cpu */
> /* XXX not sure how to best parametrise a subtree search, */
> /* again, C sucks... */
> struct perf_event *__group_find_cpu(group, cpu)
> {
>   struct rb_node *node = group->tree.rb_node;
>   struct perf_event *event, *match = NULL;
> 
>   while (node) {
>   event = container_of(node, struct perf_event, group_node);
> 
>   if (cpu > event->cpu) {
>   node = node->rb_right;
>   } else if (cpu < event->cpu) {
>   node = node->rb_left;
>   } else {
>   /*
>* subtree match, try left subtree for a
>* 'smaller' match.
>*/
>   match = event;
>   node = node->rb_left;
>   }
>   }
> 
>   return match;
> }
> 
> void perf_event_group_rotate(group, int cpu)
> {
>   struct perf_event *event = __group_find_cpu(cpu);
> 
>   if (!event)
>   return;
> 
>   tree_delete(&group->tree, event);
>   insert_group(group, event, 1);
> }
> 
> So we have a tree ordered by {cpu,vtime} and what we do is find the
> leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
> then take it out and re-insert it with a vtime number larger than any
> other, which places it as the rightmost entry for that cpu.
> 
> 
> So given:
> 
>         {1,1}
>         /   \
>     {0,5}   {1,2}
>     /   \       \
> {0,1}   {0,6}   {1,4}
> 
> 
> __group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
> with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
> resulting in something like:
> 
>         {1,2}
>         /   \
>     {0,5}   {1,4}
>     /   \       \
> {0,1}   {0,6}   {1,7}
> 

Makes sense. The implementation becomes a bit simpler. The drawbacks 
may be several rotations of potentially big tree on the critical path, 
instead of updating four pointers in case of the tree of lists.

> 
> 
> 
> 



Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-04 Thread Peter Zijlstra
On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
> On 03.08.2017 16:00, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:

> >> @@ -2759,13 +2932,13 @@ static void ctx_sched_out(struct 
> >> perf_event_context *ctx,
> >>  
> >>perf_pmu_disable(ctx->pmu);
> >>if (is_active & EVENT_PINNED) {
> >> -  list_for_each_entry(event, &ctx->pinned_groups, group_entry)
> >> -  group_sched_out(event, cpuctx, ctx);
> >> +  perf_event_groups_iterate(&ctx->pinned_groups,
> >> +  group_sched_out_callback, &params);
> > 
> > So here I would expect to not iterate events where event->cpu !=
> > smp_processor_id() (and ideally not where event->pmu != ctx->pmu).
> >
> 
> We still need to iterate thru all groups on thread context switch in 
> and out as well as iterate thru cpu == -1 list (software events) additionally 
> to smp_processor_id() list from multiplexing timer interrupt handler.

Well, just doing the @cpu=-1 and @cpu=this_cpu subtrees is less work
than iterating _everything_, right?

The rest will not survive event_filter_match() anyway, so iterating them
is complete waste of time, and once we have them in a tree, its actually
easy to find this subset.


Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-04 Thread Peter Zijlstra
On Thu, Aug 03, 2017 at 11:30:09PM +0300, Alexey Budankov wrote:
> On 03.08.2017 16:00, Peter Zijlstra wrote:
> > On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:

> >> +/*
> >> + * Find group list by a cpu key and rotate it.
> >> + */
> >> +static void
> >> +perf_event_groups_rotate(struct rb_root *groups, int cpu)
> >> +{
> >> +  struct rb_node *node;
> >> +  struct perf_event *node_event;
> >> +
> >> +  node = groups->rb_node;
> >> +
> >> +  while (node) {
> >> +  node_event = container_of(node,
> >> +  struct perf_event, group_node);
> >> +
> >> +  if (cpu < node_event->cpu) {
> >> +  node = node->rb_left;
> >> +  } else if (cpu > node_event->cpu) {
> >> +  node = node->rb_right;
> >> +  } else {
> >> +  list_rotate_left(&node_event->group_list);
> >> +  break;
> >> +  }
> >> +  }
> >> +}
> > 
> > Ah, you worry about how to rotate inside a tree?
> 
> Exactly.
> 
> > 
> > You can do that by adding (run)time based ordering, and you'll end up
> > with a runtime based scheduler.
> 
> Do you mean replacing a CPU indexed rb_tree of lists with 
> a CPU indexed rb_tree of counter indexed rb_trees?

No, single tree, just more complicated ordering rules.

> > A trivial variant keeps a simple counter per tree that is incremented
> > for each rotation. That should end up with the events ordered exactly
> > like the list. And if you have that comparator like above, expressing
> > that additional ordering becomes simple ;-)
> > 
> > Something like:
> > 
> > struct group {
> >   u64 vtime;
> >   rb_tree tree;
> > };
> > 
> > bool event_less(left, right)
> > {
> >   if (left->cpu < right->cpu)
> > return true;
> > 
> >   if (left->cpu > right_cpu)
> > return false;
> > 
> >   if (left->vtime < right->vtime)
> > return true;
> > 
> >   return false;
> > }
> > 
> > insert_group(group, event, tail)
> > {
> >   if (tail)
> > event->vtime = ++group->vtime;
> > 
> >   tree_insert(&group->root, event);
> > }
> > 
> > Then every time you use insert_group(.tail=1) it goes to the end of that
> > CPU's 'list'.
> > 
> 
> Could you elaborate more on how to implement rotation?

Its almost all there, but let me write a complete replacement for your
perf_event_group_rotate() above.

/* find the leftmost event matching @cpu */
/* XXX not sure how to best parametrise a subtree search, */
/* again, C sucks... */
struct perf_event *__group_find_cpu(group, cpu)
{
        struct rb_node *node = group->tree.rb_node;
        struct perf_event *event, *match = NULL;

        while (node) {
                event = container_of(node, struct perf_event, group_node);

                if (cpu > event->cpu) {
                        node = node->rb_right;
                } else if (cpu < event->cpu) {
                        node = node->rb_left;
                } else {
                        /*
                         * subtree match, try left subtree for a
                         * 'smaller' match.
                         */
                        match = event;
                        node = node->rb_left;
                }
        }

        return match;
}

void perf_event_group_rotate(group, int cpu)
{
        struct perf_event *event = __group_find_cpu(cpu);

        if (!event)
                return;

        tree_delete(&group->tree, event);
        insert_group(group, event, 1);
}

So we have a tree ordered by {cpu,vtime} and what we do is find the
leftmost {cpu} entry, that is the smallest vtime entry for that cpu. We
then take it out and re-insert it with a vtime number larger than any
other, which places it as the rightmost entry for that cpu.


So given:

        {1,1}
        /   \
    {0,5}   {1,2}
    /   \       \
{0,1}   {0,6}   {1,4}


__group_find_cpu(.cpu=1) will return {1,1} as being the leftmost entry
with cpu=1. We'll then remove it, update its vtime to 7 and re-insert.
resulting in something like:

        {1,2}
        /   \
    {0,5}   {1,4}
    /   \       \
{0,1}   {0,6}   {1,7}
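
Putting the pieces above together, here is a self-contained sketch of the
{cpu, vtime} scheme in terms of the kernel rbtree API; the ex_* names and the
simplified event structure are illustrative stand-ins, not the actual patch
code:

#include <linux/rbtree.h>
#include <linux/types.h>

struct ex_event {
        int             cpu;
        u64             vtime;
        struct rb_node  group_node;
};

struct ex_groups {
        struct rb_root  tree;
        u64             vtime;          /* bumped on every tail insert */
};

static bool ex_less(struct ex_event *l, struct ex_event *r)
{
        if (l->cpu != r->cpu)
                return l->cpu < r->cpu;
        return l->vtime < r->vtime;
}

/* Insert; with tail set the event gets the largest vtime seen so far. */
static void ex_insert(struct ex_groups *groups, struct ex_event *event, int tail)
{
        struct rb_node **p = &groups->tree.rb_node, *parent = NULL;
        struct ex_event *entry;

        if (tail)
                event->vtime = ++groups->vtime;

        while (*p) {
                parent = *p;
                entry = rb_entry(parent, struct ex_event, group_node);
                if (ex_less(event, entry))
                        p = &parent->rb_left;
                else
                        p = &parent->rb_right;
        }
        rb_link_node(&event->group_node, parent, p);
        rb_insert_color(&event->group_node, &groups->tree);
}

/* Leftmost entry for @cpu, i.e. the smallest vtime for that cpu. */
static struct ex_event *ex_first(struct ex_groups *groups, int cpu)
{
        struct rb_node *node = groups->tree.rb_node;
        struct ex_event *entry, *match = NULL;

        while (node) {
                entry = rb_entry(node, struct ex_event, group_node);
                if (cpu > entry->cpu) {
                        node = node->rb_right;
                } else if (cpu < entry->cpu) {
                        node = node->rb_left;
                } else {
                        match = entry;          /* keep going left for a smaller vtime */
                        node = node->rb_left;
                }
        }
        return match;
}

/* Rotate: requeue the least recently rotated entry of @cpu behind its peers. */
static void ex_rotate(struct ex_groups *groups, int cpu)
{
        struct ex_event *event = ex_first(groups, cpu);

        if (!event)
                return;

        rb_erase(&event->group_node, &groups->tree);
        ex_insert(groups, event, 1);
}

Each rotation here is an erase plus a re-insert, i.e. O(log n) tree fix-ups on
the critical path, which is the trade-off noted elsewhere in the thread
against updating a few pointers in the tree-of-lists variant.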






Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-03 Thread Alexey Budankov
On 03.08.2017 16:00, Peter Zijlstra wrote:
> On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
>> This patch moves event groups into an rb tree sorted by CPU, so that the 
>> multiplexing hrtimer interrupt handler would be able to skip to the current 
>> CPU's list and ignore groups allocated for the other CPUs.
>>
>> A new API for manipulating event groups in the trees is implemented, as well 
>> as adoption of the API in the current implementation.
>>
>> Because the perf_event_groups_iterate() API provides the capability to execute 
>> a callback for every event group in a tree, adoption of the API introduces
>> some code that packs and unpacks arguments of existing functions, as well as 
>> adjustments of their calling signatures,
>> e.g. ctx_pinned_sched_in(), ctx_flexible_sched_in() and inherit_task_group().
> 
> This does not speak of why we need group_list.
> 
>> Signed-off-by: Alexey Budankov 
>> ---
>>  include/linux/perf_event.h |  18 ++-
>>  kernel/events/core.c   | 389 
>> +
>>  2 files changed, 306 insertions(+), 101 deletions(-)
>>
>> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
>> index a3b873f..282f121 100644
>> --- a/include/linux/perf_event.h
>> +++ b/include/linux/perf_event.h
>> @@ -572,6 +572,20 @@ struct perf_event {
>>   */
>>  struct list_headgroup_entry;
>>  struct list_headsibling_list;
>> +/*
>> + * Node on the pinned or flexible tree located at the event context;
>> + * the node may be empty in case its event is not directly attached
>> + * to the tree but to group_list list of the event directly
>> + * attached to the tree;
>> + */
>> +struct rb_node  group_node;
>> +/*
>> + * List keeps groups allocated for the same cpu;
>> + * the list may be empty in case its event is not directly
>> + * attached to the tree but to group_list list of the event directly
>> + * attached to the tree;
>> + */
>> +struct list_headgroup_list;
>>  
>>  /*
>>   * We need storage to track the entries in perf_pmu_migrate_context; we
>> @@ -741,8 +755,8 @@ struct perf_event_context {
>>  struct mutexmutex;
>>  
>>  struct list_headactive_ctx_list;
>> -struct list_headpinned_groups;
>> -struct list_headflexible_groups;
>> +struct rb_root  pinned_groups;
>> +struct rb_root  flexible_groups;
>>  struct list_headevent_list;
>>  int nr_events;
>>  int nr_active;
>> diff --git a/kernel/events/core.c b/kernel/events/core.c
>> index 426c2ff..0a4f619 100644
>> --- a/kernel/events/core.c
>> +++ b/kernel/events/core.c
>> @@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct 
>> perf_event *event)
>>  return event_type;
>>  }
>>  
>> -static struct list_head *
>> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
>> +/*
>> + * Extract pinned or flexible groups from the context
>> + * based on event attrs bits;
>> + */
>> +static struct rb_root *
>> +get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
>>  {
>>  if (event->attr.pinned)
>>  return &ctx->pinned_groups;
>> @@ -1475,6 +1479,160 @@ ctx_group_list(struct perf_event *event, struct 
>> perf_event_context *ctx)
>>  return &ctx->flexible_groups;
>>  }
>>  
>> +static void
>> +perf_event_groups_insert(struct rb_root *groups,
>> +struct perf_event *event);
>> +
>> +static void
>> +perf_event_groups_delete(struct rb_root *groups,
>> +struct perf_event *event);
> 
> Can't we do away with these fwd declarations by simple reordering of the
> function definitions?

Ok. I will clean this up.

> 
>> +/*
>> + * Helper function to insert event into the pinned or
>> + * flexible groups;
>> + */
>> +static void
>> +add_event_to_groups(struct perf_event *event, struct perf_event_context 
>> *ctx)
>> +{
>> +struct rb_root *groups;
>> +
>> +groups = get_event_groups(event, ctx);
>> +perf_event_groups_insert(groups, event);
>> +}
>> +
>> +/*
>> + * Helper function to delete event from its groups;
>> + */
>> +static void
>> +del_event_from_groups(struct perf_event *event, struct perf_event_context 
>> *ctx)
>> +{
>> +struct rb_root *groups;
>> +
>> +groups = get_event_groups(event, ctx);
>> +perf_event_groups_delete(groups, event);
>> +}
>> +
>> +/*
>> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
>> + * is already attached to the tree then the event is added to the attached
>> + * group's group_list list.
>> + */
>> +static void
>> +perf_event_groups_insert(struct rb_root *groups,
>> +struct perf_event *event)
>> +{
>> +struct rb_node **node;
>> +struct rb_
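
The quoted hunk is cut short by the archive here. For illustration only, a
hedged sketch of what an insert helper for the tree-of-lists layout described
in the comment above might look like; the ex_* names are stand-ins
reconstructed from the surrounding description, not the patch code itself:

#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/rbtree.h>

/* Simplified stand-in mirroring the group_node/group_list fields above. */
struct ex_event {
        int                     cpu;
        struct rb_node          group_node;     /* node in the per-context tree   */
        struct list_head        group_list;     /* other groups with the same cpu */
};

/*
 * Insert keyed by cpu: if a node for this cpu already exists, queue the new
 * group on that node's group_list instead of adding a second tree node.
 */
static void ex_groups_insert(struct rb_root *groups, struct ex_event *event)
{
        struct rb_node **node = &groups->rb_node, *parent = NULL;
        struct ex_event *entry;

        while (*node) {
                parent = *node;
                entry = container_of(parent, struct ex_event, group_node);

                if (event->cpu < entry->cpu) {
                        node = &parent->rb_left;
                } else if (event->cpu > entry->cpu) {
                        node = &parent->rb_right;
                } else {
                        list_add_tail(&event->group_list, &entry->group_list);
                        return;
                }
        }

        /* First group for this cpu: it becomes the tree node and the list head. */
        INIT_LIST_HEAD(&event->group_list);
        rb_link_node(&event->group_node, parent, node);
        rb_insert_color(&event->group_node, groups);
}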

Re: [PATCH v6 1/3] perf/core: use rb trees for pinned/flexible groups

2017-08-03 Thread Peter Zijlstra
On Wed, Aug 02, 2017 at 11:13:54AM +0300, Alexey Budankov wrote:
> This patch moves event groups into an rb tree sorted by CPU, so that the 
> multiplexing hrtimer interrupt handler would be able to skip to the current 
> CPU's list and ignore groups allocated for the other CPUs.
> 
> A new API for manipulating event groups in the trees is implemented, as well 
> as adoption of the API in the current implementation.
> 
> Because the perf_event_groups_iterate() API provides the capability to execute 
> a callback for every event group in a tree, adoption of the API introduces
> some code that packs and unpacks arguments of existing functions, as well as 
> adjustments of their calling signatures,
> e.g. ctx_pinned_sched_in(), ctx_flexible_sched_in() and inherit_task_group().

This does not speak of why we need group_list.

> Signed-off-by: Alexey Budankov 
> ---
>  include/linux/perf_event.h |  18 ++-
>  kernel/events/core.c   | 389 
> +
>  2 files changed, 306 insertions(+), 101 deletions(-)
> 
> diff --git a/include/linux/perf_event.h b/include/linux/perf_event.h
> index a3b873f..282f121 100644
> --- a/include/linux/perf_event.h
> +++ b/include/linux/perf_event.h
> @@ -572,6 +572,20 @@ struct perf_event {
>*/
>   struct list_headgroup_entry;
>   struct list_headsibling_list;
> + /*
> +  * Node on the pinned or flexible tree located at the event context;
> +  * the node may be empty in case its event is not directly attached
> +  * to the tree but to group_list list of the event directly
> +  * attached to the tree;
> +  */
> + struct rb_node  group_node;
> + /*
> +  * List keeps groups allocated for the same cpu;
> +  * the list may be empty in case its event is not directly
> +  * attached to the tree but to group_list list of the event directly
> +  * attached to the tree;
> +  */
> + struct list_headgroup_list;
>  
>   /*
>* We need storage to track the entries in perf_pmu_migrate_context; we
> @@ -741,8 +755,8 @@ struct perf_event_context {
>   struct mutexmutex;
>  
>   struct list_headactive_ctx_list;
> - struct list_headpinned_groups;
> - struct list_headflexible_groups;
> + struct rb_root  pinned_groups;
> + struct rb_root  flexible_groups;
>   struct list_headevent_list;
>   int nr_events;
>   int nr_active;
> diff --git a/kernel/events/core.c b/kernel/events/core.c
> index 426c2ff..0a4f619 100644
> --- a/kernel/events/core.c
> +++ b/kernel/events/core.c
> @@ -1466,8 +1466,12 @@ static enum event_type_t get_event_type(struct 
> perf_event *event)
>   return event_type;
>  }
>  
> -static struct list_head *
> -ctx_group_list(struct perf_event *event, struct perf_event_context *ctx)
> +/*
> + * Extract pinned or flexible groups from the context
> + * based on event attrs bits;
> + */
> +static struct rb_root *
> +get_event_groups(struct perf_event *event, struct perf_event_context *ctx)
>  {
>   if (event->attr.pinned)
>   return &ctx->pinned_groups;
> @@ -1475,6 +1479,160 @@ ctx_group_list(struct perf_event *event, struct 
> perf_event_context *ctx)
>   return &ctx->flexible_groups;
>  }
>  
> +static void
> +perf_event_groups_insert(struct rb_root *groups,
> + struct perf_event *event);
> +
> +static void
> +perf_event_groups_delete(struct rb_root *groups,
> + struct perf_event *event);

Can't we do away with these fwd declarations by simple reordering of the
function definitions?

> +/*
> + * Helper function to insert event into the pinned or
> + * flexible groups;
> + */
> +static void
> +add_event_to_groups(struct perf_event *event, struct perf_event_context *ctx)
> +{
> + struct rb_root *groups;
> +
> + groups = get_event_groups(event, ctx);
> + perf_event_groups_insert(groups, event);
> +}
> +
> +/*
> + * Helper function to delete event from its groups;
> + */
> +static void
> +del_event_from_groups(struct perf_event *event, struct perf_event_context 
> *ctx)
> +{
> + struct rb_root *groups;
> +
> + groups = get_event_groups(event, ctx);
> + perf_event_groups_delete(groups, event);
> +}
> +
> +/*
> + * Insert a group into a tree using event->cpu as a key. If event->cpu node
> + * is already attached to the tree then the event is added to the attached
> + * group's group_list list.
> + */
> +static void
> +perf_event_groups_insert(struct rb_root *groups,
> + struct perf_event *event)
> +{
> + struct rb_node **node;
> + struct rb_node *parent;
> + struct perf_event *node_event;
> +
> + node = &groups->rb_node;
> + parent = *node;
> +
> + while (*node) {
> + parent = *n