Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-17 Thread Thomas Gleixner
On Tue, 7 Feb 2017, Stephane Eranian wrote:
> 
> I think the design must ensure that the following usage models can be 
> monitored:
>- the allocations in your CAT partitions
>- the allocations from a task (inclusive of children tasks)
>- the allocations from a group of tasks (inclusive of children tasks)
>- the allocations from a CPU
>- the allocations from a group of CPUs

What's missing here is:

 - the allocations of a subset of users (tasks/groups/cpu(s)) of a
   particular CAT partition

Looking at your requirement list, all requirements, except the first point,
have no relationship to CAT (at least not from your write up). Now the
obvious questions are:

 - Does it make sense to ignore CAT relations in these sets?

 - Does it make sense to monitor a task / group of tasks, where the tasks
   belong to different CAT partitions?

 - Does it make sense to monitor a CPU / group of CPUs as a whole
   independent of which CAT partitions have been utilized during the
   monitoring period?

I don't think it makes any sense, unless the resulting information is split
up into CAT partitions.

I'm happy to be educated on the value of making this CAT-unaware, but so
far I have only come up with results which need a crystal ball to analyze.

Thanks,

tglx



Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-08 Thread Stephane Eranian
Tony,

On Tue, Feb 7, 2017 at 10:52 AM, Luck, Tony  wrote:
> On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
>> Hi,
>>
>> I wanted to take a few steps back and look at the overall goals for
>> cache monitoring.
>> From the various threads and discussion, my understanding is as follows.
>>
>> I think the design must ensure that the following usage models can be 
>> monitored:
>>- the allocations in your CAT partitions
>>- the allocations from a task (inclusive of children tasks)
>>- the allocations from a group of tasks (inclusive of children tasks)
>>- the allocations from a CPU
>>- the allocations from a group of CPUs
>>
>> All cases but first one (CAT) are natural usage. So I want to describe
>> the CAT in more details.
>> The goal, as I understand it, it to monitor what is going on inside
>> the CAT partition to detect
>> whether it saturates or if it has room to "breathe". Let's take a
>> simple example.
>
> By "natural usage" you mean "like perf(1) provides for other events"?
>
Yes, people are used to monitoring events per task or per CPU. In that
sense, it is the common usage model. Cgroup monitoring is a derivative
of per-cpu mode.

> But we are trying to figure out requirements here ... what data do people
> need to manage caches and memory bandwidth.  So from this perspective
> monitoring a CAT group is a natural first choice ... did we provision
> this group with too much, or too little cache.
>
I am not saying CAT is not natural. I am saying it is a justified requirement,
but a new one, so we need to make sure it is understood and that the kernel
tracks CAT partitions and CAT-partition cache-occupancy monitoring in a
consistent way.

> From that starting point I can see that a possible next step when
> finding that a CAT group has too small a cache is to drill down to
> find out how the tasks in the group are using cache.  Armed with that
> information you could move tasks that hog too much cache (and are believed
> to be streaming through memory) into a different CAT group.
>
This is a valid usage model. But there are people who care about monitoring
occupancy and do not necessarily use CAT partitions. Even in this case, the
occupancy data is still very useful to gauge the cache footprint of a workload,
so this usage model should not be discounted.

> What I'm not seeing is how drilling to CPUs helps you.
>
Looking for imbalance, for instance.
Are all the allocations done from only a subset of the CPUs?

> Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
> shows that 75% of the cache occupancy is attributed to CPU0, and only
> 25% to CPU1.  What can you do with this information to improve things?
> If it is deemed too complex (from a kernel code perspective) to
> implement per-CPU reporting how bad a loss would that be?
>
It is okay to first focus on per-task and per-CAT-partition monitoring. What I'd
like to see is an API that could be extended later on to a per-CPU-only mode. I am
okay with having only per-CAT and per-task groups initially to keep things simpler,
but the rsrcfs interface should allow extension to a per-CPU-only mode. Then the
kernel implementation would take care of allocating the RMIDs accordingly. The key
is always to ensure allocations can be tracked since the inception of the group,
be it CAT, tasks, or CPUs.
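
[Editor's note: as a minimal illustration of that last point, here is a sketch in
which the RMID is claimed when the group is created, whatever its type, so that
monitoring covers the group from its inception. All names and the toy allocator
are hypothetical, not the actual resctrl code.]

  #include <stdint.h>
  #include <stdlib.h>

  enum grp_type { GRP_CAT, GRP_TASKS, GRP_CPUS };

  struct mon_grp {
          enum grp_type type;
          uint32_t rmid;
  };

  /* Toy RMID allocator standing in for whatever the kernel would really do. */
  static uint32_t next_rmid = 1;
  static int alloc_rmid(void)
  {
          return next_rmid < 256 ? (int)next_rmid++ : -1;
  }

  /* The RMID is claimed when the group is created, not when a tool attaches. */
  static struct mon_grp *mon_grp_create(enum grp_type type)
  {
          struct mon_grp *grp = calloc(1, sizeof(*grp));
          int rmid;

          if (!grp)
                  return NULL;
          rmid = alloc_rmid();
          if (rmid < 0) {
                  free(grp);
                  return NULL;
          }
          grp->type = type;
          grp->rmid = (uint32_t)rmid;
          return grp;
  }

Group creation could of course fail if RMIDs are exhausted; the point is only
that the RMID's lifetime is tied to the group's.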

> -Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-08 Thread Peter Zijlstra
On Fri, Jan 20, 2017 at 12:11:53PM -0800, David Carrillo-Cisneros wrote:
> Implementation ideas:
> 
> First idea is to expose one monitoring file per resource in a CTRLGRP,
> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
> monitor_l3_0, monitor_l3_1, ...
> 
> the monitor_ file descriptor is passed to perf_event_open
> in the way cgroup file descriptors are passed now. All events to the
> same (CTRLGRP,resource_id) share RMID.
> 
> The RMID allocation part can either be handled by RDT Allocation or by
> the RDT Monitoring PMU. Either ways, the existence of PMU's
> perf_events allocates/releases the RMID.

So I've had complaints about exactly that behaviour. Someone wanted
RMIDs assigned (and measurement started) the moment the grouping got
created / tasks started running, etc.

So I think the design should also explicitly state how this is supposed
to be handled and not left as an implementation detail.
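
[Editor's note: for reference, the fd-passing scheme the quoted proposal
describes could look roughly like the sketch below. The monitor_l3_0 path and
the reuse of PERF_FLAG_PID_CGROUP for a resctrl fd are assumptions taken from
the proposal, not an existing ABI.]

  #include <fcntl.h>
  #include <linux/perf_event.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /*
   * Hypothetical sketch of the proposal quoted above: open a per-resource
   * monitoring file in a CTRLGRP and hand its fd to perf_event_open() the
   * way a cgroup fd is handed over today (fd in the pid slot, cgroup-style
   * flag). The monitor_l3_0 file and the flag reuse are assumptions.
   */
  static int open_rdt_monitor_event(struct perf_event_attr *attr, int cpu)
  {
          int mon_fd = open("/sys/fs/resctrl/grp1/monitor_l3_0", O_RDONLY);

          if (mon_fd < 0)
                  return -1;

          return (int)syscall(__NR_perf_event_open, attr, mon_fd, cpu,
                              -1 /* group_fd */, PERF_FLAG_PID_CGROUP);
  }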


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-08 Thread Peter Zijlstra
On Fri, Jan 20, 2017 at 03:51:48PM -0800, Shivappa Vikas wrote:
> I think the email thread is going very long and we should just meet f2f
> probably next week to iron out the requirements and chalk out a design
> proposal.

The thread isn't the problem; you lot not trimming your emails is
however.


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-07 Thread Shivappa Vikas



On Tue, 7 Feb 2017, Stephane Eranian wrote:


> Hi,
>
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
> From the various threads and discussion, my understanding is as follows.
>
> I think the design must ensure that the following usage models can be monitored:
>   - the allocations in your CAT partitions
>   - the allocations from a task (inclusive of children tasks)
>   - the allocations from a group of tasks (inclusive of children tasks)
>   - the allocations from a CPU
>   - the allocations from a group of CPUs
>
> All cases but first one (CAT) are natural usage. So I want to describe
> the CAT in more details.
> The goal, as I understand it, it to monitor what is going on inside
> the CAT partition to detect
> whether it saturates or if it has room to "breathe". Let's take a
> simple example.
>
> Suppose, we have a CAT group, cat1:
>
> cat1: 20MB partition (CLOSID1)
>    CPUs=CPU0,CPU1
>    TASKs=PID20
>
> There can only be one CLOSID active on a CPU at a time. The kernel
> chooses to prioritize tasks over CPU when enforcing cases with multiple
> CLOSIDs.
>
> Let's review how this works for cat1 and for each scenario look at how
> the kernel enforces or not the cache partition:
>
> 1. ENFORCED: PIDx with no CLOSID runs on CPU0 or CPU1
> 2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
> 3. ENFORCED: PID20 runs with CLOSID1 on CPU0, CPU1
> 4. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
> 5. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID
>
> Now, let's review how we could track the allocations done in cat1 using a single
> RMID. There can only be one RMID active at a time per CPU. The kernel
> chooses to prioritize tasks over CPU:
>
> cat1: 20MB partition (CLOSID1, RMID1)
>    CPUs=CPU0,CPU1
>    TASKs=PID20
>
> 1. MONITORED: PIDx with no RMID runs on CPU0 or CPU1
> 2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
> 3. MONITORED: PID20 with RMID1 runs on CPU0, CPU1
> 4. MONITORED: PID20 with RMD1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
> 5. MONITORED: PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID
>
> To make sense to a user, the cases where the hardware monitors MUST be
> the same as the cases where the hardware enforces the cache
> partitioning.
>
> Here we see that it works using a single RMID.
>
> However doing so limits certain monitoring modes where a user might want to
> get a breakdown per CPU of the allocations, such as with:
>  $ perf stat -a -A -e llc_occupancy -R cat1
> (where -R points to the monitoring group in rsrcfs). Here this mode would not be
> possible because the two CPUs in the group share the same RMID.


In the requirements here: https://marc.info/?l=linux-kernel=148597969808732

  8) Can get measurements for subsets of tasks in a CAT group (to find the
     guys hogging the resources).

This should also apply to subsets of CPUs.

That would let you monitor a set of CPUs that is a subset of, or different from, a
CAT group. That should let you create mon groups like in the second example you
mention, along with the control groups above.


mon0: RMID0
CPUs=CPU0

mon1: RMID1
CPUs=CPU1

mon2: RMID2
CPUs=CPU2

...




> Now let's take another scenario, and suppose you have two monitoring groups
> as follows:
>
> mon1: RMID1
>    CPUs=CPU0,CPU1
> mon2: RMID2
>    TASKS=PID20
>
> If PID20 runs on CP0, then RMID2 is activated, and thus allocations
> done by PID20 are not counted towards RMID1. There is a blind spot.
>
> Whether or not this is a problem depends on the semantic exported by
> the interface for CPU mode:
>   1-Count all allocations from any tasks running on CPU
>   2-Count all allocations from tasks which are NOT monitoring themselves
>
> If the kernel choses 1, then there is a blind spot and the measurement
> is not as accurate as it could be because of the decision to use only one RDMID.
> But if the kernel choses 2, then everything works fine with a single RMID.
>
> If the kernel treats occupancy monitoring as measuring cycles on a CPU, i.e.,
> measure any activity from any thread (choice 1), then the single RMID per group
> does not work.
>
> If the kernel treats occupancy monitoring as measuring cycles in a cgroup on a
> CPU, i.e., measures only when threads of the cgroup run on that CPU, then using
> a single RMID per group works.



Agreed, there are blind spots in both. But the requirements are trying to be based
on the resctrl allocation, as Thomas suggested, which is aligned with monitoring
real-time tasks as I understand it.

For the above example, some tasks which do not have an RMID (say, in the root
group) are real-time tasks that are specially configured to run on a CPUx which
needs to be allocated or monitored.




> Hope this helps clarifies the usage model and design choices.



Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-07 Thread Luck, Tony
On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
> Hi,
> 
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
> From the various threads and discussion, my understanding is as follows.
> 
> I think the design must ensure that the following usage models can be 
> monitored:
>- the allocations in your CAT partitions
>- the allocations from a task (inclusive of children tasks)
>- the allocations from a group of tasks (inclusive of children tasks)
>- the allocations from a CPU
>- the allocations from a group of CPUs
> 
> All cases but first one (CAT) are natural usage. So I want to describe
> the CAT in more details.
> The goal, as I understand it, it to monitor what is going on inside
> the CAT partition to detect
> whether it saturates or if it has room to "breathe". Let's take a
> simple example.

By "natural usage" you mean "like perf(1) provides for other events"?

But we are trying to figure out requirements here ... what data do people
need to manage caches and memory bandwidth.  So from this perspective
monitoring a CAT group is a natural first choice ... did we provision
this group with too much, or too little cache.

From that starting point I can see that a possible next step when
finding that a CAT group has too small a cache is to drill down to
find out how the tasks in the group are using cache.  Armed with that
information you could move tasks that hog too much cache (and are believed
to be streaming through memory) into a different CAT group.

What I'm not seeing is how drilling to CPUs helps you.

Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
shows that 75% of the cache occupancy is attributed to CPU0, and only
25% to CPU1.  What can you do with this information to improve things?
If it is deemed too complex (from a kernel code perspective) to
implement per-CPU reporting how bad a loss would that be?

-Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-07 Thread Stephane Eranian
Hi,

I wanted to take a few steps back and look at the overall goals for
cache monitoring.
From the various threads and discussion, my understanding is as follows.

I think the design must ensure that the following usage models can be monitored:
   - the allocations in your CAT partitions
   - the allocations from a task (inclusive of children tasks)
   - the allocations from a group of tasks (inclusive of children tasks)
   - the allocations from a CPU
   - the allocations from a group of CPUs

All cases but the first one (CAT) are natural usage, so I want to describe
the CAT case in more detail. The goal, as I understand it, is to monitor
what is going on inside the CAT partition to detect whether it saturates
or if it has room to "breathe". Let's take a simple example.

Suppose we have a CAT group, cat1:

cat1: 20MB partition (CLOSID1)
CPUs=CPU0,CPU1
TASKs=PID20

There can only be one CLOSID active on a CPU at a time. The kernel
chooses to prioritize tasks over CPU when enforcing cases with multiple
CLOSIDs.

Let's review how this works for cat1 and for each scenario look at how
the kernel enforces or not the cache partition:

 1. ENFORCED: PIDx with no CLOSID runs on CPU0 or CPU1
 2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
 3. ENFORCED: PID20 runs with CLOSID1 on CPU0, CPU1
 4. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
 5. ENFORCED: PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID

Now, let's review how we could track the allocations done in cat1 using a single
RMID. There can only be one RMID active at a time per CPU. The kernel
chooses to prioritize tasks over CPU:

cat1: 20MB partition (CLOSID1, RMID1)
CPUs=CPU0,CPU1
TASKs=PID20

 1. MONITORED: PIDx with no RMID runs on CPU0 or CPU1
 2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
 3. MONITORED: PID20 with RMID1 runs on CPU0, CPU1
 4. MONITORED: PID20 with RMID1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
 5. MONITORED: PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID

To make sense to a user, the cases where the hardware monitors MUST be
the same as the cases where the hardware enforces the cache
partitioning.

Here we see that it works using a single RMID.

However doing so limits certain monitoring modes where a user might want to
get a breakdown per CPU of the allocations, such as with:
  $ perf stat -a -A -e llc_occupancy -R cat1
(where -R points to the monitoring group in rsrcfs). Here this mode would not be
possible because the two CPUs in the group share the same RMID.

Now let's take another scenario, and suppose you have two monitoring groups
as follows:

mon1: RMID1
CPUs=CPU0,CPU1
mon2: RMID2
TASKS=PID20

If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
done by PID20 are not counted towards RMID1. There is a blind spot.

Whether or not this is a problem depends on the semantic exported by
the interface for CPU mode:
   1-Count all allocations from any tasks running on CPU
   2-Count all allocations from tasks which are NOT monitoring themselves

If the kernel chooses 1, then there is a blind spot and the measurement
is not as accurate as it could be because of the decision to use only one RMID.
But if the kernel chooses 2, then everything works fine with a single RMID.

If the kernel treats occupancy monitoring as measuring cycles on a CPU, i.e.,
measure any activity from any thread (choice 1), then the single RMID per group
does not work.

If the kernel treats occupancy monitoring as measuring cycles in a cgroup on a
CPU, i.e., measures only when threads of the cgroup run on that CPU, then using
a single RMID per group works.
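
[Editor's note: to make the two semantics concrete, here is a small illustrative
sketch. The helper names are hypothetical, not kernel code; an RMID of 0 means
"none assigned".]

  #include <stdint.h>

  /* Semantics 1: the CPU group counts every task that runs on the CPU. */
  static uint32_t pick_rmid_count_all(uint32_t task_rmid, uint32_t cpu_rmid)
  {
          return cpu_rmid ? cpu_rmid : task_rmid;
  }

  /* Semantics 2: tasks that monitor themselves escape the CPU group's RMID. */
  static uint32_t pick_rmid_tasks_win(uint32_t task_rmid, uint32_t cpu_rmid)
  {
          return task_rmid ? task_rmid : cpu_rmid;
  }

With the task-over-CPU priority described above, only the second helper matches
what the hardware would actually count, which is why choice 2 works with a
single RMID while choice 1 leaves a blind spot.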

Hope this helps clarify the usage model and design choices.


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread David Carrillo-Cisneros
On Mon, Feb 6, 2017 at 3:27 PM, Luck, Tony  wrote:
>> cgroup mode gives a per-CPU breakdown of event and running time, the
>> tool aggregates it into running time vs event count. Both per-cpu
>> breakdown and the aggregate are useful.
>>
>> Piggy-backing on perf's cgroup mode would give us all the above for free.
>
> Do you have some sample output from a perf run on a cgroup measuring a
> "normal" event showing what you get?

# perf stat -I 1000 -e cycles -a -C 0-1 -A -x, -G /
 1.000116648,CPU0,20677864,,cycles,/
 1.000169948,CPU1,24760887,,cycles,/
 2.000453849,CPU0,36120862,,cycles,/
 2.000480259,CPU1,12535575,,cycles,/
 3.000664762,CPU0,7564504,,cycles,/
 3.000692552,CPU1,7307480,,cycles,/

>
> I think that requires that we still go through perf ->start() and ->stop() 
> functions
> to know how much time we spent running.  I thought we were looking at bundling
> the RMID updates into the same spot in sched() where we switch the CLOSID.
> More or less at the "start" point, but there is no "stop".  If we are 
> switching between
> runnable processes, it amounts to pretty much the same thing ... except we 
> bill
> to someone all the time instead of having a gap in the context switch where we
> stopped billing to the old task and haven't started billing to the new one 
> yet.

Another problem is that it will require a perf event all the time for
timing measurements to be consistent with RMID measurements.

The only sane option I can come up with is to do timing in RDT the way perf
cgroup does it (keep a per-CPU time that increases with the local clock's
delta). A reader can then add the times for all CPUs in cpu_mask.
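
[Editor's note: a minimal sketch of that per-CPU timing scheme, with invented
names and bounds; this is not the perf cgroup implementation itself.]

  #include <stdint.h>

  #define NR_CPUS_SKETCH 64       /* arbitrary bound for this sketch */

  static uint64_t grp_time[NR_CPUS_SKETCH];   /* accumulated ns per CPU */
  static uint64_t grp_in[NR_CPUS_SKETCH];     /* local clock at sched-in */

  /* Called with the CPU's local clock when the group is scheduled in/out. */
  static void grp_sched_in(int cpu, uint64_t now)
  {
          grp_in[cpu] = now;
  }

  static void grp_sched_out(int cpu, uint64_t now)
  {
          grp_time[cpu] += now - grp_in[cpu];
  }

  /* Reader side: total enabled time is the sum over the group's cpu_mask. */
  static uint64_t grp_total_time(const int *cpus, int ncpus)
  {
          uint64_t total = 0;

          for (int i = 0; i < ncpus; i++)
                  total += grp_time[cpus[i]];
          return total;
  }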

>
> But if we idle ... then we don't "stop".  Shouldn't matter much from a 
> measurement
> perspective because idle won't use cache or consume bandwidth. But we'd count
> that time as "on cpu" for the last process to run.

I may be missing something basic, but isn't __switch_to called when
switching to the idle task? That will update the CLOSID and RMID to
whatever the idle task is on, isn't it?

Thanks,
David


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread Luck, Tony
> cgroup mode gives a per-CPU breakdown of event and running time, the
> tool aggregates it into running time vs event count. Both per-cpu
> breakdown and the aggregate are useful.
>
> Piggy-backing on perf's cgroup mode would give us all the above for free.

Do you have some sample output from a perf run on a cgroup measuring a
"normal" event showing what you get?

I think that requires that we still go through perf ->start() and ->stop()
functions to know how much time we spent running.  I thought we were looking at
bundling the RMID updates into the same spot in sched() where we switch the
CLOSID.  More or less at the "start" point, but there is no "stop".  If we are
switching between runnable processes, it amounts to pretty much the same thing
... except we bill to someone all the time instead of having a gap in the
context switch where we stopped billing to the old task and haven't started
billing to the new one yet.

But if we idle ... then we don't "stop".  Shouldn't matter much from a
measurement perspective because idle won't use cache or consume bandwidth.
But we'd count that time as "on cpu" for the last process to run.

-Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread David Carrillo-Cisneros
On Mon, Feb 6, 2017 at 1:22 PM, Luck, Tony  wrote:
>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of bytes
> read is 1GB at some point, and one second later the total bytes is 2GB, then
> it is clear that the average bandwidth for this process is 1GB/s. If you know
> that the task was only running for 50% of the cycles during that 1s interval,
> you could say that it is doing 2GB/s ... which is I believe what you were
> thinking when you wrote #12 above.

Yes, that's one of the cases.

> But whether that is right depends a
> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
> scheduler ... then it may have been trying to be a 2GB/s app. But if it
> was waiting for packets from the network, then it really is using 1 GB/s.

IMO, "right" means that measured bandwidth and running time are
correct. The *why* is a bigger question.

>
> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.

Cgroup mode gives a per-CPU breakdown of event count and running time; the
tool aggregates it into running time vs. event count. Both the per-CPU
breakdown and the aggregate are useful.

Piggy-backing on perf's cgroup mode would give us all the above for free.

>
> -Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread David Carrillo-Cisneros
On Mon, Feb 6, 2017 at 1:36 PM, Shivappa Vikas  wrote:
>
>
> On Mon, 6 Feb 2017, Luck, Tony wrote:
>
>>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>>> should provide total_time_enabled in the way perf does, otherwise is
>>> hard to interpret MBM values.
>>
>>
>> It seems that it is hard to define what we even mean by memory bandwidth.
>>
>> If you are measuring just one task and you find that the total number of
>> bytes
>> read is 1GB at some point, and one second later the total bytes is 2GB,
>> then
>> it is clear that the average bandwidth for this process is 1GB/s. If you
>> know
>> that the task was only running for 50% of the cycles during that 1s
>> interval,
>> you could say that it is doing 2GB/s ... which is I believe what you were
>> thinking when you wrote #12 above.  But whether that is right depends a
>> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
>> scheduler ... then it may have been trying to be a 2GB/s app. But if it
>> was waiting for packets from the network, then it really is using 1 GB/s.
>
>
> Is the requirement is to have both enabled and run time or just enabled time
> (enabled time must be easy to report - just the wall time from start trace
> to end trace)?

Both, but since the original requirements dropped rotation,
total_running == total_enabled.

>
> This is not reported correctly in the upstream perf cqm and for
> cgroup -C we dont report it either (since we report the package).

Using the -x option shows the run time and the % enabled. Many tools
use that CSV output.

>
> Thanks,
> Vikas
>
>
>>
>> All bets are off if you are measuring a service that consists of several
>> tasks running concurrently. All you can really talk about is the aggregate
>> average bandwidth (total bytes / wall-clock time). It makes no sense to
>> try and factor in how much cpu time each of the individual tasks got.
>>
>> -Tony
>>
>


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread Shivappa Vikas



On Mon, 6 Feb 2017, Luck, Tony wrote:


>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of bytes
> read is 1GB at some point, and one second later the total bytes is 2GB, then
> it is clear that the average bandwidth for this process is 1GB/s. If you know
> that the task was only running for 50% of the cycles during that 1s interval,
> you could say that it is doing 2GB/s ... which is I believe what you were
> thinking when you wrote #12 above.  But whether that is right depends a
> bit on *why* it only ran 50% of the time. If it was time-sliced out by the
> scheduler ... then it may have been trying to be a 2GB/s app. But if it
> was waiting for packets from the network, then it really is using 1 GB/s.


Is the requirement to have both enabled and run time, or just enabled time
(enabled time must be easy to report - just the wall time from start trace to
end trace)?

This is not reported correctly in the upstream perf cqm, and for
cgroup -C we don't report it either (since we report the package).

Thanks,
Vikas



> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.
>
> -Tony



RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread Luck, Tony
> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

It seems that it is hard to define what we even mean by memory bandwidth.

If you are measuring just one task and you find that the total number of bytes
read is 1GB at some point, and one second later the total bytes is 2GB, then
it is clear that the average bandwidth for this process is 1GB/s. If you know
that the task was only running for 50% of the cycles during that 1s interval,
you could say that it is doing 2GB/s ... which is I believe what you were
thinking when you wrote #12 above.  But whether that is right depends a
bit on *why* it only ran 50% of the time. If it was time-sliced out by the
scheduler ... then it may have been trying to be a 2GB/s app. But if it
was waiting for packets from the network, then it really is using 1 GB/s.
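
[Editor's note: worked out as a standalone snippet, using just the numbers from
the example above (1GB of new traffic over a 1s window with 50% on-CPU time).]

  #include <stdint.h>
  #include <stdio.h>

  int main(void)
  {
          uint64_t delta_bytes = 1ull << 30;  /* bytes counted in the window */
          double enabled_secs = 1.0;          /* wall-clock window */
          double running_secs = 0.5;          /* time the task actually ran */

          printf("wall-clock bandwidth:    %.1f GB/s\n",
                 (double)delta_bytes / enabled_secs / (1 << 30));
          printf("while-running bandwidth: %.1f GB/s\n",
                 (double)delta_bytes / running_secs / (1 << 30));
          return 0;
  }

Which of those two numbers is "the" bandwidth is exactly the ambiguity
discussed here.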

All bets are off if you are measuring a service that consists of several
tasks running concurrently. All you can really talk about is the aggregate
average bandwidth (total bytes / wall-clock time). It makes no sense to
try and factor in how much cpu time each of the individual tasks got.

-Tony


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-06 Thread Luck, Tony
Digging through the e-mails from last week to generate a new version
of the requirements I looked harder at this:

> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

This looks tricky if we are piggy-backing on the CAT code to switch
RMID along with CLOSID at context switch time.  We could get an
approximation by adding:

        if (newRMID != oldRMID) {
                now = grab current time in some format
                atomic_add(rmid_enabled_time[oldRMID], now - this_cpu_read(rmid_time));
                this_cpu_write(rmid_time, now);
        }

but:

1) that would only work on a single socket machine (we'd really want
   rmid_enabled_time separately for each socket)
2) when we want to read that enabled time, we'd really need to add time for all
   the threads currently running on CPUs across the system since we last
   switched RMID
3) reading the time and doing atomic ops in context switch code won't be popular

:-(
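
[Editor's note: a per-socket variant of the sketch above could look roughly like
the following; the bounds and names are invented, and rmid_time is really a
per-CPU variable. It would address point 1), but points 2) and 3) remain.]

  #include <stdint.h>

  #define NR_SOCKETS_SKETCH 4     /* illustrative bounds only */
  #define NR_RMIDS_SKETCH   256

  /* Enabled time kept per (socket, RMID) instead of per RMID only. */
  static uint64_t rmid_enabled_time[NR_SOCKETS_SKETCH][NR_RMIDS_SKETCH];
  static uint64_t rmid_time;      /* really a per-CPU variable; simplified here */

  static void account_rmid_switch(int socket, uint32_t old_rmid, uint64_t now)
  {
          rmid_enabled_time[socket][old_rmid] += now - rmid_time;
          rmid_time = now;
  }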

-Tony




Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-03 Thread Luck, Tony
On Fri, Feb 03, 2017 at 01:08:05PM -0800, David Carrillo-Cisneros wrote:
> On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony  wrote:
> > On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> >> If we tie allocation groups and monitoring groups, we are tying the
> >> meaning of CPUs and we'll have to choose between the CAT meaning or
> >> the perf meaning.
> >>
> >> Let's allow semantics that will allow perf like monitoring to
> >> eventually work, even if its not immediately supported.
> >
> > Would it work to make monitor groups be "task list only" or "cpu mask only"
> > (unlike control groups that allow mixing).
> 
> That works, but please don't use chmod. Make it explicit by the group
> position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

I had been thinking that after writing a PID to "tasks" we'd disallow
writes to "cpus". But it sounds nicer for the user to declare their
intention upfront. Counter proposal in the naming war:

.../monitor/bytask/{groupname}
.../monitor/bycpu/{groupname}

> > Then the intel_rdt_sched_in() code could pick the RMID in ways that
> > give you the perf(1) meaning. I.e. if you create a monitor group and assign
> > some CPUs to it, then we will always load the RMID for that monitor group
> > when running on those cpus, regardless of what group(s) the current process
> > belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
> > assign RMID using same rules as CLOSID (so measurements from a control group
> > would track allocation policies).
> 
> I think that's very confusing for the user. A group's observed
> behavior should be determined by its attributes and not change
> depending on how other groups are configured. Think on multiple users
> monitoring simultaneously.
> 
> >
> > We are already planning that creating monitor only groups will change
> > what is reported in the control group (e.g. you pull some tasks out of
> > the control group to monitor them separately, so the control group only
> > reports the tasks that you didn't move out for monitoring).
> 
> That's also confusing, and the work-around that Vikas proposed of two
> separate files to enumerate tasks (one for control and one for
> monitoring) breaks the concept of a task group.

There are some simple cases where we can make the data shown in the
original control group look the same. E.g. we move a few tasks over to a
/bytask/ group (or several groups if we want a very fine breakdown) and
then have the report from the control group sum the RMIDs from the monitor
groups and add to the total from the native RMID of the control group.

But this falls apart if the user asks a single monitor group to monitor
tasks from multiple control groups.  Perhaps we could disallow this
(when we assign the first task to a monitor group, capture the CLOSID
and then only allow other tasks with the same CLOSID to be added ... unless
the group becomes empty, at which point we can latch onto a new CLOSID).

/bycpu/ monitoring is very resource intensive if we have to preserve
the control group reports. We'd need to allocate MAXCLOSID[1] RMIDs for
each group so that we can keep separate counts for tasks from each
control group that run on our CPUs and then sum them to report the
/bycpu/ data (instead of just one RMID, and no math).  This also
puts more memory references into the sched_in path while we
figure out which RMID to load into PQR_ASSOC.
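
To picture that bookkeeping, the read side would be something like the
sketch below (user-space model, invented names; rmid_read() stands in for
the IA32_QM_EVTSEL/IA32_QM_CTR access and just returns a placeholder here):

#include <stdint.h>

#define MAX_CLOSIDS 16                    /* CAT classes of service */

/*
 * A /bycpu/ monitor group would carry one RMID per CLOSID so that the
 * control groups' own reports can still be broken out per class of
 * service.
 */
struct bycpu_group {
    uint32_t rmid[MAX_CLOSIDS];   /* RMID loaded when CLOSID i runs here */
};

/* stand-in for reading one RMID's counter */
static uint64_t rmid_read(uint32_t rmid)
{
    (void)rmid;
    return 0;                             /* placeholder value */
}

/*
 * Reporting the /bycpu/ group means summing MAX_CLOSIDS counters
 * instead of reading a single RMID.  The reads are not atomic with
 * respect to each other, so the summed report can be a little off
 * while the counts are moving.
 */
uint64_t bycpu_group_report(const struct bycpu_group *grp)
{
    uint64_t total = 0;

    for (int i = 0; i < MAX_CLOSIDS; i++)
        total += rmid_read(grp->rmid[i]);
    return total;
}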

I'd want to warn the user in the Documentation that splitting off
too many monitor groups from a control group will result in less
than stellar accuracy in reporting as the kernel cannot read
multiple RMIDs atomically and data is changing between reads.

> I know the present implementation scope is limited, so you could:
>   - support 1) and/or 2) only
>   - do a simple RMID management (e.g. same RMID all packages, allocate
> RMID on creation or fail)
>   - do the custom fs based tool that Vikas mentioned instead of using
> perf_event_open (if it's somehow easier to build and maintain a new
> tool rather than reuse perf(1) ).
> 
> any or all of the above are fine. But please don't choose group
> semantics that will prevent us from eventually supporting full
> perf-like behavior or that we already know explode in user's face.

I'm trying hard to find a way to do this. I.e. start with a patch
that has limited capabilities and needs a custom tool, but can later
grow into something that meets your needs.

-Tony

[1] Lazy allocation means possibly discovering that there is no free RMID in
the middle of a context switch ... not willing to go there.


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-03 Thread David Carrillo-Cisneros
On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony  wrote:
> On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
>> If we tie allocation groups and monitoring groups, we are tying the
>> meaning of CPUs and we'll have to choose between the CAT meaning or
>> the perf meaning.
>>
>> Let's allow semantics that will allow perf like monitoring to
>> eventually work, even if its not immediately supported.
>
> Would it work to make monitor groups be "task list only" or "cpu mask only"
> (unlike control groups that allow mixing).

That works, but please don't use chmod. Make it explicit by the group
position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

>
> Then the intel_rdt_sched_in() code could pick the RMID in ways that
> give you the perf(1) meaning. I.e. if you create a monitor group and assign
> some CPUs to it, then we will always load the RMID for that monitor group
> when running on those cpus, regardless of what group(s) the current process
> belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
> assign RMID using same rules as CLOSID (so measurements from a control group
> would track allocation policies).

I think that's very confusing for the user. A group's observed
behavior should be determined by its attributes and not change
depending on how other groups are configured. Think on multiple users
monitoring simultaneously.

>
> We are already planning that creating monitor only groups will change
> what is reported in the control group (e.g. you pull some tasks out of
> the control group to monitor them separately, so the control group only
> reports the tasks that you didn't move out for monitoring).

That's also confusing, and the work-around that Vikas proposed of two
separate files to enumerate tasks (one for control and one for
monitoring) breaks the concept of a task group.





From our discussions, we can support the use cases we care about
without weird corner cases, by having:
  - A set of allocation groups as they stand now. Either use the current
resctrl, or rename it to something like resdir/ctrl (before v4.10
sails).
  - A set of monitoring task groups. Either in a "tasks" folder in a
resmon fs  or in resdir/mon/tasks.
  - A set of monitoring CPU groups. Either in a "cpus" folder in a
resmon fs  or in resdir/mon/cpus.

So when a user measures a group (shown using the -G option, it could
as well be the -R Vikas wants):

1) perf stat -e llc_occupancy -G resdir/ctrl/g1
measures the CAT allocation group as if RMIDs were managed like CLOSIDs.

2) perf stat -e llc_occupancy -G resdir/mon/tasks/g1
measures the combined occupancy of all tasks in g1 (like a cgroups in
present perf).

3) perf stat -e llc_occupancy -C 
*XOR* perf stat -e llc_occupancy -G resdir/mon/cpus/g1
measures the combined occupancy of all tasks while ran in any CPU in
g1 (perf-like filtering, not the CAT way).

I know the present implementation scope is limited, so you could:
  - support 1) and/or 2) only
  - do a simple RMID management (e.g. same RMID all packages, allocate
RMID on creation or fail)
  - do the custom fs based tool that Vikas mentioned instead of using
perf_event_open (if it's somehow easier to build and maintain a new
tool rather than reuse perf(1) ).

any or all of the above are fine. But please don't choose group
semantics that will prevent us from eventually supporting full
perf-like behavior, or that we already know will explode in users' faces.

Thanks,
David


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-03 Thread Luck, Tony
On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> If we tie allocation groups and monitoring groups, we are tying the
> meaning of CPUs and we'll have to choose between the CAT meaning or
> the perf meaning.
> 
> Let's allow semantics that will allow perf like monitoring to
> eventually work, even if its not immediately supported.

Would it work to make monitor groups be "task list only" or "cpu mask only"
(unlike control groups that allow mixing).

Then the intel_rdt_sched_in() code could pick the RMID in ways that
give you the perf(1) meaning. I.e. if you create a monitor group and assign
some CPUs to it, then we will always load the RMID for that monitor group
when running on those cpus, regardless of what group(s) the current process
belongs to.  But if you didn't create any cpu-only monitor groups, then we'd
assign RMID using same rules as CLOSID (so measurements from a control group
would track allocation policies).
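
Roughly in code (structures and names invented for illustration, not the
actual resctrl internals), the pick at sched_in would be:

#include <stdbool.h>
#include <stdint.h>

/* simplified stand-ins for the resctrl group structures */
struct mon_group {
    uint32_t rmid;
    bool valid;
};

struct task_info {
    const struct mon_group *task_group;    /* NULL if in the default group */
};

struct cpu_info {
    const struct mon_group *cpu_mon_group; /* NULL if no cpu-only group claims it */
    uint32_t default_rmid;
};

/*
 * Pick the RMID to write into PQR_ASSOC when @task is scheduled in on
 * the CPU described by @cpu:
 *  1. a cpu-only monitor group that owns this CPU always wins,
 *     regardless of the task (the perf(1)-style CPU meaning);
 *  2. otherwise fall back to the CAT-style precedence: the task's own
 *     group first, then the CPU's default.
 */
uint32_t pick_rmid(const struct task_info *task, const struct cpu_info *cpu)
{
    if (cpu->cpu_mon_group && cpu->cpu_mon_group->valid)
        return cpu->cpu_mon_group->rmid;

    if (task->task_group && task->task_group->valid)
        return task->task_group->rmid;

    return cpu->default_rmid;
}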

We are already planning that creating monitor only groups will change
what is reported in the control group (e.g. you pull some tasks out of
the control group to monitor them separately, so the control group only
reports the tasks that you didn't move out for monitoring).

-Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread David Carrillo-Cisneros
Something to be aware of is that CAT cpus don't work the way CPU
filtering works in perf:

If I have the following CAT groups:
 - default group with task TD
 - group GC1 with CPU0 and CLOSID 1
 - group GT1 with no CPUs and task T1 and CLOSID2
 - TD and T1 run on CPU0.

Then T1 will use CLOSID2 and TD will use CLOSID1. Some allocations done on
CPU0 (those from T1) did not use CLOSID1.

Now, if I have the same setup in monitoring groups and I were to read
llc_occupancy for the RMID of GC1, I'd read llc_occupancy for TD only,
and have a blind spot on T1. That's not how CPU events work in perf.

So CPUs have a different meaning in CAT than in perf.

The above is another reason to separate the allocation and the
monitoring groups. Having:
  - Independent allocation and monitoring groups.
  - Independent CPU and task grouping.
would allow us semantics that monitor CAT groups and can eventually be
extended to also monitor the perf way, that is, to support:
  - filter by task
  - filter by task group (cgroup or monitoring group or whatever).
  - filter by CPU (the perf way)
  - combinations of task/task_group and CPU (the perf way)

If we tie allocation groups and monitoring groups, we are tying the
meaning of CPUs and we'll have to choose between the CAT meaning or
the perf meaning.

Let's allow semantics that will allow perf like monitoring to
eventually work, even if its not immediately supported.

Thanks,
David


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread David Carrillo-Cisneros
On Thu, Feb 2, 2017 at 3:41 PM, Luck, Tony  wrote:
> On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
>> There is no need to change perf(1) to support
>>  # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>>
>> the PMU can work with resctrl to provide the support through
>> perf_event_open, with the advantage that tools other than perf could
>> also use it.
>
> I agree it would be better to expose the counters through
> a standard perf_event_open() interface ... but we don't seem
> to have had much luck doing that so far.
>
> That would need the requirements to be re-written with the
> focus of what does resctrl need to do to support each of the
> perf(1) command line modes of operation.  The fact that these
> counters work rather differently from normal h/w counters
> has resulted in massively complex volumes of code trying
> to map them into what perf_event_open() expects.
>
> The key points of weirdness seem to be:
>
> 1) We need to allocate an RMID for the duration of monitoring. While
>    there are quite a lot of RMIDs, it is easy to envision scenarios
>    where there are not enough.
>
> 2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a
>    process of interest is running.
>
> 3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events
>
> 4) For llc_occupancy the count can change even when none of the processes
>    are running because cache lines are evicted
>
> 5) llc_occupancy measures the delta, not the absolute occupancy. To
>    get a good result requires monitoring from process creation (or
>    lots of patience, or the nuclear option "wbinvd").
>
> 6) RMID counters are package scoped
>
>
> These result in all sorts of hard to resolve situations. E.g. you are
> monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
> looking at the cache occupancy of PID=234 using RMID=45. The scheduler
> decides to run my process on your CPU.  We can only load one RMID, so
> one of us will be disappointed (unless we have some crazy complex code
> where your instance of perf borrows RMID=45 and reads out the local
> byte count on sched_in() and sched_out() to add to the running count
> you were keeping against RMID=22).
>
> How can we document such restrictions for people who haven't been
> digging in this code for over a year?
>
> I think a perf_event_open() interface would make some simple cases
> work, but result in some swearing once people start running multiple
> complex monitors at the same time.

More problems:

7) Time multiplexing of RMIDs is hard because llc_occupancy cannot be reset.

8) Only one RMID per CPU can be loaded at a time into PQR_ASSOC.

Most of the complexity in past attempts was mainly caused by:
  A. Task events being defined as system-wide and not package-wide.
What you describe in points (4) and (6) made this complicated.
  B. The cgroup hierarchy, due to (7) and (8).

A and B caused the bulk of the code by complicating RMID assignment,
reading and rotation.

Now that we've learned from past experience, we have defined
per-domain monitoring and use flat groups. FWICT, that's enough to allow
a simple implementation that can be expressed through perf_event_open.
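
As a rough sketch of how small the user side could be under that model: the
PMU "type" file below is the usual sysfs location for dynamic PMUs, but the
config value used for llc_occupancy is an assumption for illustration (a
real tool would parse the event description from sysfs), and error handling
is minimal.

#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

int open_llc_occupancy(pid_t pid, int cpu)
{
    struct perf_event_attr attr;
    FILE *f;
    int type = -1;

    /* dynamic PMUs advertise their type id in sysfs */
    f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
    if (!f)
        return -1;
    if (fscanf(f, "%d", &type) != 1)
        type = -1;
    fclose(f);
    if (type < 0)
        return -1;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = type;
    attr.config = 1;   /* assumed encoding of llc_occupancy, for illustration */

    /* pid/cpu follow the usual perf_event_open() conventions */
    return syscall(SYS_perf_event_open, &attr, pid, cpu, -1, 0);
}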


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread Luck, Tony
On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
> There is no need to change perf(1) to support
>  # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
> 
> the PMU can work with resctrl to provide the support through
> perf_event_open, with the advantage that tools other than perf could
> also use it.

I agree it would be better to expose the counters through
a standard perf_event_open() interface ... but we don't seem
to have had much luck doing that so far.

That would need the requirements to be re-written with the
focus of what does resctrl need to do to support each of the
perf(1) command line modes of operation.  The fact that these
counters work rather differently from normal h/w counters
has resulted in massively complex volumes of code trying
to map them into what perf_event_open() expects.

The key points of weirdness seem to be:

1) We need to allocate an RMID for the duration of monitoring. While
   there are quite a lot of RMIDs, it is easy to envision scenarios
   where there are not enough.

2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process
   of interest is running.

3) An RMID is shared by llc_occupancy, local_bytes and total_bytes events

4) For llc_occupancy the count can change even when none of the processes
   are running because cache lines are evicted

5) llc_occupancy measures the delta, not the absolute occupancy. To
   get a good result requires monitoring from process creation (or
   lots of patience, or the nuclear option "wbinvd").

6) RMID counters are package scoped


These result in all sorts of hard to resolve situations. E.g. you are
monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
looking at the cache occupancy of PID=234 using RMID=45. The scheduler
decides to run my process on your CPU.  We can only load one RMID, so
one of us will be disappointed (unless we have some crazy complex code
where your instance of perf borrows RMID=45 and reads out the local
byte count on sched_in() and sched_out() to add to the running count
you were keeping against RMID=22).

How can we document such restrictions for people who haven't been
digging in this code for over a year?

I think a perf_event_open() interface would make some simple cases
work, but result in some swearing once people start running multiple
complex monitors at the same time.

-Tony


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread David Carrillo-Cisneros
On Thu, Feb 2, 2017 at 11:33 AM, Luck, Tony  wrote:
>>> Nice to have:
>>> 1)  Readout using "perf(1)" [subset of modes that make sense ... tying
>>> monitoring to resctrl file system will make most command line usage of
>>> perf(1) close to impossible.
>>
>>
>> We discussed this offline and I still disagree that it is close to
>> impossible to use perf and perf_event_open. In fact, I think it's very
>> simple :
>
> Maybe s/most/many/ ?
>
> The issue here is that we are going to define which tasks and cpus are being
> monitored *outside* of the perf command.  So usage like:
>
> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>
> are completely out of scope ... we aren't planning to change the perf(1)
> command to know about creating a CQM monitor group, assigning the
> PID of {command} to it, and then report on llc_occupancy.
>
> So perf(1) usage is only going to support modes where it attaches to some
> monitor group that was previously established.  The "-C 2" option to monitor
> CPU 2 is certainly plausible ... assuming you set up a monitor group to track
> what is happening on CPU 2 ... I just don't know how perf(1) would know the
> name of that group.

There is no need to change perf(1) to support
 # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

the PMU can work with resctrl to provide the support through
perf_event_open, with the advantage that tools other than perf could
also use it.

I'd argue it's more stable and has fewer corner cases if the
task_mongroups get extra RMIDs for the task events attached to them
than if userspace tools create and destroy groups and move tasks
behind the scenes.

I provided implementation details in the write-up I shared offline on
Monday. If "easy monitoring" of a stand-alone task becomes a
requirement, we can dig into the pros and cons of implementing it in
kernel vs user space.

>
> Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
> overloading "-g" and have perf(1) pick appropriately from cgroups or rdtgroups
> depending on event type.

I see it more like generalizing the -G option to represent a task
group that can be a cgroup or a PMU specific one.

Currently perf(1) simply translates the argument of the -G option
into a file descriptor. My idea doesn't change that, it just makes the
perf tool look for a "task_group_root" file in the PMU folder and use it
as the base path for the file descriptor. If a PMU doesn't have
such a file, then perf(1) uses the perf cgroup mounting point, as it
does now. That makes for a very simple implementation on the perf tool
side.
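
For what it's worth, the tool-side lookup could be as small as the sketch
below. "task_group_root" is the proposed file and does not exist today, and
the cgroup path is just a hard-coded fallback for illustration (real perf
discovers the mount point properly):

#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int open_group_fd(const char *pmu_sysfs_dir, const char *group)
{
    char path[PATH_MAX];
    char base[PATH_MAX] = "/sys/fs/cgroup/perf_event";  /* fallback */
    FILE *f;

    /* if the PMU advertises its own task-group root, use that instead */
    snprintf(path, sizeof(path), "%s/task_group_root", pmu_sysfs_dir);
    f = fopen(path, "r");
    if (f) {
        if (fgets(base, sizeof(base), f))
            base[strcspn(base, "\n")] = '\0';
        fclose(f);
    }

    /* the resulting fd is what would be handed to perf_event_open() */
    snprintf(path, sizeof(path), "%s/%s", base, group);
    return open(path, O_RDONLY);
}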


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread Shivappa Vikas


Hello Peterz/Andi,

On Thu, 2 Feb 2017, Luck, Tony wrote:


Nice to have:
1)  Readout using "perf(1)" [subset of modes that make sense ... tying
monitoring to resctrl file system will make most command line usage of
perf(1) close to impossible.



Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
overloading "-g" and have perf(1) pick appropriately from cgroups or rdtgroups
depending on event type.


Assume we build support to monitor the existing resctrl CAT groups like
Thomas suggested. For the perf interface, would something like the below
seem reasonable, or a disaster (given that we have a new -R option specific
to this PMU, which works only on this PMU)?


# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

Now monitor the group p1 using perf. perf would have a new option -R to
monitor the resctrl groups. perf would still have a cqm event like today's
intel_cqm/llc_occupancy, which however supports only the -R mode and none of
-C, -t, -G etc. So pretty much the -R works like a -G, except that it works
on the resctrl fs and not on a perf_cgroup.
The PMU would have a flag telling the perf user mode that only the
llc_occupancy event is supported with -R.


# perf stat -e intel_cqm/llc_occupancy -R p1

-Vikas



-Tony



RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread Luck, Tony
>> Nice to have:
>> 1)  Readout using "perf(1)" [subset of modes that make sense ... tying
>> monitoring to resctrl file system will make most command line usage of
>> perf(1) close to impossible.
>
>
> We discussed this offline and I still disagree that it is close to
> impossible to use perf and perf_event_open. In fact, I think it's very
> simple :

Maybe s/most/many/ ?

The issue here is that we are going to define which tasks and cpus are being
monitored *outside* of the perf command.  So usage like:

# perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

are completely out of scope ... we aren't planning to change the perf(1)
command to know about creating a CQM monitor group, assigning the
PID of {command} to it, and then report on llc_occupancy.

So perf(1) usage is only going to support modes where it attaches to some
monitor group that was previously established.  The "-C 2" option to monitor
CPU 2 is certainly plausible ... assuming you set up a monitor group to track
what is happening on CPU 2 ... I just don't know how perf(1) would know the
name of that group.

Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
overloading "-g" and have perf(1) pick appropriately from cgroups or rdtgroups
depending on event type.

-Tony


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread Shivappa Vikas



On Wed, 1 Feb 2017, Yu, Fenghua wrote:


From: Andi Kleen [mailto:a...@firstfloor.org]
"Luck, Tony"  writes:

9)  Measure per logical CPU (pick active RMID in same precedence for
    task/cpu as CAT picks CLOSID)
10) Put multiple CPUs into a group


I'm not sure this is a real requirement. It's just an optimization, right?
If you can assign policies to threads, you can implicitly set it per CPU
through affinity (or the other way around).
The only benefit would be possibly less context switch overhead, but if all
the thread (including idle) assigned to a CPU have the same policy it would
have the same results.

I suspect dropping this would likely simplify the interface significantly.


Assigning a pid P to a CPU and monitoring the P don't count all events 
happening on the CPU.
Other processes/threads (e.g. kernel threads) than the assigned P can run on 
the CPU.
Monitoring P assigned to the CPU is not equal to monitoring the CPU in a lot 
cases.


This matches the use case where a bunch of real-time tasks which have no
CLOSid (kernel threads or others in the root group) want to run exclusively
on a cpu and are configured so. If any other tasks from another class of
service run there, we don't want them to pollute the cache - hence they
choose their own CLOSid.


Now, in order to measure this, RMIDs need to match the same policy as CAT.

Thanks,
Vikas



Thanks.

-Fenghua



RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-02 Thread Luck, Tony
>> 7)  Must be able to measure based on existing resctrl CAT group
>> 8)  Can get measurements for subsets of tasks in a CAT group (to find 
>> the guys hogging the resources)
>> 9)  Measure per logical CPU (pick active RMID in same precedence for 
>> task/cpu as CAT picks CLOSID)
>
> I agree that "Measure per logical CPU" is a requirement, but why is
> "pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
> one as well? Are we set on handling RMIDs the way CLOSIDs are
> handled? there are drawbacks to do so, one is that it would make
> impossible to do CPU monitoring and CPU filtering the way is done for
> all other PMUs.

I'm too focused on monitoring existing CAT groups.  If we move the
parenthetical remark from item 9 to item 7, then I think it is better.  When
monitoring a CAT group we need to monitor exactly the processes that are
controlled by the CAT group. So RMID must match CLOSID, and the precedence
rules make that work.

For other monitoring cases we can do things differently - so long as we have
a way to express what we want, and we don't pile a ton of code into context
switch to figure out which RMID is to be loaded into PQR_ASSOC.

I thought of another requirement this morning:

N+1) When we set up monitoring we must allocate all the resources we need
     (or fail the setup if we can't get them). Not allowed to error in the
     middle of monitoring because we can't find a free RMID.
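
Here is one shape that "allocate everything up front or fail" could take.
Everything in it is invented for illustration (the toy allocator,
NUM_PACKAGES, the error convention); the point is only the
reserve-all-then-rollback pattern:

#include <errno.h>
#include <stdint.h>

#define NUM_PACKAGES       4
#define RMIDS_PER_PACKAGE  8
#define RMID_INVALID       ((uint32_t)-1)

/* toy per-package RMID allocator, bitmap of in-use RMIDs */
static uint64_t rmid_used[NUM_PACKAGES];

static uint32_t alloc_rmid(int pkg)
{
    for (uint32_t r = 0; r < RMIDS_PER_PACKAGE; r++) {
        if (!(rmid_used[pkg] & (1ULL << r))) {
            rmid_used[pkg] |= 1ULL << r;
            return r;
        }
    }
    return RMID_INVALID;               /* this package is out of RMIDs */
}

static void free_rmid(int pkg, uint32_t rmid)
{
    rmid_used[pkg] &= ~(1ULL << rmid);
}

/*
 * Reserve an RMID on every package when the monitor group is set up,
 * rolling back if any package is exhausted, so that monitoring can
 * never fail later for lack of a free RMID.
 */
int mon_group_setup(uint32_t rmid_out[NUM_PACKAGES])
{
    for (int pkg = 0; pkg < NUM_PACKAGES; pkg++) {
        rmid_out[pkg] = alloc_rmid(pkg);
        if (rmid_out[pkg] == RMID_INVALID) {
            while (--pkg >= 0)
                free_rmid(pkg, rmid_out[pkg]);
            return -ENOSPC;
        }
    }
    return 0;
}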

-Tony


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread Yu, Fenghua
> From: Andi Kleen [mailto:a...@firstfloor.org]
> "Luck, Tony"  writes:
> > 9)  Measure per logical CPU (pick active RMID in same precedence for
> >     task/cpu as CAT picks CLOSID)
> > 10) Put multiple CPUs into a group
> 
> I'm not sure this is a real requirement. It's just an optimization, right?
> If you can assign policies to threads, you can implicitly set it per CPU
> through affinity (or the other way around).
> The only benefit would be possibly less context switch overhead, but if all
> the thread (including idle) assigned to a CPU have the same policy it would
> have the same results.
> 
> I suspect dropping this would likely simplify the interface significantly.

Assigning a pid P to a CPU and monitoring P doesn't count all events
happening on the CPU. Other processes/threads (e.g. kernel threads) than the
assigned P can run on the CPU. Monitoring P assigned to the CPU is not equal
to monitoring the CPU in a lot of cases.

Thanks.

-Fenghua


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread Andi Kleen
> > I'm not sure this is a real requirement. It's just an optimization,
> > right? If you can assign policies to threads, you can implicitly set it
> > per CPU through affinity (or the other way around).
> 
> That's difficult when distinct users/systems do monitoring and system
> management.  What if the cluster manager decides to change affinity
> for a task after the monitoring service has initiated monitoring a CPU
> in the way you describe?

Why would you want to monitor a CPU if you don't know what it is
running?  The results would be meaningless. So you really want
to integrate those two services.

> 
> > The only benefit would be possibly less context switch overhead,
> > but if all the thread (including idle) assigned to a CPU have the
> > same policy it would have the same results.
> 
> I think another of the reasons for the CPU monitoring requirement is
> to monitor interruptions in CPUs running the idle thread. In CAT,

idle threads are just threads, so they could be just exposed
to perf (e.g. combination of pid 0 + cpu filter)


> Also, if perf's like monitoring is supported, it'd allow something like
> 
>   perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

This would work without a special API.

-Andi


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread David Carrillo-Cisneros
On Wed, Feb 1, 2017 at 4:35 PM, Andi Kleen  wrote:
> "Luck, Tony"  writes:
>> 9)  Measure per logical CPU (pick active RMID in same precedence for
>>     task/cpu as CAT picks CLOSID)
>> 10) Put multiple CPUs into a group
>
> I'm not sure this is a real requirement. It's just an optimization,
> right? If you can assign policies to threads, you can implicitly set it
> per CPU through affinity (or the other way around).

That's difficult when distinct users/systems do monitoring and system
management.  What if the cluster manager decides to change affinity
for a task after the monitoring service has initiated monitoring a CPU
in the way you describe?

> The only benefit would be possibly less context switch overhead,
> but if all the thread (including idle) assigned to a CPU have the
> same policy it would have the same results.

I think another of the reasons for the CPU monitoring requirement is
to monitor interruptions in CPUs running the idle thread. In CAT,
those interruptions use the CPU's CLOSID. Here they'd use the CPU's
RMID. Since RMIDs are scarce, CPUs can be aggregated into groups to
save many.

Also, if perf's like monitoring is supported, it'd allow something like

  perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

Thanks,
David


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread Andi Kleen
"Luck, Tony"  writes:
> 9)Measure per logical CPU (pick active RMID in same precedence for 
> task/cpu as CAT picks CLOSID)
> 10)   Put multiple CPUs into a group

I'm not sure this is a real requirement. It's just an optimization,
right? If you can assign policies to threads, you can implicitly set it
per CPU through affinity (or the other way around).
The only benefit would be possibly less context switch overhead,
but if all the threads (including idle) assigned to a CPU have the
same policy it would have the same results.

I suspect dropping this would likely simplify the interface significantly.

-Andi


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread David Carrillo-Cisneros
On Wed, Feb 1, 2017 at 12:08 PM Luck, Tony  wrote:
>
> > I was asking for requirements, not a design proposal. In order to make a
> > design you need a requirements specification.
>
> Here's what I came up with ... not a fully baked list, but should allow for 
> some useful
> discussion on whether any of these are not really needed, or if there is a 
> glaring hole
> that misses some use case:
>
> 1)  Able to measure using all supported events (currently L3 occupancy, 
> Total B/W, Local B/W)
> 2)  Measure per thread
> 3)  Including kernel threads
> 4)  Put multiple threads into a single measurement group (forced by h/w 
> shortage of RMIDs, but probably good to have anyway)

Even with infinite hw RMIDs you want to be able to have one RMID per
thread group, to avoid reading a potentially large list of RMIDs every
time you measure one group's event (with the delay and error associated
with measuring many RMIDs whose values fluctuate rapidly).

> 5)  New threads created inherit measurement group from parent
> 6)  Report separate results per domain (L3)
> 7)  Must be able to measure based on existing resctrl CAT group
> 8)  Can get measurements for subsets of tasks in a CAT group (to find the 
> guys hogging the resources)
> 9)  Measure per logical CPU (pick active RMID in same precedence for 
> task/cpu as CAT picks CLOSID)

I agree that "Measure per logical CPU" is a requirement, but why is
"pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
 one as well? Are we set on handling RMIDs the way CLOSIDs are
handled? there are drawbacks to do so, one is that it would make
impossible to do CPU monitoring and CPU filtering the way is done for
all other PMUs.

i.e. the following commands (or their equivalent in whatever other API
you create) won't work:

a) perf stat -e intel_cqm/total_bytes/ -C 2

or

b.1) perf stat -e intel_cqm/total_bytes/ -C 2 

or

b.2) perf stat -e intel_cqm/llc_occupancy/ -a 

in (a) because many RMIDs may run on the CPU and, in the (b) cases, because
the same measurement group's RMID will be used across all CPUs. I know
this is similar to how it is in CAT, but CAT was never intended to do
monitoring. We can do the CAT way and the perf way, or not, but if we
drop support for perf-like CPU monitoring, it must be explicitly
stated and not be an implicit consequence of a design choice leaked into
requirements.

> 10) Put multiple CPUs into a group


11) Able to measure across CAT groups.  So that a user can:
  A) measure a task that runs on CPUs that are in different CAT groups
(one of Thomas' use cases FWICT), and
  B) measure tasks even if they change their CAT group (my use case).

>
> Nice to have:
> 1)  Readout using "perf(1)" [subset of modes that make sense ... tying 
> monitoring to resctrl file system will make most command line usage of 
> perf(1) close to impossible.


We discussed this offline and I still disagree that it is close to
impossible to use perf and perf_event_open. In fact, I think it's very
simple:

a) We stretch the usage of the pid parameter in perf_event_open to
also allow a PMU specific task group fd (as of now it's either a PID
or a cgroup fd).
b) PMUs that can handle non-cgroup task groups have a special PMU_CAP
flag to signal the generic code to not resolve the fd to a cgroup
pointer and, instead, save it as is in struct perf_event (a few lines
of code).
c) The PMU takes care of resolving the task group's fd.

The above is ONE way to do it; there may be others. But there is a big
advantage in leveraging perf_event_open and easing integration with the
perf tool and the myriad of tools that use the perf API.
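
As a rough sketch, (a) could look like this from userspace (reusing
PERF_FLAG_PID_CGROUP for a non-cgroup fd is the proposed extension, not
current behaviour, and the event type below is only a stand-in):

#include <linux/perf_event.h>
#include <fcntl.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_group_event(const char *grp_path, unsigned long long config,
                            int cpu)
{
        struct perf_event_attr attr;
        int grp_fd = open(grp_path, O_RDONLY);  /* e.g. a task group dir */

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;              /* stand-in for the monitoring PMU */
        attr.config = config;                   /* e.g. llc_occupancy */

        /* pid = the group's fd; cgroup-style events are opened per CPU
         * today, so a CPU is passed as well. The PMU resolves the fd,
         * i.e. step (c). */
        return syscall(__NR_perf_event_open, &attr, grp_fd, cpu, -1,
                       PERF_FLAG_PID_CGROUP);
}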

12) Whatever fs or syscall is provided instead of perf syscalls, it
should provide total_time_enabled in the way perf does, otherwise it is
hard to interpret MBM values.
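
(For illustration of why 12) matters for MBM: bandwidth has to be computed
over the enabled time, roughly as below; the helper name is made up:)

/* Turn two (total_bytes, total_time_enabled) samples into MB/s. */
static double mbm_mbps(unsigned long long bytes_prev,
                       unsigned long long bytes_now,
                       unsigned long long ena_ns_prev,
                       unsigned long long ena_ns_now)
{
        double secs = (ena_ns_now - ena_ns_prev) / 1e9;

        return secs > 0 ? (bytes_now - bytes_prev) / (1024.0 * 1024.0) / secs
                        : 0.0;
}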

>
> -Tony
>
>


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread Luck, Tony
> I was asking for requirements, not a design proposal. In order to make a
> design you need a requirements specification.

Here's what I came up with ... not a fully baked list, but should allow for
some useful discussion on whether any of these are not really needed, or if
there is a glaring hole that misses some use case:

1)  Able to measure using all supported events (currently L3 occupancy,
    Total B/W, Local B/W)
2)  Measure per thread
3)  Including kernel threads
4)  Put multiple threads into a single measurement group (forced by h/w
    shortage of RMIDs, but probably good to have anyway)
5)  New threads created inherit measurement group from parent
6)  Report separate results per domain (L3)
7)  Must be able to measure based on existing resctrl CAT group
8)  Can get measurements for subsets of tasks in a CAT group (to find the
    guys hogging the resources)
9)  Measure per logical CPU (pick active RMID in same precedence for
    task/cpu as CAT picks CLOSID)
10) Put multiple CPUs into a group

Nice to have:
1)  Readout using "perf(1)" [subset of modes that make sense ... tying
    monitoring to the resctrl file system will make most command line usage
    of perf(1) close to impossible].

-Tony




Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-23 Thread Peter Zijlstra
On Mon, Jan 23, 2017 at 10:47:44AM +0100, Thomas Gleixner wrote:
> So again: 
> 
>   Can please everyone involved write up their specific requirements
>   for CQM and stop spamming us with half baked design proposals?
> 
>   And I mean abstract requirements and not again something which is
>   referring to existing crap or some desired crap.
> 
> The complete list of requirements has to be agreed on before we talk about
> anything else.

So something along the lines of:

 A) need to create a (named) group of tasks
   1) group composition needs to be dynamic; i.e. we can add/remove member
      tasks at any time.
   2) a task can only belong to _one_ group at any one time.
   3) grouping need not be hierarchical?

 B) for each group, we need to set a CAT mask
   1) this CAT mask must be dynamic; i.e. we can, during the existence of
      the group, change the mask at any time.

 C) for each group, we need to monitor CQM bits
   1) this monitor need not change


Supporting Use-Cases:

A.1: The Job (or VM) can have a dynamic task set
B.1: Dynamic QoS for each Job (or VM) as demand / load changes
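
(A minimal sketch of A.1/B.1 against the resctrl interface that exists
today; the group path, PID and mask below are made up, and the mask is
written in the notation used elsewhere in this thread:)

#include <stdio.h>

static int resctrl_write(const char *path, const char *val)
{
        FILE *f = fopen(path, "w");

        if (!f)
                return -1;
        fprintf(f, "%s\n", val);
        return fclose(f);
}

int main(void)
{
        /* A.1: group composition is dynamic, add a member task at any time */
        resctrl_write("/sys/fs/resctrl/job0/tasks", "1234");
        /* B.1: the CAT mask is dynamic, change it while the group exists */
        resctrl_write("/sys/fs/resctrl/job0/schemata", "L3:0=0x0f");
        return 0;
}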



Feel free to expand etc..


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-23 Thread Thomas Gleixner
On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner  wrote:
> > Can you please write up in an abstract way what the design requirements are
> > that you need. So far we are talking about implementation details and
> > unspecified wishlists, but what we really need is an abstract requirement.
> 
> My pleasure:
> 
> 
> Design Proposal for Monitoring of RDT Allocation Groups.

I was asking for requirements, not a design proposal. In order to make a
design you need a requirements specification.

So again: 

  Can please everyone involved write up their specific requirements
  for CQM and stop spamming us with half baked design proposals?

  And I mean abstract requirements and not again something which is
  referring to existing crap or some desired crap.

The complete list of requirements has to be agreed on before we talk about
anything else.

Thanks,

tglx








Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Shivappa Vikas



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:


On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
 wrote:



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:


On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner 
wrote:



On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:



If resctrl groups could lift the restriction of one resctl per CLOSID,
then the user can create many resctrl in the way perf cgroups are
created now. The advantage is that there wont be cgroup hierarchy!
making things much simpler. Also no need to optimize perf event
context switch to make llc_occupancy work.



So if I understand you correctly, then you want a mechanism to have
groups
of entities (tasks, cpus) and associate them to a particular resource
control group.

So they share the CLOSID of the control group and each entity group can
have its own RMID.

Now you want to be able to move the entity groups around between control
groups without losing the RMID associated to the entity group.

So the whole picture would look like this:

rdt ->  CTRLGRP -> CLOSID

mon ->  MONGRP  -> RMID

And you want to move MONGRP from one CTRLGRP to another.



Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.



Can you please write up in an abstract way what the design requirements are
that you need. So far we are talking about implementation details and
unspecified wishlists, but what we really need is an abstract requirement.



My pleasure:


Design Proposal for Monitoring of RDT Allocation Groups.

-

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible although
useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
CLOSIDs are much more scarce than RMIDs.

If we lift the condition of unique CLOSID, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID and RDT_Allocation must maintain the schemata
to CLOSID relationship (similarly to what the previous CAT driver used
to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
now: adding an element removes it from its previous CTRLGRP.


This change would allow further partitioning the allocation groups
into (allocation, monitoring) groups as follows:

With allocation only:
              CTRLGRP0      CTRLGRP_ALLOC_ONLY
   schemata:  L3:0=0xff0    L3:0=0x00f
   tasks:     PID0          P0_0,P0_1,P1_0,P1_1
   cpus:      0x3           0xC



Not clear what the PID0 and P0_0 mean ?


PID0, and P*_* are arbitrary PIDs. The tasks file works the same as it
does now in RDT. I am not changing that.



If you have to support something like MONGRP and CTRLGRP overall you want to
allow for a task to be present in multiple groups ?


I am not proposing to support MONGRP and CTRLGRP. I am proposing to
allow monitoring of CTRLGRPs only.



If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:
              CTRLGRP0      CTRLGRP1     CTRLGRP2     CTRLGRP3
   schemata:  L3:0=0xff0    L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
   tasks:     PID0                       P0_0,P0_1    P1_0,P1_1
   cpus:      0x3           0xC          0x0          0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
(L3,0).


Now we can ask perf to monitor any of the CTRLGRPs independently -once
we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
The perf_event will reserve and assign the RMID to the monitored
CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
(CLOSID and RMID), so perf won't have to.
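
(As a sketch, that per context switch update amounts to a single MSR write;
the helper name below is made up, the layout is the documented one with the
RMID in the low bits and the CLOSID in bits 63:32:)

static inline void pqr_assoc_switch(u32 rmid, u32 closid)
{
        /* IA32_PQR_ASSOC: RMID[9:0], COS[63:32] */
        wrmsrl(MSR_IA32_PQR_ASSOC, ((u64)closid << 32) | rmid);
}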



This can be solved by supporting just the -t in perf and a new option in perf
to support resctrl group monitoring (something similar to -R). That way we
provide the flexible granularity to monitor tasks independent of whether
they are in any resctrl group (and hence also a subset).


One of the key points of my proposal is to remove monitoring PIDs
independently. That simplifies things by letting RDT handle CLOSIDs
and RMIDs together.



CTRLGRP     TASKS        MASK
CTRLGRP1    PID1,PID2    L3:0=0xf,1=0xf0
CTRLGRP2    PID3,PID4    L3:0=0xf0,1=0xf00

#perf stat -e llc_occupancy -R CTRLGRP1

#perf stat -e llc_occupancy -t PID3,PID4

The RMID allocation is independent of the resctrl CLOSID allocation and hence
the RMID is not always married to the CLOSID, which seems like the requirement
here.


It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
change in my proposal.



OR

We could have CTRLGRPs with control_only, monitor_only or control_monitor
options.

Now a task could be present in both a control_only and a monitor_only
group, or it could be present only in a control_monitor group. The
transitions from one state to another are guarded by this same principle.

CTRLGRP 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread David Carrillo-Cisneros
On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas
 wrote:
>
>
> On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
>
>> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner 
>> wrote:
>>>
>>>
>>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:


 If resctrl groups could lift the restriction of one resctl per CLOSID,
 then the user can create many resctrl in the way perf cgroups are
 created now. The advantage is that there wont be cgroup hierarchy!
 making things much simpler. Also no need to optimize perf event
 context switch to make llc_occupancy work.
>>>
>>>
>>> So if I understand you correctly, then you want a mechanism to have
>>> groups
>>> of entities (tasks, cpus) and associate them to a particular resource
>>> control group.
>>>
>>> So they share the CLOSID of the control group and each entity group can
>>> have its own RMID.
>>>
>>> Now you want to be able to move the entity groups around between control
>>> groups without losing the RMID associated to the entity group.
>>>
>>> So the whole picture would look like this:
>>>
>>> rdt ->  CTRLGRP -> CLOSID
>>>
>>> mon ->  MONGRP  -> RMID
>>>
>>> And you want to move MONGRP from one CTRLGRP to another.
>>
>>
>> Almost, but not quite. My idea is no have MONGRP and CTRLGRP to be the
>> same thing. Details below.
>>
>>>
>>> Can you please write up in a abstract way what the design requirements
>>> are
>>> that you need. So far we are talking about implementation details and
>>> unspecfied wishlists, but what we really need is an abstract requirement.
>>
>>
>> My pleasure:
>>
>>
>> Design Proposal for Monitoring of RDT Allocation Groups.
>>
>> -
>>
>> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
>> cache bitmask (CBM) per resource. Non-unique CBM are possible although
>> useless. An unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
>> CLOSIDs are much more scarce than RMIDs.
>>
>> If we lift the condition of unique CLOSID, then the user can create
>> multiple CTRLGRPs with the same schemata. Internally, those CTRCGRP
>> would share the CLOSID and RDT_Allocation must maintain the schemata
>> to CLOSID relationship (similarly to what the previous CAT driver used
>> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
>> now: adding an element removes it from its previous CTRLGRP.
>>
>>
>> This change would allow further partitioning the allocation groups
>> into (allocation, monitoring) groups as follows:
>>
>> With allocation only:
>>CTRLGRP0 CTRLGRP_ALLOC_ONLY
>> schemata:  L3:0=0xff0   L3:0=x00f
>> tasks:   PID0   P0_0,P0_1,P1_0,P1_1
>> cpus:0x30xC
>
>
> Not clear what the PID0 and P0_0 mean ?

PID0, and P*_* are arbitrary PIDs. The tasks file works the same as it
does now in RDT. I am not changing that.

>
> If you have to support something like MONGRP and CTRLGRP overall you want to
> allow for a task to be present in multiple groups ?

I am not proposing to support MONGRP and CTRLGRP. I am proposing to
allow monitoring of CTRLGRPs only.

>>
>> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
>> independently, with the new model we could create:
>>CTRLGRP0 CTRLGRP1 CTRLGRP2CTRLGRP3
>> schemata:  L3:0=0xff0   L3:0=x00fL3:0=0x00f L3:0=0x00f
>> tasks:   PID0   P0_0,P0_1 P1_0, P1_1
>> cpus:0x3   0xC  0x0 0x0
>>
>> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP2 would share the CLOSID for
>> (L3,0).
>>
>>
>> Now we can ask perf to monitor any of the CTRLGRPs independently -once
>> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
>> The perf_event will reserve and assign the RMID to the monitored
>> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
>> (CLOSID and RMID), so perf won't have to.
>
>
> This can be solved by suporting just the -t in perf and a new option in perf
> to suport resctrl group monitoring (something similar to -R). That way we
> provide the flexible granularity to monitor tasks independent of whether
> they are in any resctrl group (and hence also a subset).

One of the key points of my proposal is to remove monitoring PIDs
independently. That simplifies things by letting RDT handle CLOSIDs
and RMIDs together.

>
> CTRLGRP TASKS   MASK
> CTRLGRP1PID1,PID2   L3:0=0Xf,1=0xf0
> CTRLGRP2PID3,PID4   L3:0=0Xf0,1=0xf00
>
> #perf stat -e llc_occupancy -R CTRLGRP1
>
> #perf stat -e llc_occupancy -t PID3,PID4
>
> The RMID allocation is independent of resctrl CLOSid allocation and hence
> the RMID is not always married to CLOS which seems like the requirement
> here.

It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
change in my proposal.

>
> OR
>
> We could have CTRLGRPs with control_only, 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Shivappa Vikas



On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:


On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner  wrote:


On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:


If resctrl groups could lift the restriction of one resctl per CLOSID,
then the user can create many resctrl in the way perf cgroups are
created now. The advantage is that there wont be cgroup hierarchy!
making things much simpler. Also no need to optimize perf event
context switch to make llc_occupancy work.


So if I understand you correctly, then you want a mechanism to have groups
of entities (tasks, cpus) and associate them to a particular resource
control group.

So they share the CLOSID of the control group and each entity group can
have its own RMID.

Now you want to be able to move the entity groups around between control
groups without losing the RMID associated to the entity group.

So the whole picture would look like this:

rdt ->  CTRLGRP -> CLOSID

mon ->  MONGRP  -> RMID

And you want to move MONGRP from one CTRLGRP to another.


Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.



Can you please write up in an abstract way what the design requirements are
that you need. So far we are talking about implementation details and
unspecified wishlists, but what we really need is an abstract requirement.


My pleasure:


Design Proposal for Monitoring of RDT Allocation Groups.
-

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible although
useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
CLOSIDs are much more scarce than RMIDs.

If we lift the condition of unique CLOSID, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID and RDT_Allocation must maintain the schemata
to CLOSID relationship (similarly to what the previous CAT driver used
to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
now: adding an element removes it from its previous CTRLGRP.


This change would allow further partitioning the allocation groups
into (allocation, monitoring) groups as follows:

With allocation only:
              CTRLGRP0      CTRLGRP_ALLOC_ONLY
   schemata:  L3:0=0xff0    L3:0=0x00f
   tasks:     PID0          P0_0,P0_1,P1_0,P1_1
   cpus:      0x3           0xC


Not clear what the PID0 and P0_0 mean ?

If you have to support something like MONGRP and CTRLGRP overall 
you want to allow for a task to be present in multiple groups ?




If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:
              CTRLGRP0      CTRLGRP1     CTRLGRP2     CTRLGRP3
   schemata:  L3:0=0xff0    L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
   tasks:     PID0                       P0_0,P0_1    P1_0,P1_1
   cpus:      0x3           0xC          0x0          0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for (L3,0).


Now we can ask perf to monitor any of the CTRLGRPs independently -once
we solve how to pass to perf what (CTRLGRP, resource_id) to monitor-.
The perf_event will reserve and assign the RMID to the monitored
CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
(CLOSID and RMID), so perf won't have to.


This can be solved by supporting just the -t in perf and a new option in perf to
support resctrl group monitoring (something similar to -R). That way we provide
the flexible granularity to monitor tasks 
independent of whether they are in any resctrl group (and hence also a subset).


CTRLGRP     TASKS        MASK
CTRLGRP1    PID1,PID2    L3:0=0xf,1=0xf0
CTRLGRP2    PID3,PID4    L3:0=0xf0,1=0xf00

#perf stat -e llc_occupancy -R CTRLGRP1

#perf stat -e llc_occupancy -t PID3,PID4

The RMID allocation is independent of the resctrl CLOSID allocation and hence
the RMID is not always married to the CLOSID, which seems like the requirement
here.


OR

We could have CTRLGRPs with control_only, monitor_only or control_monitor 
options.


Now a task could be present in both a control_only and a monitor_only
group, or it could be present only in a control_monitor group. The transitions
from one state to another are guarded by this same principle.


CTRLGRP     TASKS        MASK                 TYPE
CTRLGRP1    PID1,PID2    L3:0=0xf,1=0xf0      control_only
CTRLGRP2    PID3,PID4    L3:0=0xf0,1=0xf00    control_only
CTRLGRP3    PID2,PID3                         monitor_only
CTRLGRP4    PID5,PID6    L3:0=0xf0,1=0xf00    control_monitor

CTRLGRP3 allows you to monitor a set of tasks which is not bound to be in the
same CTRLGRP, and you can add or move tasks into this. Adding and removing
tasks is what's easily supported compared to the task granularity, although
such a thing could still be supported with the task granularity.

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Shivappa Vikas



On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:


On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
 wrote:

Resending including Thomas , also with some changes. Sorry for the spam

Based on Thomas and Peterz feedback I can think of two design
variants which target:

-Support monitoring and allocating using the same resctrl group.
user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpu)

-Also allows monitoring outside of resctrl so that user can
monitor subgroups who use the same closid. This mode can be used
when user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup, second version
builds a new interface resmon.


The second version would require to build a whole new set of tools,
deploy them and maintain them. Users will have to run perf for certain
events and resmon (or whatever is named the new tool) for rdt. I see
it as too complex and much prefer to keep using perf.


This was so that we have the flexibility to align the tools with the
requirements of the feature rather than twisting the perf behaviour, and also
to have that flexibility in the future when new RDT features are added
(something similar to what we did by introducing resctrl groups instead of
using cgroups for CAT).


Sometimes that's a lot simpler as we don't need a lot of code given the
limited/specific syscalls we need to support. Just like the resctrl fs, which
is specific to RDT.


It looks like your requirement is to be able to monitor a group of tasks 
independently apart from the resctrl groups?


The task option should provide the flexibility to monitor a bunch of tasks
independently, regardless of whether they are part of a resctrl group or not.
The assignment of RMIDs is controlled underneath by the kernel, so we can
optimize the usage of RMIDs, and RMIDs are tied to this group of tasks whether
it is a subset of a resctrl group or not.





The first version is close to the patches
sent with some additions/changes. This includes details of the design as
per Thomas/Peterz feedback.

1> First Design option: without modifying the resctrl and using perf



In this design everything in the resctrl interface works like
before (the info directory and the resource group files like tasks and
schemata all remain the same).


Monitor cqm using perf
--

perf can monitor individual tasks using the -t
option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

Below shows how user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, the user can put the same tasks that are in the
resctrl group into a cgroup.

To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
group p1 to cgroup g1

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring because
supporting cgroup 'task groups' and resctrl 'task groups' leads to
situations where:
if the groups intersect, then there is no way to know what
l3_allocations contribute to which group.

ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for g1 and p1 would be to allocate an RMID
for each task which can as well be done with the -t option.


That's simply recreating the resctrl group as a cgroup.

I think that the main advantage of doing allocation first is that we
could use the context switch in rdt allocation and greatly simplify
the pmu side of it.

If resctrl groups could lift the restriction of one resctl per CLOSID,
then the user can create many resctrl in the way perf cgroups are
created now. The advantage is that there wont be cgroup hierarchy!
making things much simpler. Also no need to optimize perf event
context switch to make llc_occupancy work.

Then we only need a way to express that monitoring must happen in a
resctl to the perf_event_open syscall.

My first thought is to have a "rdt_monitor" file per resctl group. A
user passes it to perf_event_open in the way cgroups are passed now.
We could extend the meaning of the flag PERF_FLAG_PID_CGROUP to also
cover rdt_monitor files. The syscall can figure if it's a cgroup or a
rdt_group. The rdt_monitoring PMU would only work with rdt_monitor
groups

Then the rdt_monitoring PMU will be pretty dumb, having neither task
nor CPU contexts. Just providing the pmu->read and pmu->event_init
functions.
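
(Very roughly, such a "dumb" PMU could be a struct pmu with only those two
hooks wired up; the name and the function bodies below are illustrative,
not existing code:)

#include <linux/perf_event.h>

static int rdt_mon_event_init(struct perf_event *event);
static void rdt_mon_read(struct perf_event *event);

static struct pmu rdt_mon_pmu = {
        .task_ctx_nr = perf_invalid_context,    /* neither task nor CPU context */
        .event_init  = rdt_mon_event_init,
        .read        = rdt_mon_read,
};

static int rdt_mon_event_init(struct perf_event *event)
{
        if (event->attr.type != rdt_mon_pmu.type)
                return -ENOENT;
        /* resolve the resctrl group passed in by the caller to an RMID here */
        return 0;
}

static void rdt_mon_read(struct perf_event *event)
{
        /* read the group's RMID via IA32_QM_EVTSEL/IA32_QM_CTR and publish
         * the value through event->count */
}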

Task monitoring can be done with resctrl as well by adding the PID to
a new resctl and opening the event on it. And, since we'd allow CLOSID
to be shared between resctrl groups, 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread David Carrillo-Cisneros
On Fri, Jan 20, 2017 at 12:30 AM, Thomas Gleixner  wrote:
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner  wrote:
>> > Above you are talking about the same CLOSID and different RMIDS and not
>> > about changing both.
>>
>> The scenario I talked about implies changing CLOSID without affecting
>> monitoring.
>> It happens when the allocation needs for a thread/cgroup/CPU change
>> dynamically. Forcing to change the RMID together with the CLOSID would
>> give wrong monitoring values unless the old RMID is kept around until
>> becomes free, which is ugly and would waste a RMID.
>
> When the allocation needs for a resource control group change, then we
> simply update the allocation constraints of that group without changing the
> CLOSID. So everything just stays the same.
>
> If you move entities to a different group then of course the CLOSID
> changes and then it's a different story how to deal with monitoring.
>
>> > To gather any useful information for both CPU1 and T1 you need TWO
>> > RMIDs. Everything else is voodoo and crystal ball analysis and we are not
>> > going to support that.
>> >
>>
>> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
>> just because the CLOSID changed is wasteful.
>
> Again, the CLOSID only changes if you move entities to a different resource
> control group and in that case the RMID change is the least of your worries.
>
>> Correct. But there may not be a fixed CLOSID association if loads
>> exhibit dynamic behavior and/or system load changes dynamically.
>
> So, you really want to move entities around between resource control groups
> dynamically? I'm not seeing why you would want to do that, but I'm all ears
> to get educated on that.

No, I don't want to move entities across resource control groups. I
was confused by the idea of CLOSIDs being married to control groups,
but now it is clear even to me that that was never the intention.

Thanks,
David

>
> Thanks,
>
> tglx


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread David Carrillo-Cisneros
On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner  wrote:
>
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> >
> > If resctrl groups could lift the restriction of one resctl per CLOSID,
> > then the user can create many resctrl in the way perf cgroups are
> > created now. The advantage is that there wont be cgroup hierarchy!
> > making things much simpler. Also no need to optimize perf event
> > context switch to make llc_occupancy work.
>
> So if I understand you correctly, then you want a mechanism to have groups
> of entities (tasks, cpus) and associate them to a particular resource
> control group.
>
> So they share the CLOSID of the control group and each entity group can
> have its own RMID.
>
> Now you want to be able to move the entity groups around between control
> groups without losing the RMID associated to the entity group.
>
> So the whole picture would look like this:
>
> rdt ->  CTRLGRP -> CLOSID
>
> mon ->  MONGRP  -> RMID
>
> And you want to move MONGRP from one CTRLGRP to another.

Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
same thing. Details below.

>
> Can you please write up in an abstract way what the design requirements are
> that you need. So far we are talking about implementation details and
> unspecified wishlists, but what we really need is an abstract requirement.

My pleasure:


Design Proposal for Monitoring of RDT Allocation Groups.
---------------------------------------------------------

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible although
useless. A unique CLOSID per CTRLGRP means there cannot be more CTRLGRPs
than physical CLOSIDs, and CLOSIDs are much scarcer than RMIDs.

If we lift the condition of unique CLOSID, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID, and RDT allocation must maintain the schemata
to CLOSID relationship (similarly to what the previous CAT driver used
to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
now: adding an element removes it from its previous CTRLGRP.
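
As a rough sketch of the bookkeeping this implies (single L3 resource,
locking omitted, and the 16-CLOSID limit is only an example value):

#include <linux/errno.h>
#include <linux/types.h>

#define MAX_CLOSIDS     16      /* example; the real count comes from CPUID */

/* One entry per hardware CLOSID: the CBM programmed for it and how many
 * CTRLGRPs currently reference it. */
static struct {
        u32 cbm;
        int refcount;
} closid_map[MAX_CLOSIDS];

/* Return an existing CLOSID whose CBM matches the requested schemata,
 * or claim a free one; -ENOSPC if every CLOSID carries a different schemata. */
static int closid_get(u32 cbm)
{
        int i, free = -1;

        for (i = 0; i < MAX_CLOSIDS; i++) {
                if (closid_map[i].refcount && closid_map[i].cbm == cbm) {
                        closid_map[i].refcount++;
                        return i;
                }
                if (!closid_map[i].refcount && free < 0)
                        free = i;
        }
        if (free < 0)
                return -ENOSPC;
        closid_map[free].cbm = cbm;
        closid_map[free].refcount = 1;
        return free;
}

static void closid_put(int closid)
{
        closid_map[closid].refcount--;
}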


This change would allow further partitioning the allocation groups
into (allocation, monitoring) groups as follows:

With allocation only:
              CTRLGRP0            CTRLGRP_ALLOC_ONLY
  schemata:  L3:0=0xff0              L3:0=0x00f
  tasks:       PID0              P0_0,P0_1,P1_0,P1_1
  cpus:        0x3                      0xC

If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:
              CTRLGRP0      CTRLGRP1      CTRLGRP2      CTRLGRP3
  schemata:  L3:0=0xff0    L3:0=0x00f    L3:0=0x00f    L3:0=0x00f
  tasks:       PID0                      P0_0,P0_1     P1_0,P1_1
  cpus:        0x3            0xC           0x0           0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for (L3,0).


Now we can ask perf to monitor any of the CTRLGRPs independently, once
we solve how to tell perf which (CTRLGRP, resource_id) to monitor.
The perf_event will reserve and assign the RMID to the monitored
CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
(CLOSID and RMID), so perf won't have to.
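
The per-CPU part of that is then a single MSR write at context switch;
a sketch, with task_closid()/task_rmid() as placeholders for wherever
the CTRLGRP state ends up being stored:

#include <linux/sched.h>
#include <asm/msr.h>

#ifndef MSR_IA32_PQR_ASSOC
#define MSR_IA32_PQR_ASSOC      0x0c8f
#endif

/* Placeholders: the CLOSID comes from the task's CTRLGRP, the RMID from
 * the monitored group it belongs to (or the CPU's default). */
static inline u32 task_closid(struct task_struct *p) { return 0; }
static inline u32 task_rmid(struct task_struct *p)   { return 0; }

static void rdt_sched_in(struct task_struct *next)
{
        /* IA32_PQR_ASSOC: bits 9:0 hold the RMID, bits 63:32 the CLOSID,
         * so both are updated with a single wrmsr. */
        wrmsr(MSR_IA32_PQR_ASSOC, task_rmid(next), task_closid(next));
}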

If CTRLGRP's schemata changes, the RDT subsystem will find a new
CLOSID for the new schemata (potentially reusing an existing one) or
fail (just like the old CAT used to). The RMID does not change during
schemata updates.

If a CTRLGRP dies, the monitoring perf_event continues to exist as a
useless wraith, just as happens with cgroup events now.

Since CTRLGRPs have no hierarchy, there is no need to handle that in
the new RDT Monitoring PMU, greatly simplifying it over the previously
proposed versions.

A breaking change in user-observed behavior with respect to the
existing CQM PMU is that there wouldn't be task events. A task must be
part of a CTRLGRP and events are created per (CTRLGRP, resource_id)
pair. If a user wants to monitor a task across multiple resources
(e.g. l3_occupancy across two packages), she must create one event per
resource_id and add the two counts.

I see this breaking change as an improvement, since hiding the cache
topology from user space introduced lots of ugliness and complexity to
the CQM PMU without improving accuracy over having user space add up
the events.

Implementation ideas:

A first idea is to expose one monitoring file per resource in a CTRLGRP,
so the list of a CTRLGRP's files would be: schemata, tasks, cpus,
monitor_l3_0, monitor_l3_1, ...

The monitor_<resource_id> file descriptor is passed to perf_event_open
in the way cgroup file descriptors are passed now. All events for the
same (CTRLGRP, resource_id) share an RMID.

The RMID allocation part can either be handled by RDT allocation or by
the RDT Monitoring PMU. Either way, it is the existence of the PMU's
perf_events that allocates/releases the RMID.
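
A rough sketch of that lifetime rule (locking omitted; NR_RMIDS and the
struct are made up for the example, and RMID 0 stays reserved as the
"not monitored" default):

#include <linux/bitmap.h>
#include <linux/errno.h>

#define NR_RMIDS        256     /* example; the real count comes from CPUID */

static DECLARE_BITMAP(rmid_bitmap, NR_RMIDS);

static int rmid_alloc(void)
{
        /* start at 1: RMID 0 is the default for unmonitored contexts */
        int rmid = find_next_zero_bit(rmid_bitmap, NR_RMIDS, 1);

        if (rmid >= NR_RMIDS)
                return -ENOSPC;
        set_bit(rmid, rmid_bitmap);
        return rmid;
}

static void rmid_free(int rmid)
{
        clear_bit(rmid, rmid_bitmap);
}

/* Per (CTRLGRP, resource_id) monitoring state: the first perf_event
 * opened on it grabs an RMID, the last one to go away releases it. */
struct mon_state {
        int rmid;
        int nr_events;
};

static int mon_state_get(struct mon_state *m)
{
        if (!m->nr_events) {
                int rmid = rmid_alloc();

                if (rmid < 0)
                        return rmid;
                m->rmid = rmid;
        }
        m->nr_events++;
        return 0;
}

static void mon_state_put(struct mon_state *m)
{
        if (!--m->nr_events)
                rmid_free(m->rmid);
}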

Also, since this new design removes hierarchy and task events, it
allows for a simple solution of the RMID rotation problem. 


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Stephane Eranian
On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
 wrote:
>
> Resending including Thomas , also with some changes. Sorry for the spam
>
> Based on Thomas and Peterz feedback Can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group.
> user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpu)
>
> -Also allows monitoring outside of resctrl so that user can
> monitor subgroups who use the same closid. This mode can be used
> when user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, second version
> builds a new interface resmon. The first version is close to the patches
> sent with some additions/changes. This includes details of the design as
> per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying the resctrl and using perf
> 
> 
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same)
>
>
> Monitor cqm using perf
> --
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> user can monitor the cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
Presented this way, this does not quite address the use case I
described earlier here.
We want to be able to monitor the cgroup allocations from first thread
creation. What you have above has a large gap. Many apps do allocations
as their very first steps, so if you do:
$ my_test_prg &
[1456]
$ echo 1456 >/sys/fs/cgroup/perf_event/g2/tasks
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2

You have a race. But if you allow:

$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2 (i.e, on an empty cgroup)
$ echo $$ >/sys/fs/cgroup/perf_event/g2/tasks (put shell in cgroup, so
my_test_prg runs immediately in the cgroup)
$ my_test_prg &

Then there is a way to avoid the gap.

>
> To monitor a resctrl group, user can group the same tasks in resctrl
> group into the cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where:
> if the groups intersect, then there is no way to know what
> l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an RMID
> for each task which can as well be done with the -t option.
>
> Monitoring cqm cgroups Implementation
> -
>
> When monitoring two different cgroups in the same hierarchy (ex say g11
> has an ancestor g1 which are both being monitored as shown below) we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring for g1 llc_occupancy we cannot write two different RMIDs
> (because we need to count for g11 as well)
> during context switch to measure the occupancy for both g1 and g11.
> Hence the driver maintains this information and writes the RMID of the
> lowest member in the ancestory which is being monitored during ctx
> switch.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same
> information while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
> void *cqm_info;
> #endif
> ...
>
>  }
>
> struct cqm_info {
>   bool mon_enabled;
>   int level;
>   u32 *rmid;
>   struct cgrp_cqm_info *mfa;
>   struct list_head tskmon_rlist;
>  };
>
> Due to the hierarchical nature of cgroups, every cgroup just
> monitors for the 'nearest monitored ancestor' at all times.
> Since root cgroup is always monitored, all descendents
> at boot time monitor for root and hence all mfa points to root
> except for root->mfa which is NULL.
>
> 1. RMID setup: When cgroup x start monitoring:
>for each descendent y, if y's mfa->level < x->level, then
>y->mfa = x. (Where level of root node = 0...)
> 2. sched_in: During sched_in for x
>  


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Thomas Gleixner
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> 
> If resctrl groups could lift the restriction of one resctl per CLOSID,
> then the user can create many resctrl in the way perf cgroups are
> created now. The advantage is that there wont be cgroup hierarchy!
> making things much simpler. Also no need to optimize perf event
> context switch to make llc_occupancy work.

So if I understand you correctly, then you want a mechanism to have groups
of entities (tasks, cpus) and associate them to a particular resource
control group.

So they share the CLOSID of the control group and each entity group can
have its own RMID.

Now you want to be able to move the entity groups around between control
groups without losing the RMID associated to the entity group.

So the whole picture would look like this:

rdt ->  CTRLGRP -> CLOSID

mon ->  MONGRP  -> RMID
   
And you want to move MONGRP from one CTRLGRP to another.
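
In code terms that picture is roughly the following (purely illustrative,
the names are made up):

struct ctrlgrp {                        /* resource control group */
        int             closid;         /* allocation class */
        /* schemata, tasks, cpus, ... */
};

struct mongrp {                         /* monitoring group */
        int             rmid;           /* stays with the group */
        struct ctrlgrp  *ctrl;          /* CTRLGRP it currently follows */
        /* tasks, cpus, ... */
};

/* Moving a MONGRP between CTRLGRPs changes the CLOSID its members
 * inherit, but not its RMID. */
static void mongrp_move(struct mongrp *m, struct ctrlgrp *to)
{
        m->ctrl = to;
}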

Can you please write up in an abstract way what the design requirements are
that you need. So far we are talking about implementation details and
unspecified wishlists, but what we really need is an abstract requirement.

Thanks,

tglx


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-20 Thread Thomas Gleixner
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner  wrote:
> > Above you are talking about the same CLOSID and different RMIDS and not
> > about changing both.
> 
> The scenario I talked about implies changing CLOSID without affecting
> monitoring.
> It happens when the allocation needs for a thread/cgroup/CPU change
> dynamically. Forcing to change the RMID together with the CLOSID would
> give wrong monitoring values unless the old RMID is kept around until
> becomes free, which is ugly and would waste a RMID.

When the allocation needs for a resource control group change, then we
simply update the allocation constraints of that group without changing the
CLOSID. So everything just stays the same.

If you move entities to a different group then of course the CLOSID
changes and then it's a different story how to deal with monitoring.

> > To gather any useful information for both CPU1 and T1 you need TWO
> > RMIDs. Everything else is voodoo and crystal ball analysis and we are not
> > going to support that.
> >
> 
> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
> just because the CLOSID changed is wasteful.

Again, the CLOSID only changes if you move entities to a different resource
control group and in that case the RMID change is the least of your worries.

> Correct. But there may not be a fixed CLOSID association if loads
> exhibit dynamic behavior and/or system load changes dynamically.

So, you really want to move entities around between resource control groups
dynamically? I'm not seeing why you would want to do that, but I'm all ears
to get educated on that.
 
Thanks,

tglx


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread David Carrillo-Cisneros
On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa
 wrote:
> Resending including Thomas , also with some changes. Sorry for the spam
>
> Based on Thomas and Peterz feedback Can think of two design
> variants which target:
>
> -Support monitoring and allocating using the same resctrl group.
> user can use a resctrl group to allocate resources and also monitor
> them (with respect to tasks or cpu)
>
> -Also allows monitoring outside of resctrl so that user can
> monitor subgroups who use the same closid. This mode can be used
> when user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, second version
> builds a new interface resmon.

The second version would require building a whole new set of tools,
deploying them and maintaining them. Users would have to run perf for certain
events and resmon (or whatever the new tool is named) for RDT. I see
it as too complex and much prefer to keep using perf.

> The first version is close to the patches
> sent with some additions/changes. This includes details of the design as
> per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying the resctrl and using perf
> 
> 
>
> In this design everything in resctrl interface works like
> before (the info, resource group files like task schemata all remain the
> same)
>
>
> Monitor cqm using perf
> --
>
> perf can monitor individual tasks using the -t
> option just like before.
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> user can monitor the cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>
> To monitor a resctrl group, user can group the same tasks in resctrl
> group into the cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where:
> if the groups intersect, then there is no way to know what
> l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an RMID
> for each task which can as well be done with the -t option.

That's simply recreating the resctrl group as a cgroup.

I think that the main advantage of doing allocation first is that we
could use the context switch in rdt allocation and greatly simplify
the pmu side of it.

If resctrl groups could lift the restriction of one resctrl group per CLOSID,
then the user can create many resctrl groups in the way perf cgroups are
created now. The advantage is that there won't be a cgroup hierarchy,
making things much simpler. Also no need to optimize the perf event
context switch to make llc_occupancy work.

Then we only need a way to express to the perf_event_open syscall that
monitoring must happen in a resctrl group.

My first thought is to have a "rdt_monitor" file per resctrl group. A
user passes it to perf_event_open in the way cgroups are passed now.
We could extend the meaning of the flag PERF_FLAG_PID_CGROUP to also
cover rdt_monitor files. The syscall can figure out whether it's a cgroup
or an rdt group. The rdt_monitoring PMU would only work with rdt_monitor
groups.
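
As a rough user space sketch of that idea (the /sys/fs/resctrl/p1/rdt_monitor
path, the PMU type and the event encoding are all assumptions here, and cpu
is picked the way cgroup events require it today):

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

int main(void)
{
        struct perf_event_attr attr;
        unsigned long long count;
        int grp_fd, ev_fd;

        /* fd of the (hypothetical) per-group monitoring file */
        grp_fd = open("/sys/fs/resctrl/p1/rdt_monitor", O_RDONLY);
        if (grp_fd < 0)
                return 1;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_RAW;      /* placeholder for the rdt_monitoring PMU type */
        attr.config = 1;                /* placeholder for llc_occupancy */

        /* pid = group fd, just like a cgroup fd; group_fd = -1 */
        ev_fd = syscall(__NR_perf_event_open, &attr, grp_fd, 0, -1,
                        PERF_FLAG_PID_CGROUP);
        if (ev_fd < 0)
                return 1;

        if (read(ev_fd, &count, sizeof(count)) == sizeof(count))
                printf("llc_occupancy: %llu\n", count);

        close(ev_fd);
        close(grp_fd);
        return 0;
}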

Then the rdt_monitoring PMU will be pretty dumb, having neither task
nor CPU contexts, just providing the pmu->read and pmu->event_init
functions.

Task monitoring can be done with resctrl as well by adding the PID to
a new resctrl group and opening the event on it. And, since we'd allow
CLOSID to be shared between resctrl groups, allocation wouldn't break.

It's a first idea, so please don't hate too hard ;).

David

>
> Monitoring cqm cgroups Implementation
> -
>
> When monitoring two different cgroups in the same hierarchy (ex say g11
> has an ancestor g1 which are both being monitored as shown below) we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring for g1 llc_occupancy we cannot write two different RMIDs
> (because we need to count for g11 as well)
> during context switch to measure the occupancy for both g1 and g11.
> Hence the driver maintains this information and writes the RMID of the
> lowest member in the ancestory which is being monitored during ctx
> 


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread David Carrillo-Cisneros
On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner  wrote:
> On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote:
>> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner  wrote:
>> There are use cases where the RMID to CLOSID mapping is not that simple.
>> Some of them are:
>>
>> 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread
>> during phases that initialize relevant data, while changing it to another 
>> during
>> phases that pollute cache. Yet, we want the RMID to remain the same.
>
> That's fine. I did not say that you need fixed RMID <-> CLOSID mappings. The
> point is that monitoring across different CLOSID domains is pointless.
>
> I have no idea how you want to do that with the proposed implementation to
> switch the RMID of the thread on the fly, but that's a different story.
>
>> A different variation is to change CLOSID to increase/decrease the size of 
>> the
>> allocated cache when high/low contention is detected.
>>
>> 2. Contention detection. I start with:
>>- T1 has RMID 1.
>>- T1 changes RMID to 2.
>>  will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases.
>
> Of course RMID1 decreases because it's no longer in use. Oh well.
>
>> The rate of change will be relative to the level of cache contention present
>> at the time. This all happens without changing the CLOSID.
>
> See above.
>
>> >
>> > So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
>> > care at all about the occupancy of T1 simply because that is running on a
>> > seperate reservation.
>>
>> It is not useless for scenarios where CLOSID and RMIDs change dynamically
>> See above.
>
> Above you are talking about the same CLOSID and different RMIDS and not
> about changing both.

The scenario I talked about implies changing the CLOSID without affecting
monitoring.
It happens when the allocation needs for a thread/cgroup/CPU change
dynamically. Forcing a change of the RMID together with the CLOSID would
give wrong monitoring values unless the old RMID is kept around until it
becomes free, which is ugly and would waste an RMID.

>
>> > Trying to make that an aggregated value in the first
>> > place is completely wrong. If you want an aggregate, which is pretty much
>> > useless, then user space tools can generate it easily.
>>
>> Not useless, see above.
>
> It is pretty useless, because CPU4 has CLOSID1 while T1 has CLOSID4 and
> making an aggregate over those two has absolutely nothing to do with your
> scenario above.

That's true. It is useless in the case you mentioned. I erroneously
interpreted the "useless" in your comment as a general statement about
aggregating RMID occupancies.

>
> If you want the aggregate value, then create it in user space and oracle
> (or should I say google) out of it whatever you want, but do not impose
> that to the kernel.
>
>> Having user space tools to aggregate implies wasting some of the already
>> scarce RMIDs.
>
> Oh well. Can you please explain how you want to monitor the scenario I
> explained above:
>
> CPU4  CLOSID 1
> T1    CLOSID 4
>
> So if T1 runs on CPU4 then it uses CLOSID 4 which does not at all affect
> the cache occupancy of CLOSID 1. So if you use the same RMID then you
> pollute either the information of CPU4 (CLOSID1) or the information of T1
> (CLOSID4)
>
> To gather any useful information for both CPU1 and T1 you need TWO
> RMIDs. Everything else is voodoo and crystal ball analysis and we are not
> going to support that.
>

Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
just because the CLOSID changed is wasteful.

>> > The whole approach you and David have taken is to whack some desired cgroup
>> > functionality and whatever into CQM without rethinking the overall
>> > design. And that's fundamentaly broken because it does not take cache (and
>> > memory bandwidth) allocation into account.
>>
>> Monitoring and allocation are closely related yet independent.
>
> Independent to some degree. Sure you can claim they are completely
> independent, but lots of the resulting combinations make absolutely no
> sense at all. And we really don't want to support non-sensical measurements
> just because we can. The outcome of this is complexity, inaccuracy and code
> which is too horrible to look at.
>
>> I see the advantages of allowing a per-cpu RMID as you describe in the 
>> example.
>>
>> Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond
>> one simply monitoring occupancy per allocation.
>
> I agree there are use cases where you want to monitor across allocations,
> like monitoring a task which has no CLOSID assigned and runs on different
> CPUs and therefor potentially on different CLOSIDs which are assigned to
> the different CPUs.
>
> That's fine and you want a seperate RMID for this.
>
> But once you have a fixed CLOSID association then reusing and aggregating
> across CLOSID domains is more than useless.
>

Correct. But there may not be a fixed CLOSID association if loads
exhibit dynamic behavior and/or system load changes dynamically.


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Vikas Shivappa
Resending including Thomas, also with some changes. Sorry for the spam.

Based on Thomas and Peterz feedback, I can think of two design
variants which target:

-Support monitoring and allocating using the same resctrl group.
The user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpus).

-Also allow monitoring outside of resctrl so that the user can
monitor subgroups which use the same CLOSID. This mode can be used
when the user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup, the second version
builds a new interface, resmon. The first version is close to the patches
sent earlier, with some additions/changes. This includes details of the
design as per Thomas/Peterz feedback.

1> First Design option: without modifying the resctrl and using perf



In this design everything in resctrl interface works like
before (the info, resource group files like task schemata all remain the
same)


Monitor cqm using perf
--

perf can monitor individual tasks using the -t
option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

Below shows how user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, the user can group the same tasks from the
resctrl group into a cgroup.

To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
group p1 to cgroup g1

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring because
supporting cgroup 'task groups' and resctrl 'task groups' leads to
situations where, if the groups intersect, there is no way to know which
l3_allocations contribute to which group.

ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for g1 and p1 would be to allocate an RMID
for each task which can as well be done with the -t option.

Monitoring cqm cgroups Implementation
-------------------------------------

When monitoring two different cgroups in the same hierarchy (say g11
has an ancestor g1 and both are being monitored, as shown below) we
need the g11 counts to be considered for g1 as well.

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11

When measuring llc_occupancy for g1 we cannot write two different RMIDs
during context switch to measure the occupancy for both g1 and g11
(because we need to count for g11 as well).
Hence the driver maintains this information and, during context switch,
writes the RMID of the lowest member in the ancestry which is being
monitored.

The cqm_info is added to the perf_cgroup structure to maintain this
information. The structure is allocated and destroyed at css_alloc and
css_free. All the events tied to a cgroup can use the same
information while reading the counts.

struct perf_cgroup {
#ifdef CONFIG_INTEL_RDT_M
void *cqm_info;
#endif
...

 }

struct cqm_info {
  bool mon_enabled;               /* is this cgroup itself monitored? */
  int level;                      /* depth in the cgroup hierarchy */
  u32 *rmid;                      /* RMID(s) used when this cgroup is monitored */
  struct cgrp_cqm_info *mfa;      /* nearest monitored ancestor */
  struct list_head tskmon_rlist;  /* tasks monitored within this cgroup */
 };

Due to the hierarchical nature of cgroups, every cgroup just
monitors for the 'nearest monitored ancestor' at all times.
Since the root cgroup is always monitored, all descendants
at boot time monitor for root and hence every mfa points to root,
except for root->mfa which is NULL.

1. RMID setup: When cgroup x start monitoring:
   for each descendent y, if y's mfa->level < x->level, then
   y->mfa = x. (Where level of root node = 0...)
2. sched_in: During sched_in for x
   if (x->mon_enabled) choose x->rmid
 else choose x->mfa->rmid.
3. read: for each descendent of cgroup x
   if (x->monitored) count += rmid_read(x->rmid).
4. evt_destroy: for each descendent y of x, if (y->mfa == x)
   then y->mfa = x->mfa. Meaning if any descendent was monitoring for
   x, set that descendent to monitor for the cgroup which x was
   monitoring for.
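
As a sketch of steps 2 and 3 above (the struct below is a minimal stand-in
for cqm_info/cgrp_cqm_info, rmid_read() is the usual QM_EVTSEL/QM_CTR
sequence, and the per-descendant walk of step 3 is left out):

#include <linux/types.h>
#include <asm/msr.h>

#ifndef MSR_IA32_QM_EVTSEL
#define MSR_IA32_QM_EVTSEL      0x0c8d
#define MSR_IA32_QM_CTR         0x0c8e
#endif
#define QOS_L3_OCCUP_EVENT_ID   0x01
#define RMID_VAL_ERROR          (1ULL << 63)
#define RMID_VAL_UNAVAIL        (1ULL << 62)

struct cgrp_cqm_info {
        bool mon_enabled;
        u32 *rmid;
        struct cgrp_cqm_info *mfa;      /* nearest monitored ancestor */
};

/* Step 2: a monitored cgroup uses its own RMID, everything else falls
 * back to the RMID of its nearest monitored ancestor. */
static u32 sched_in_rmid(struct cgrp_cqm_info *ci)
{
        return ci->mon_enabled ? *ci->rmid : *ci->mfa->rmid;
}

/* Step 3 sums this over x and each monitored descendant of x. */
static u64 rmid_read(u32 rmid)
{
        u64 val;

        wrmsr(MSR_IA32_QM_EVTSEL, QOS_L3_OCCUP_EVENT_ID, rmid);
        rdmsrl(MSR_IA32_QM_CTR, val);
        if (val & (RMID_VAL_ERROR | RMID_VAL_UNAVAIL))
                return 0;
        return val;     /* scaled by the hardware upscaling factor */
}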

To monitor a task in a cgroup x along with monitoring cgroup x itself,
cqm_info maintains a list of the tasks that are being monitored in the
cgroup.

When a task which belongs to a cgroup x is being monitored, it
always uses its own task->rmid even if cgroup x is monitored during sched_in.
To account for the counts of such tasks, the cgroup keeps this list
and walks it during read.
taskmon_rlist is used to maintain the list. The list is modified when a
task is attached to the cgroup or removed from it.

Example 1 (Some examples modeled from resctrl ui documentation)
-

A single 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Vikas Shivappa
Resending including Thomas , also with some changes. Sorry for the spam

Based on Thomas and Peterz feedback Can think of two design 
variants which target:

-Support monitoring and allocating using the same resctrl group.
user can use a resctrl group to allocate resources and also monitor
them (with respect to tasks or cpu)

-Also allows monitoring outside of resctrl so that user can
monitor subgroups who use the same closid. This mode can be used
when user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup, second version
builds a new interface resmon. The first version is close to the patches
sent with some additions/changes. This includes details of the design as
per Thomas/Peterz feedback.

1> First Design option: without modifying the resctrl and using perf



In this design everything in resctrl interface works like
before (the info, resource group files like task schemata all remain the
same)


Monitor cqm using perf
--

perf can monitor individual tasks using the -t
option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

Below shows how user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, user can group the same tasks in resctrl
group into the cgroup.

To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
group p1 to cgroup g1

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring because
supporting cgroup 'task groups' and resctrl 'task groups' leads to
situations where:
if the groups intersect, then there is no way to know what
l3_allocations contribute to which group.

ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for g1 and p1 would be to allocate an RMID
for each task which can as well be done with the -t option.

Monitoring cqm cgroups Implementation
-

When monitoring two different cgroups in the same hierarchy (ex say g11
has an ancestor g1 which are both being monitored as shown below) we
need the g11 counts to be considered for g1 as well. 

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11

When measuring for g1 llc_occupancy we cannot write two different RMIDs
(because we need to count for g11 as well)
during context switch to measure the occupancy for both g1 and g11.
Hence the driver maintains this information and writes the RMID of the
lowest member in the ancestory which is being monitored during ctx
switch.

The cqm_info is added to the perf_cgroup structure to maintain this
information. The structure is allocated and destroyed at css_alloc and
css_free. All the events tied to a cgroup can use the same
information while reading the counts.

struct perf_cgroup {
#ifdef CONFIG_INTEL_RDT_M
void *cqm_info;
#endif
...

 }

struct cqm_info {
  bool mon_enabled;
  int level;
  u32 *rmid;
  struct cgrp_cqm_info *mfa;
  struct list_head tskmon_rlist;
 };

Due to the hierarchical nature of cgroups, every cgroup simply monitors
for its 'nearest monitored ancestor' at all times. Since the root cgroup
is always monitored, at boot time all descendants monitor for root, and
hence every mfa points to root, except for root->mfa which is NULL.
(A rough C sketch of this bookkeeping follows the numbered steps below.)

1. RMID setup: when cgroup x starts monitoring:
   for each descendant y, if y->mfa->level < x->level, then
   y->mfa = x. (Where the level of the root node = 0...)
2. sched_in: during sched_in for x,
   if (x->mon_enabled) choose x->rmid,
   else choose x->mfa->rmid.
3. read: for each descendant of cgroup x,
   if (x->monitored) count += rmid_read(x->rmid).
4. evt_destroy: for each descendant y of x, if (y->mfa == x),
   then y->mfa = x->mfa. Meaning: if any descendant was monitoring for
   x, set that descendant to monitor for the cgroup which x was
   monitoring for.
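A sketch of steps 1, 2 and 4 in C, under the assumption that cqm_info and
cgrp_cqm_info above are the same structure; for_each_descendant() stands in
for the perf_cgroup hierarchy walk and is not real kernel API:

/* 1. RMID setup: cgroup x starts monitoring */
static void cqm_start_monitoring(struct cqm_info *x)
{
	struct cqm_info *y;

	x->mon_enabled = true;
	for_each_descendant(y, x)		/* hypothetical descendant walk */
		if (y->mfa->level < x->level)
			y->mfa = x;		/* x is now y's nearest monitored ancestor */
}

/* 2. sched_in: pick the RMID to write for a task in cgroup x */
static u32 cqm_sched_in_rmid(struct cqm_info *x)
{
	/* rmid is a pointer in the struct above; entry 0 used for brevity */
	return x->mon_enabled ? x->rmid[0] : x->mfa->rmid[0];
}

/* 4. evt_destroy: descendants that monitored for x fall back to x's mfa */
static void cqm_stop_monitoring(struct cqm_info *x)
{
	struct cqm_info *y;

	x->mon_enabled = false;
	for_each_descendant(y, x)
		if (y->mfa == x)
			y->mfa = x->mfa;
}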

To monitor a task in a cgroup x along with monitoring cgroup x itself,
cqm_info maintains a list of the tasks that are being monitored in the
cgroup.

When a task which belongs to cgroup x is being monitored, it always uses
its own task->rmid during sched_in, even if cgroup x is monitored. To
account for the counts of such tasks, the cgroup keeps this list and walks
it during read; tskmon_rlist is used to maintain the list. The list is
modified when a task is attached to or removed from the cgroup. (A short
sketch of the read path follows.)
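A short sketch of that read path; the tskmon_rlist node layout is an
assumption, and rmid_read() is the helper named in step 3 above (which
additionally walks the monitored descendants):

/* one separately monitored task hanging off a cgroup's tskmon_rlist */
struct tskmon_node {
	u32 rmid;
	struct list_head list;
};

static u64 cqm_read_cgroup(struct cqm_info *x)
{
	struct tskmon_node *t;
	u64 count = 0;

	if (x->mon_enabled)
		count += rmid_read(x->rmid[0]);		/* the cgroup's own RMID */

	list_for_each_entry(t, &x->tskmon_rlist, list)
		count += rmid_read(t->rmid);		/* tasks using their own task->rmid */

	return count;
}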

Example 1 (Some examples modeled from resctrl ui documentation)
-

A single 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Shivappa Vikas


Hello Peterz,

On Wed, 18 Jan 2017, Peter Zijlstra wrote:


On Wed, Jan 18, 2017 at 09:53:02AM +0100, Thomas Gleixner wrote:

The whole approach you and David have taken is to whack some desired cgroup
functionality and whatever into CQM without rethinking the overall
design. And that's fundamentally broken because it does not take cache (and
memory bandwidth) allocation into account.

I seriously doubt, that the existing CQM/MBM code can be refactored in any
useful way. As Peter Zijlstra said before: Remove the existing cruft
completely and start with completely new design from scratch.

And this new design should start from the allocation angle and then add the
whole other muck on top so far its possible. Allocation related monitoring
must be the primary focus, everything else is just tinkering.


Agreed, the little I have seen of these patches is quite horrible. And
there seems to be a definite lack of design; or at the very least an
utter lack of communication of it.


The 1/12 Documentation patch describes the interface. Basically we are just 
trying to support task and cgroup monitoring.


By 'design document', do you mean a document describing how we enable the 
cgroup for CQM, since it's a special case?
(This would include all the arch_info we add to the perf_cgroup to keep track 
of the hierarchy in the driver, etc.)


Thanks,
Vikas



The approach, in so far that I could make sense of it, seems to utterly
rape perf-cgroup. I think Thomas makes a sensible point in trying to
match it to the CAT stuffs.




Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Thomas Gleixner
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> A 1:1 mapping between CLOSID/"Resource group" to RMID, as Fenghua suggested
> is very problematic because the number of CLOSIDs is much much smaller than 
> the
> number of RMIDs, and, as Stephane mentioned it's a common use case to want to
> independently monitor many task/cgroups inside an allocation partition.

Again, that was not my intention. I just want to limit the combinations.

> A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement of
> cgroup monitoring but the case where CLOSID changes would be messy. In

CLOSIDs of RDT groups do not change. They are allocated when the group is
created.

Thanks,

tglx



Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Thomas Gleixner
On Wed, 18 Jan 2017, Stephane Eranian wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner  wrote:
> >

> Your use case is specific to HPC and not Web workloads we run.  Jobs run
> in cgroups which may span all the CPUs of the machine.  CAT may be used
> to partition the cache. Cgroups would run inside a partition.  There may
> be multiple cgroups running in the same partition. I can understand the
> value of tracking occupancy per CLOSID, however that granularity is not
> enough for our use case.  Inside a partition, we want to know the
> occupancy of each cgroup to be able to assign blame to the top
> consumer. Thus, there needs to be a way to monitor occupancy per
> cgroup. I'd like to understand how your proposal would cover this use
> case.

The point I'm making as I explained to David is that we need to start from
the allocation angle. Of course you can monitor different tasks or task
groups inside an allocation.

> Another important aspect is that CQM measures new allocations, thus to
> get total occupancy you need to be able to monitor the thread, CPU,
> CLOSid or cgroup from the beginning of execution. In the case of a cgroup
> from the moment where the first thread is scheduled into the cgroup. To
> do this a RMID needs to be assigned from the beginning to the entity to
> be monitored.  It could be by creating a CQM event just to cause an RMID
> to be assigned as discussed earlier on this thread. And then if a perf
> stat is launched later it will get the same RMID and report full
> occupancy. But that requires the first event to remain alive, i.e., some
> process must keep the file descriptor open, i.e., need some daemon or a
> perf stat running in the background.

That's fine, but there must be a less convoluted way to do that. The
currently proposed stuff is simply horrible because it lacks any form of
design and is just hacked into submission.
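For reference, a minimal userspace sketch of the "keep the file descriptor
open" workaround described above. The sysfs path, the cgroup name g1 and the
event encoding (config = 1 for llc_occupancy) are assumptions taken from the
examples earlier in the thread; error handling is trimmed. It is roughly what
a background perf stat session does for you:

#include <fcntl.h>
#include <linux/perf_event.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
	struct perf_event_attr attr;
	FILE *f;
	int type, cgrp_fd, ev_fd;

	/* dynamic PMU type of intel_cqm, published in sysfs */
	f = fopen("/sys/bus/event_source/devices/intel_cqm/type", "r");
	if (!f || fscanf(f, "%d", &type) != 1)
		return 1;
	fclose(f);

	memset(&attr, 0, sizeof(attr));
	attr.size = sizeof(attr);
	attr.type = type;
	attr.config = 1;		/* assumed encoding of llc_occupancy */

	/* monitor cgroup g1: PERF_FLAG_PID_CGROUP takes a cgroup directory fd
	 * plus a concrete CPU (a real tool opens one event per CPU) */
	cgrp_fd = open("/sys/fs/cgroup/perf_event/g1", O_RDONLY);
	ev_fd = syscall(__NR_perf_event_open, &attr, cgrp_fd, 0, -1,
			PERF_FLAG_PID_CGROUP);
	if (cgrp_fd < 0 || ev_fd < 0)
		return 1;

	pause();			/* keep the event, and thus the RMID, alive */
	return 0;
}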

> There are also use cases where you want CQM without necessarily enabling
> CAT, for instance, if you want to know the cache footprint of a workload
> to estimate whether it could be co-located with others.

That's a subset of the other stuff because it's all bound to CLOSID 0. So
you can again monitor tasks or task groups separately.

Thanks,

tglx


Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread Thomas Gleixner
On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner  wrote:
> There are use cases where the RMID to CLOSID mapping is not that simple.
> Some of them are:
>
> 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread
> during phases that initialize relevant data, while changing it to another 
> during
> phases that pollute cache. Yet, we want the RMID to remain the same.

That's fine. I did not say that you need fixed RMID <-> CLOSID mappings. The
point is that monitoring across different CLOSID domains is pointless.

I have no idea how you want to do that with the proposed implementation to
switch the RMID of the thread on the fly, but that's a different story.

> A different variation is to change CLOSID to increase/decrease the size of the
> allocated cache when high/low contention is detected.
> 
> 2. Contention detection. I start with:
>- T1 has RMID 1.
>- T1 changes RMID to 2.
>  will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases.

Of course RMID 1 decreases because it's no longer in use. Oh well.

> The rate of change will be relative to the level of cache contention present
> at the time. This all happens without changing the CLOSID.

See above.

> >
> > So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
> > care at all about the occupancy of T1 simply because that is running on a
> > separate reservation.
> 
> It is not useless for scenarios where CLOSID and RMIDs change dynamically
> See above.

Above you are talking about the same CLOSID and different RMIDs and not
about changing both.

> > Trying to make that an aggregated value in the first
> > place is completely wrong. If you want an aggregate, which is pretty much
> > useless, then user space tools can generate it easily.
> 
> Not useless, see above.

It is pretty useless, because CPU4 has CLOSID 1 while T1 has CLOSID 4 and
making an aggregate over those two has absolutely nothing to do with your
scenario above.

If you want the aggregate value, then create it in user space and oracle
(or should I say google) out of it whatever you want, but do not impose
that on the kernel.

> Having user space tools to aggregate implies wasting some of the already
> scarce RMIDs.

Oh well. Can you please explain how you want to monitor the scenario I
explained above:

CPU4  CLOSID 1
T1CLOSID 4

So if T1 runs on CPU4 then it uses CLOSID 4, which does not at all affect
the cache occupancy of CLOSID 1. So if you use the same RMID then you
pollute either the information of CPU4 (CLOSID 1) or the information of T1
(CLOSID 4).

To gather any useful information for both CPU4 and T1 you need TWO
RMIDs. Everything else is voodoo and crystal ball analysis and we are not
going to support that.
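
To make the two-RMID point concrete, a sketch of the RMID choice at context
switch; the per-task rmid field and the per-cpu variable are illustrative
assumptions, not proposed code:

/*
 * With two RMIDs, T1's cache lines are tagged with its own RMID while
 * everything else on CPU4 is tagged with the CPU's RMID, so the
 * occupancy of CLOSID 4 and CLOSID 1 stays separable.
 */
DEFINE_PER_CPU(u32, cpu_rmid);			/* RMID of the CPU's reservation */

static u32 pick_rmid(struct task_struct *p, int cpu)
{
	if (p->rmid)				/* task monitored in its own right (T1) */
		return p->rmid;
	return per_cpu(cpu_rmid, cpu);		/* everything else: the CPU's RMID */
}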
 
> > The whole approach you and David have taken is to whack some desired cgroup
> > functionality and whatever into CQM without rethinking the overall
> > design. And that's fundamentally broken because it does not take cache (and
> > memory bandwidth) allocation into account.
> 
> Monitoring and allocation are closely related yet independent.

Independent to some degree. Sure you can claim they are completely
independent, but lots of the resulting combinations make absolutely no
sense at all. And we really don't want to support nonsensical measurements
just because we can. The outcome of this is complexity, inaccuracy and code
which is too horrible to look at.

> I see the advantages of allowing a per-cpu RMID as you describe in the 
> example.
> 
> Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond
> one simply monitoring occupancy per allocation.

I agree there are use cases where you want to monitor across allocations,
like monitoring a task which has no CLOSID assigned and runs on different
CPUs and therefore potentially on different CLOSIDs which are assigned to
the different CPUs.

That's fine, and you want a separate RMID for this.

But once you have a fixed CLOSID association then reusing and aggregating
across CLOSID domains is more than useless.

> > I seriously doubt, that the existing CQM/MBM code can be refactored in any
> > useful way. As Peter Zijlstra said before: Remove the existing cruft
> > completely and start with completely new design from scratch.
> >
> > And this new design should start from the allocation angle and then add the
> > whole other muck on top so far its possible. Allocation related monitoring
> > must be the primary focus, everything else is just tinkering.
> 
> Assuming that my stated need for more than one RMID per CLOSID or more
> than one CLOSID per RMID is recognized, what would be the advantage of
> starting the design of monitoring from the allocation perspective?
>
> It's quite doable to create a new version of CQM/CMT without all the
> cgroup murk.
>
> We can also create an easy way to open events to monitor CLOSIDs. Yet, I
> don't see the advantage of 

Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-19 Thread David Carrillo-Cisneros
On Wed, Jan 18, 2017 at 6:09 PM, David Carrillo-Cisneros wrote:
> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner  wrote:
>> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
>>> On Tue, 17 Jan 2017, Thomas Gleixner wrote:
>>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
>>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just 
>>> > > prints
>>> > > zeros or arbitrary numbers.
>>> > >
>>> > > Fix: Patches fix this by just throwing an error if the mode is not
>>> > > supported.
>>> > > The modes supported is task monitoring and cgroup monitoring.
>>> > > Also the per package
>>> > > data for say socket x is returned with the -C  -G cgrpy
>>> > > option.
>>> > > The systemwide data can be looked up by monitoring root cgroup.
>>> >
>>> > Fine. That just lacks any comment in the implementation. Otherwise I would
>>> > not have asked the question about cpu monitoring. Though I fundamentally
>>> > hate the idea of requiring cgroups for this to work.
>>> >
>>> > If I just want to look at CPU X why on earth do I have to set up all that
>>> > cgroup muck? Just because your main focus is cgroups?
>>>
>>> The upstream per cpu data is broken because it's not overriding the other 
>>> task
>>> event RMIDs on that cpu with the cpu event RMID.
>>>
>>> Can be fixed by adding a percpu struct to hold the RMID that's affinitized
>>> to the cpu, however then we miss all the task llc_occupancy in that - still
>>> evaluating it.
>>
>> The point here is that CQM is closely connected to the cache allocation
>> technology. After a lengthy discussion we ended up having
>>
>>   - per cpu CLOSID
>>   - per task CLOSID
>>
>> where all tasks which do not have a CLOSID assigned use the CLOSID which is
>> assigned to the CPU they are running on.
>>
>> So if I configure a system by simply partitioning the cache per cpu, which
>> is the proper way to do it for HPC and RT usecases where workloads are
>> partitioned on CPUs as well, then I really want to have an equally simple
>> way to monitor the occupancy for that reservation.
>>
>> And looking at that from the CAT point of view, which is the proper way to
>> do it, makes it obvious that CQM should be modeled to match CAT.
>>
>> So lets assume the following:
>>
>>CPU 0-3 default CLOSID 0
>>CPU 4   CLOSID 1
>>CPU 5   CLOSID 2
>>CPU 6   CLOSID 3
>>CPU 7   CLOSID 3
>>
>>T1  CLOSID 4
>>T2  CLOSID 5
>>T3  CLOSID 6
>>T4  CLOSID 6
>>
>>All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>>they run on.
>>
>> then the obvious basic monitoring requirement is to have a RMID for each
>> CLOSID.
>>
>> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not
>> care at all about the occupancy of T1 simply because that is running on a
>> separate reservation. Trying to make that an aggregated value in the first
>> place is completely wrong. If you want an aggregate, which is pretty much
>> useless, then user space tools can generate it easily.
>>
>> The whole approach you and David have taken is to whack some desired cgroup
>> functionality and whatever into CQM without rethinking the overall
>> design. And that's fundamentally broken because it does not take cache (and
>> memory bandwidth) allocation into account.
>>
>> I seriously doubt, that the existing CQM/MBM code can be refactored in any
>> useful way. As Peter Zijlstra said before: Remove the existing cruft
>> completely and start with completely new design from scratch.
>>
>> And this new design should start from the allocation angle and then add the
>> whole other muck on top so far its possible. Allocation related monitoring
>> must be the primary focus, everything else is just tinkering.
>>
>
> If in this email you meant "Resource group" where you wrote "CLOSID", then
> please disregard my previous email. It seems like a good idea to me to have
> a 1:1 mapping between RMIDs and "Resource groups".
>
> The distinction matters because changing the schemata in the resource group
> would likely trigger a change of CLOSID, which is useful.
>

Just realized that the sharing of CLOSIDs is not part of the accepted
version of RDT. My mental model was still on the old CAT driver that did
allow sharing of CLOSIDs between cgroups. Now I understand why CLOSID was
assumed to be equivalent to "Resource groups". Sorry for the noise. Then the
comments in my previous email hold.

In summary, and in addition to the latest emails:

A 1:1 mapping between CLOSID/"Resource group" and RMID, as Fenghua suggested,
is very problematic because the number of CLOSIDs is much much smaller than
the number of RMIDs, and, as Stephane mentioned, it's a common use case to
want to independently monitor many tasks/cgroups inside an allocation
partition.

A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement of
cgroup monitoring but the case 
