Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Tue, 7 Feb 2017, Stephane Eranian wrote:
> I think the design must ensure that the following usage models can be
> monitored:
> - the allocations in your CAT partitions
> - the allocations from a task (inclusive of children tasks)
> - the allocations from a group of tasks (inclusive of children tasks)
> - the allocations from a CPU
> - the allocations from a group of CPUs

What's missing here is:

- the allocations of a subset of users (tasks/groups/cpu(s)) of a
  particular CAT partition

Looking at your requirement list, all requirements, except the first
point, have no relationship to CAT (at least not from your write-up).
Now the obvious questions are:

- Does it make sense to ignore CAT relations in these sets?

- Does it make sense to monitor a task / group of tasks, where the tasks
  belong to different CAT partitions?

- Does it make sense to monitor a CPU / group of CPUs as a whole,
  independent of which CAT partitions have been utilized during the
  monitoring period?

I don't think it makes any sense, unless the resulting information is
split up into CAT partitions. I'm happy to be educated on the value of
making this CAT unaware, but so far I have only come up with results
which need a crystal ball to analyze.

Thanks,

	tglx
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Tony,

On Tue, Feb 7, 2017 at 10:52 AM, Luck, Tony wrote:
> On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
>> I think the design must ensure that the following usage models can be
>> monitored:
>> - the allocations in your CAT partitions
>> - the allocations from a task (inclusive of children tasks)
>> - the allocations from a group of tasks (inclusive of children tasks)
>> - the allocations from a CPU
>> - the allocations from a group of CPUs
>>
>> All cases but the first one (CAT) are natural usage. So I want to
>> describe the CAT case in more detail. The goal, as I understand it,
>> is to monitor what is going on inside the CAT partition to detect
>> whether it saturates or if it has room to "breathe".
>
> By "natural usage" you mean "like perf(1) provides for other events"?

Yes, people are used to monitoring events per task or per CPU. In that
sense, it is the common usage model. Cgroup monitoring is a derivative
of per-CPU mode.

> But we are trying to figure out requirements here ... what data do
> people need to manage caches and memory bandwidth. So from this
> perspective monitoring a CAT group is a natural first choice ... did we
> provision this group with too much, or too little cache.

I am not saying CAT is not natural. I am saying it is a justified but
new requirement, so we need to make sure it is understood and that the
kernel tracks CAT partitions and CAT partition cache occupancy
monitoring similarly.

> From that starting point I can see that a possible next step when
> finding that a CAT group has too small a cache is to drill down to
> find out how the tasks in the group are using cache. Armed with that
> information you could move tasks that hog too much cache (and are
> believed to be streaming through memory) into a different CAT group.

This is a valid usage model. But there are people who care about
monitoring occupancy and do not necessarily use CAT partitions. Even in
that case, the occupancy data is still very useful to gauge the cache
footprint of a workload. Therefore this usage model should not be
discounted.

> What I'm not seeing is how drilling down to CPUs helps you.

Looking for imbalance, for instance. Are all the allocations done from
only a subset of the CPUs?

> Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
> shows that 75% of the cache occupancy is attributed to CPU0, and only
> 25% to CPU1. What can you do with this information to improve things?
>
> If it is deemed too complex (from a kernel code perspective) to
> implement per-CPU reporting how bad a loss would that be?

It is okay to first focus on per-task and per-CAT-partition monitoring.
What I'd like to see is an API that could be extended later on to do a
per-CPU-only mode. I am okay with having only per-CAT and per-task
groups initially to keep things simpler, but the rsrcfs interface should
allow extension to a per-CPU-only mode. Then the kernel implementation
would take care of allocating the RMIDs accordingly. The key is always
to ensure that allocations can be tracked since the inception of the
group, be it CAT, tasks, or CPUs.

> -Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Tony, On Tue, Feb 7, 2017 at 10:52 AM, Luck, Tony wrote: > On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote: >> Hi, >> >> I wanted to take a few steps back and look at the overall goals for >> cache monitoring. >> From the various threads and discussion, my understanding is as follows. >> >> I think the design must ensure that the following usage models can be >> monitored: >>- the allocations in your CAT partitions >>- the allocations from a task (inclusive of children tasks) >>- the allocations from a group of tasks (inclusive of children tasks) >>- the allocations from a CPU >>- the allocations from a group of CPUs >> >> All cases but first one (CAT) are natural usage. So I want to describe >> the CAT in more details. >> The goal, as I understand it, it to monitor what is going on inside >> the CAT partition to detect >> whether it saturates or if it has room to "breathe". Let's take a >> simple example. > > By "natural usage" you mean "like perf(1) provides for other events"? > Yes, people are used to monitoring events per task or per CPU. In that sense, it is the common usage model. Cgroup monitoring is a derivative of per-cpu mode. > But we are trying to figure out requirements here ... what data do people > need to manage caches and memory bandwidth. So from this perspective > monitoring a CAT group is a natural first choice ... did we provision > this group with too much, or too little cache. > I am not saying CAT is not natural. I am saying it is a justified requirement but a new one and thus need to make sure it is understood and that the kernel must track CAT partition and CAT partition cache occupancy monitoring similarly. > From that starting point I can see that a possible next step when > finding that a CAT group has too small a cache is to drill down to > find out how the tasks in the group are using cache. 
Armed with that > information you could move tasks that hog too much cache (and are believed > to be streaming through memory) into a different CAT group. > This is a valid usage model. But you have people who care about monitoring occupancy but do not necessarily use CAT partitions. Yet in this case, the occupancy data is still very useful to gauge cache footprint of a workload. Therefore this usage model should not be discounted. > What I'm not seeing is how drilling to CPUs helps you. > Looking for imbalance, for instance. Are all the allocations done from only a subset of the CPUs? > Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that > shows that 75% of the cache occupancy is attributed to CPU0, and only > 25% to CPU1. What can you do with this information to improve things? > If it is deemed too complex (from a kernel code perspective) to > implement per-CPU reporting how bad a loss would that be? > It is okay to first focus on per-task and per-CAT partition. What I'd like to see is an API that could possibly be extended later on to do per-CPU only mode. I am okay with having only per-CAT and per-task groups initially to keep things simpler. But the rsrcfs interface should allow extension to per-CPU only mode. Then the kernel implementation would take care of allocating the RMID accordingly. The key is always to ensure allocations can be tracked since inception of the group be it CAT, tasks, CPU. > -Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Jan 20, 2017 at 12:11:53PM -0800, David Carrillo-Cisneros wrote:
> Implementation ideas:
>
> First idea is to expose one monitoring file per resource in a CTRLGRP,
> so the list of CTRLGRP's files would be: schemata, tasks, cpus,
> monitor_l3_0, monitor_l3_1, ...
>
> The monitor_ file descriptor is passed to perf_event_open
> in the way cgroup file descriptors are passed now. All events for the
> same (CTRLGRP, resource_id) share an RMID.
>
> The RMID allocation part can either be handled by RDT Allocation or by
> the RDT Monitoring PMU. Either way, the existence of the PMU's
> perf_events allocates/releases the RMID.

So I've had complaints about exactly that behaviour. Someone wanted
RMIDs assigned (and measurement started) the moment the grouping got
created / tasks started running, etc.

So I think the design should also explicitly state how this is supposed
to be handled and not leave it as an implementation detail.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Jan 20, 2017 at 03:51:48PM -0800, Shivappa Vikas wrote:
> I think the email thread is going very long and we should just meet f2f
> probably next week to iron out the requirements and chalk out a design
> proposal.

The thread isn't the problem; you lot not trimming your emails is,
however.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Tue, 7 Feb 2017, Stephane Eranian wrote:
> Hi,
>
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring.
>
> [...]
>
> However doing so limits certain monitoring modes where a user might
> want to get a breakdown per CPU of the allocations, such as with:
>
>   $ perf stat -a -A -e llc_occupancy -R cat1
>
> (where -R points to the monitoring group in rsrcfs). Here this mode
> would not be possible because the two CPUs in the group share the same
> RMID.

In the requirements here:

  https://marc.info/?l=linux-kernel=148597969808732

8) Can get measurements for subsets of tasks in a CAT group (to find the
guys hogging the resources).

This should also apply to subsets of CPUs. That would let you monitor
CPUs that are a subset of, or different from, a CAT group. That should
let you create mon groups like in the second example you mention, along
with the control groups above:

  mon0: RMID0 CPUs=CPU0
  mon1: RMID1 CPUs=CPU1
  mon2: RMID2 CPUs=CPU2
  ...

> Now let's take another scenario, and suppose you have two monitoring
> groups as follows:
>
>   mon1: RMID1 CPUs=CPU0,CPU1
>   mon2: RMID2 TASKS=PID20
>
> If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
> done by PID20 are not counted towards RMID1. There is a blind spot.
>
> [...]
>
> If the kernel treats occupancy monitoring as measuring cycles on a CPU,
> i.e., measures any activity from any thread (choice 1), then a single
> RMID per group does not work. If the kernel treats occupancy monitoring
> as measuring cycles in a cgroup on a CPU, i.e., measures only when
> threads of the cgroup run on that CPU, then using a single RMID per
> group works.

Agreed, there are blind spots in both. But the requirements try to be
based on the resctrl allocation, as Thomas suggested, which is aligned
with monitoring real-time tasks as I understand it. For the above
example, some tasks which do not have an RMID (say, in the root group)
are the real-time tasks that are specially configured to run on a CPUx
and need to be allocated for or monitored.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Tue, Feb 07, 2017 at 12:08:09AM -0800, Stephane Eranian wrote:
> Hi,
>
> I wanted to take a few steps back and look at the overall goals for
> cache monitoring. From the various threads and discussions, my
> understanding is as follows.
>
> I think the design must ensure that the following usage models can be
> monitored:
> - the allocations in your CAT partitions
> - the allocations from a task (inclusive of children tasks)
> - the allocations from a group of tasks (inclusive of children tasks)
> - the allocations from a CPU
> - the allocations from a group of CPUs
>
> All cases but the first one (CAT) are natural usage. So I want to
> describe the CAT case in more detail. The goal, as I understand it, is
> to monitor what is going on inside the CAT partition to detect whether
> it saturates or if it has room to "breathe". Let's take a simple
> example.

By "natural usage" you mean "like perf(1) provides for other events"?

But we are trying to figure out requirements here ... what data do
people need to manage caches and memory bandwidth. So from this
perspective monitoring a CAT group is a natural first choice ... did we
provision this group with too much, or too little, cache?

From that starting point I can see that a possible next step, when
finding that a CAT group has too small a cache, is to drill down to find
out how the tasks in the group are using cache. Armed with that
information you could move tasks that hog too much cache (and are
believed to be streaming through memory) into a different CAT group.

What I'm not seeing is how drilling down to CPUs helps you.

Say you have CPUs=CPU0,CPU1 in the CAT group and you collect data that
shows that 75% of the cache occupancy is attributed to CPU0, and only
25% to CPU1. What can you do with this information to improve things?

If it is deemed too complex (from a kernel code perspective) to
implement per-CPU reporting, how bad a loss would that be?

-Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Hi,

I wanted to take a few steps back and look at the overall goals for
cache monitoring. From the various threads and discussions, my
understanding is as follows.

I think the design must ensure that the following usage models can be
monitored:
- the allocations in your CAT partitions
- the allocations from a task (inclusive of children tasks)
- the allocations from a group of tasks (inclusive of children tasks)
- the allocations from a CPU
- the allocations from a group of CPUs

All cases but the first one (CAT) are natural usage. So I want to
describe the CAT case in more detail. The goal, as I understand it, is
to monitor what is going on inside the CAT partition to detect whether
it saturates or if it has room to "breathe". Let's take a simple
example.

Suppose we have a CAT group, cat1:

  cat1: 20MB partition (CLOSID1) CPUs=CPU0,CPU1 TASKs=PID20

There can only be one CLOSID active on a CPU at a time. The kernel
chooses to prioritize tasks over CPUs when enforcing cases with multiple
CLOSIDs. Let's review how this works for cat1 and, for each scenario,
look at whether the kernel enforces the cache partition:

1. ENFORCED:     PIDx with no CLOSID runs on CPU0 or CPU1
2. NOT ENFORCED: PIDx with CLOSIDx (x!=1) runs on CPU0, CPU1
3. ENFORCED:     PID20 runs with CLOSID1 on CPU0, CPU1
4. ENFORCED:     PID20 runs with CLOSID1 on CPUx (x!=0,1) with CPU CLOSIDx (x!=1)
5. ENFORCED:     PID20 runs with CLOSID1 on CPUx (x!=0,1) with no CLOSID

Now, let's review how we could track the allocations done in cat1 using
a single RMID. There can only be one RMID active at a time per CPU. The
kernel chooses to prioritize tasks over CPUs:

  cat1: 20MB partition (CLOSID1, RMID1) CPUs=CPU0,CPU1 TASKs=PID20

1. MONITORED:     PIDx with no RMID runs on CPU0 or CPU1
2. NOT MONITORED: PIDx with RMIDx (x!=1) runs on CPU0, CPU1
3. MONITORED:     PID20 with RMID1 runs on CPU0, CPU1
4. MONITORED:     PID20 with RMID1 runs on CPUx (x!=0,1) with CPU RMIDx (x!=1)
5. MONITORED:     PID20 runs with RMID1 on CPUx (x!=0,1) with no RMID

To make sense to a user, the cases where the hardware monitors MUST be
the same as the cases where the hardware enforces the cache
partitioning. Here we see that it works using a single RMID. However,
doing so limits certain monitoring modes where a user might want to get
a breakdown per CPU of the allocations, such as with:

  $ perf stat -a -A -e llc_occupancy -R cat1

(where -R points to the monitoring group in rsrcfs). Here this mode
would not be possible because the two CPUs in the group share the same
RMID.

Now let's take another scenario, and suppose you have two monitoring
groups as follows:

  mon1: RMID1 CPUs=CPU0,CPU1
  mon2: RMID2 TASKS=PID20

If PID20 runs on CPU0, then RMID2 is activated, and thus allocations
done by PID20 are not counted towards RMID1. There is a blind spot.
Whether or not this is a problem depends on the semantics exported by
the interface for CPU mode:

1. Count all allocations from any task running on the CPU
2. Count all allocations from tasks which are NOT monitoring themselves

If the kernel chooses 1, then there is a blind spot and the measurement
is not as accurate as it could be, because of the decision to use only
one RMID. But if the kernel chooses 2, then everything works fine with a
single RMID.

If the kernel treats occupancy monitoring as measuring cycles on a CPU,
i.e., measures any activity from any thread (choice 1), then a single
RMID per group does not work. If the kernel treats occupancy monitoring
as measuring cycles in a cgroup on a CPU, i.e., measures only when
threads of the cgroup run on that CPU, then using a single RMID per
group works.

Hope this helps clarify the usage model and design choices.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Mon, Feb 6, 2017 at 3:27 PM, Luck, Tony wrote:
>> cgroup mode gives a per-CPU breakdown of event and running time, the
>> tool aggregates it into running time vs event count. Both per-cpu
>> breakdown and the aggregate are useful.
>>
>> Piggy-backing on perf's cgroup mode would give us all the above for free.
>
> Do you have some sample output from a perf run on a cgroup measuring a
> "normal" event showing what you get?

# perf stat -I 1000 -e cycles -a -C 0-1 -A -x, -G /
1.000116648,CPU0,20677864,,cycles,/
1.000169948,CPU1,24760887,,cycles,/
2.000453849,CPU0,36120862,,cycles,/
2.000480259,CPU1,12535575,,cycles,/
3.000664762,CPU0,7564504,,cycles,/
3.000692552,CPU1,7307480,,cycles,/

> I think that requires that we still go through perf ->start() and ->stop()
> functions to know how much time we spent running. I thought we were looking
> at bundling the RMID updates into the same spot in sched() where we switch
> the CLOSID. More or less at the "start" point, but there is no "stop". If we
> are switching between runnable processes, it amounts to pretty much the same
> thing ... except we bill to someone all the time instead of having a gap in
> the context switch where we stopped billing to the old task and haven't
> started billing to the new one yet.

Another problem is that it will require a perf event all the time for timing
measurements to be consistent with RMID measurements. The only sane option I
can come up with is to do timing in RDT the way perf cgroup does it (keep a
per-cpu time that increases with the local clock's delta). A reader can add
the times for all CPUs in cpu_mask.

> But if we idle ... then we don't "stop". Shouldn't matter much from a
> measurement perspective because idle won't use cache or consume bandwidth.
> But we'd count that time as "on cpu" for the last process to run.

I may be missing something basic, but isn't __switch_to called when switching
to the idle task? That will update the CLOSID and RMID to whatever the idle
task is in, won't it?

Thanks,
David
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
> cgroup mode gives a per-CPU breakdown of event and running time, the
> tool aggregates it into running time vs event count. Both per-cpu
> breakdown and the aggregate are useful.
>
> Piggy-backing on perf's cgroup mode would give us all the above for free.

Do you have some sample output from a perf run on a cgroup measuring a
"normal" event showing what you get?

I think that requires that we still go through perf ->start() and ->stop()
functions to know how much time we spent running. I thought we were looking at
bundling the RMID updates into the same spot in sched() where we switch the
CLOSID. More or less at the "start" point, but there is no "stop". If we are
switching between runnable processes, it amounts to pretty much the same thing
... except we bill to someone all the time instead of having a gap in the
context switch where we stopped billing to the old task and haven't started
billing to the new one yet.

But if we idle ... then we don't "stop". Shouldn't matter much from a
measurement perspective because idle won't use cache or consume bandwidth. But
we'd count that time as "on cpu" for the last process to run.

-Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Mon, Feb 6, 2017 at 1:22 PM, Luck, Tony wrote:
>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of
> bytes read is 1GB at some point, and one second later the total bytes is
> 2GB, then it is clear that the average bandwidth for this process is 1GB/s.
> If you know that the task was only running for 50% of the cycles during
> that 1s interval, you could say that it is doing 2GB/s ... which is I
> believe what you were thinking when you wrote #12 above.

Yes, that's one of the cases.

> But whether that is right depends a bit on *why* it only ran 50% of the
> time. If it was time-sliced out by the scheduler ... then it may have been
> trying to be a 2GB/s app. But if it was waiting for packets from the
> network, then it really is using 1 GB/s.

IMO, "right" means that measured bandwidth and running time are correct. The
*why* is a bigger question.

> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.

cgroup mode gives a per-CPU breakdown of event and running time, the tool
aggregates it into running time vs event count. Both per-cpu breakdown and
the aggregate are useful.

Piggy-backing on perf's cgroup mode would give us all the above for free.

> -Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Mon, Feb 6, 2017 at 1:36 PM, Shivappa Vikas wrote:
>
> On Mon, 6 Feb 2017, Luck, Tony wrote:
>
>>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>>> should provide total_time_enabled in the way perf does, otherwise is
>>> hard to interpret MBM values.
>>
>> It seems that it is hard to define what we even mean by memory bandwidth.
>>
>> If you are measuring just one task and you find that the total number of
>> bytes read is 1GB at some point, and one second later the total bytes is
>> 2GB, then it is clear that the average bandwidth for this process is
>> 1GB/s. If you know that the task was only running for 50% of the cycles
>> during that 1s interval, you could say that it is doing 2GB/s ... which
>> is I believe what you were thinking when you wrote #12 above. But whether
>> that is right depends a bit on *why* it only ran 50% of the time. If it
>> was time-sliced out by the scheduler ... then it may have been trying to
>> be a 2GB/s app. But if it was waiting for packets from the network, then
>> it really is using 1 GB/s.
>
> Is the requirement to have both enabled and run time, or just enabled time
> (enabled time must be easy to report - just the wall time from start trace
> to end trace)?

Both, but since the original requirements dropped rotation, then
total_running == total_enabled.

> This is not reported correctly in the upstream perf cqm and for
> cgroup -C we don't report it either (since we report the package).

Using the -x option shows the run time and the % enabled. Many tools use
that CSV output.

> Thanks,
> Vikas
>
>> All bets are off if you are measuring a service that consists of several
>> tasks running concurrently. All you can really talk about is the aggregate
>> average bandwidth (total bytes / wall-clock time). It makes no sense to
>> try and factor in how much cpu time each of the individual tasks got.
>>
>> -Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Mon, 6 Feb 2017, Luck, Tony wrote:

>> 12) Whatever fs or syscall is provided instead of perf syscalls, it
>> should provide total_time_enabled in the way perf does, otherwise is
>> hard to interpret MBM values.
>
> It seems that it is hard to define what we even mean by memory bandwidth.
>
> If you are measuring just one task and you find that the total number of
> bytes read is 1GB at some point, and one second later the total bytes is
> 2GB, then it is clear that the average bandwidth for this process is 1GB/s.
> If you know that the task was only running for 50% of the cycles during
> that 1s interval, you could say that it is doing 2GB/s ... which is I
> believe what you were thinking when you wrote #12 above. But whether that
> is right depends a bit on *why* it only ran 50% of the time. If it was
> time-sliced out by the scheduler ... then it may have been trying to be a
> 2GB/s app. But if it was waiting for packets from the network, then it
> really is using 1 GB/s.

Is the requirement to have both enabled and run time, or just enabled time
(enabled time must be easy to report - just the wall time from start trace
to end trace)?

This is not reported correctly in the upstream perf cqm and for cgroup -C we
don't report it either (since we report the package).

Thanks,
Vikas

> All bets are off if you are measuring a service that consists of several
> tasks running concurrently. All you can really talk about is the aggregate
> average bandwidth (total bytes / wall-clock time). It makes no sense to
> try and factor in how much cpu time each of the individual tasks got.
>
> -Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

It seems that it is hard to define what we even mean by memory bandwidth.

If you are measuring just one task and you find that the total number of bytes
read is 1GB at some point, and one second later the total bytes is 2GB, then
it is clear that the average bandwidth for this process is 1GB/s. If you know
that the task was only running for 50% of the cycles during that 1s interval,
you could say that it is doing 2GB/s ... which is I believe what you were
thinking when you wrote #12 above. But whether that is right depends a bit on
*why* it only ran 50% of the time. If it was time-sliced out by the scheduler
... then it may have been trying to be a 2GB/s app. But if it was waiting for
packets from the network, then it really is using 1 GB/s.

All bets are off if you are measuring a service that consists of several tasks
running concurrently. All you can really talk about is the aggregate average
bandwidth (total bytes / wall-clock time). It makes no sense to try and factor
in how much cpu time each of the individual tasks got.

-Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Digging through the e-mails from last week to generate a new version of the
requirements I looked harder at this:

> 12) Whatever fs or syscall is provided instead of perf syscalls, it
> should provide total_time_enabled in the way perf does, otherwise is
> hard to interpret MBM values.

This looks tricky if we are piggy-backing on the CAT code to switch RMID
along with CLOSID at context switch time. We could get an approximation by
adding:

	if (newRMID != oldRMID) {
		now = grab current time in some format
		atomic_add(rmid_enabled_time[oldRMID], now - this_cpu_read(rmid_time));
		this_cpu_write(rmid_time, now);
	}

but:

1) that would only work on a single socket machine (we'd really want
   rmid_enabled_time separately for each socket)
2) when we want to read that enabled time, we'd really need to add time for
   all the threads currently running on CPUs across the system since we last
   switched RMID
3) reading the time and doing atomic ops in context switch code won't be
   popular :-(

-Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Feb 03, 2017 at 01:08:05PM -0800, David Carrillo-Cisneros wrote:
> On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony wrote:
> > On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> >> If we tie allocation groups and monitoring groups, we are tying the
> >> meaning of CPUs and we'll have to choose between the CAT meaning or
> >> the perf meaning.
> >>
> >> Let's allow semantics that will allow perf like monitoring to
> >> eventually work, even if its not immediately supported.
> >
> > Would it work to make monitor groups be "task list only" or "cpu mask only"
> > (unlike control groups that allow mixing).
>
> That works, but please don't use chmod. Make it explicit by the group
> position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

I had been thinking that after writing a PID to "tasks" we'd disallow writes
to "cpus". But it sounds nicer for the user to declare their intention
upfront. Counter proposal in the naming war:

	.../monitor/bytask/{groupname}
	.../monitor/bycpu/{groupname}

> > Then the intel_rdt_sched_in() code could pick the RMID in ways that
> > give you the perf(1) meaning. I.e. if you create a monitor group and assign
> > some CPUs to it, then we will always load the RMID for that monitor group
> > when running on those cpus, regardless of what group(s) the current process
> > belongs to. But if you didn't create any cpu-only monitor groups, then we'd
> > assign RMID using same rules as CLOSID (so measurements from a control group
> > would track allocation policies).
>
> I think that's very confusing for the user. A group's observed
> behavior should be determined by its attributes and not change
> depending on how other groups are configured. Think on multiple users
> monitoring simultaneously.
>
> > We are already planning that creating monitor only groups will change
> > what is reported in the control group (e.g. you pull some tasks out of
> > the control group to monitor them separately, so the control group only
> > reports the tasks that you didn't move out for monitoring).
>
> That's also confusing, and the work-around that Vikas proposed of two
> separate files to enumerate tasks (one for control and one for
> monitoring) breaks the concept of a task group.

There are some simple cases where we can make the data shown in the original
control group look the same. E.g. we move a few tasks over to a /bytask/
group (or several groups if we want a very fine breakdown) and then have the
report from the control group sum the RMIDs from the monitor groups and add
to the total from the native RMID of the control group.

But this falls apart if the user asks a single monitor group to monitor
tasks from multiple control groups. Perhaps we could disallow this (when we
assign the first task to a monitor group, capture the CLOSID and then only
allow other tasks with the same CLOSID to be added ... unless the group
becomes empty, at which point we can latch onto a new CLOSID).

/bycpu/ monitoring is very resource intensive if we have to preserve the
control group reports. We'd need to allocate MAXCLOSID[1] RMIDs for each
group so that we can keep separate counts for tasks from each control group
that run on our CPUs and then sum them to report the /bycpu/ data (instead of
just one RMID, and no math). This also puts more memory references into the
sched_in path while we figure out which RMID to load into PQR_ASSOC.

I'd want to warn the user in the Documentation that splitting off too many
monitor groups from a control group will result in less than stellar accuracy
in reporting as the kernel cannot read multiple RMIDs atomically and data is
changing between reads.

> I know the present implementation scope is limited, so you could:
> - support 1) and/or 2) only
> - do a simple RMID management (e.g. same RMID all packages, allocate
> RMID on creation or fail)
> - do the custom fs based tool that Vikas mentioned instead of using
> perf_event_open (if it's somehow easier to build and maintain a new
> tool rather than reuse perf(1) ).
>
> any or all of the above are fine. But please don't choose group
> semantics that will prevent us from eventually supporting full
> perf-like behavior or that we already know explode in user's face.

I'm trying hard to find a way to do this. I.e. start with a patch that has
limited capabilities and needs a custom tool, but can later grow into
something that meets your needs.

-Tony

[1] Lazy allocation means finding we can't find a free RMID in the middle of
context switch ... not willing to go there.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Feb 3, 2017 at 9:52 AM, Luck, Tony wrote:
> On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
>> If we tie allocation groups and monitoring groups, we are tying the
>> meaning of CPUs and we'll have to choose between the CAT meaning or
>> the perf meaning.
>>
>> Let's allow semantics that will allow perf like monitoring to
>> eventually work, even if its not immediately supported.
>
> Would it work to make monitor groups be "task list only" or "cpu mask only"
> (unlike control groups that allow mixing).

That works, but please don't use chmod. Make it explicit by the group
position (i.e. mon/cpus/grpCPU1, mon/tasks/grpTasks1).

> Then the intel_rdt_sched_in() code could pick the RMID in ways that
> give you the perf(1) meaning. I.e. if you create a monitor group and assign
> some CPUs to it, then we will always load the RMID for that monitor group
> when running on those cpus, regardless of what group(s) the current process
> belongs to. But if you didn't create any cpu-only monitor groups, then we'd
> assign RMID using same rules as CLOSID (so measurements from a control group
> would track allocation policies).

I think that's very confusing for the user. A group's observed behavior
should be determined by its attributes and not change depending on how other
groups are configured. Think on multiple users monitoring simultaneously.

> We are already planning that creating monitor only groups will change
> what is reported in the control group (e.g. you pull some tasks out of
> the control group to monitor them separately, so the control group only
> reports the tasks that you didn't move out for monitoring).

That's also confusing, and the work-around that Vikas proposed of two
separate files to enumerate tasks (one for control and one for monitoring)
breaks the concept of a task group.

From our discussions, we can support the use cases we care about without
weird corner cases, by having:

- A set of allocation groups as they stand now. Either use the current
  resctrl, or rename it to something like resdir/ctrl (before v4.10 sails).
- A set of monitoring task groups. Either in a "tasks" folder in a resmon fs
  or in resdir/mon/tasks.
- A set of monitoring CPU groups. Either in a "cpus" folder in a resmon fs
  or in resdir/mon/cpus.

So when a user measures a group (shown using the -G option, it could as well
be the -R Vikas wants):

1) perf stat -e llc_occupancy -G resdir/ctrl/g1
   measures the CAT allocation group as if RMIDs were managed like CLOSIDs.
2) perf stat -e llc_occupancy -G resdir/mon/tasks/g1
   measures the combined occupancy of all tasks in g1 (like a cgroup in
   present perf).
3) perf stat -e llc_occupancy -C *XOR* perf stat -e llc_occupancy -G resdir/mon/cpus/g1
   measures the combined occupancy of all tasks while they ran on any CPU in
   g1 (perf-like filtering, not the CAT way).

I know the present implementation scope is limited, so you could:
- support 1) and/or 2) only
- do a simple RMID management (e.g. same RMID all packages, allocate RMID on
  creation or fail)
- do the custom fs based tool that Vikas mentioned instead of using
  perf_event_open (if it's somehow easier to build and maintain a new tool
  rather than reuse perf(1) ).

any or all of the above are fine. But please don't choose group semantics
that will prevent us from eventually supporting full perf-like behavior or
that we already know explode in user's face.

Thanks,
David
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Feb 02, 2017 at 06:14:05PM -0800, David Carrillo-Cisneros wrote:
> If we tie allocation groups and monitoring groups, we are tying the
> meaning of CPUs and we'll have to choose between the CAT meaning or
> the perf meaning.
>
> Let's allow semantics that will allow perf-like monitoring to
> eventually work, even if it's not immediately supported.

Would it work to make monitor groups be "task list only" or "cpu mask only" (unlike control groups that allow mixing)?

Then the intel_rdt_sched_in() code could pick the RMID in ways that give you the perf(1) meaning. I.e. if you create a monitor group and assign some CPUs to it, then we will always load the RMID for that monitor group when running on those cpus, regardless of what group(s) the current process belongs to. But if you didn't create any cpu-only monitor groups, then we'd assign RMIDs using the same rules as CLOSIDs (so measurements from a control group would track allocation policies).

We are already planning that creating monitor-only groups will change what is reported in the control group (e.g. you pull some tasks out of the control group to monitor them separately, so the control group only reports the tasks that you didn't move out for monitoring).

-Tony
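[Editor's sketch] The RMID-pick rule Tony proposes for intel_rdt_sched_in() can be modeled with a small toy function. This is illustrative only, not kernel code; all names are invented for the sketch:

```python
# Toy model of the proposed RMID pick on context switch: a CPU-only
# monitor group, if one exists for the current CPU, always wins;
# otherwise the RMID follows the task's group, the same way CAT picks
# a CLOSID. Illustrative only, not actual kernel code.

def pick_rmid(cpu, task, cpu_mon_rmids, task_group_rmids, default_rmid=0):
    """Return the RMID to load into PQR_ASSOC when `task` runs on `cpu`.

    cpu_mon_rmids:    cpu  -> RMID of a CPU-only monitor group
    task_group_rmids: task -> RMID of the group the task belongs to
    """
    if cpu in cpu_mon_rmids:
        # A monitor group claimed this CPU: load its RMID regardless of
        # which group the current process belongs to.
        return cpu_mon_rmids[cpu]
    # No cpu-only monitor groups: fall back to CLOSID-style rules.
    return task_group_rmids.get(task, default_rmid)
```

Note how under this rule a group's reported occupancy depends on whether some other (CPU-only) group happens to exist, which is exactly the behavior David calls confusing.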
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Something to be aware of is that CAT CPUs don't work the way CPU filtering works in perf. If I have the following CAT groups:

- default group with task TD
- group GC1 with CPU0 and CLOSID1
- group GT1 with no CPUs, task T1, and CLOSID2
- TD and T1 run on CPU0

Then T1 will use CLOSID2 and TD will use CLOSID1. Some allocations done on CPU0 did not use CLOSID1. Now, if I have the same setup in monitoring groups and I were to read llc_occupancy in the RMID of GC1, I'd read llc_occupancy for TD only, and have a blind spot on T1. That's not how CPU events work in perf. So CPUs have a different meaning in CAT than in perf.

The above is another reason to separate the allocation and the monitoring groups. Having:

- independent allocation and monitoring groups
- independent CPU and task grouping

would allow us semantics that monitor CAT groups and can eventually be extended to also monitor the perf way, that is, to support:

- filter by task
- filter by task group (cgroup or monitoring group or whatever)
- filter by CPU (the perf way)
- combinations of task/task_group and CPU (the perf way)

If we tie allocation groups and monitoring groups, we are tying the meaning of CPUs and we'll have to choose between the CAT meaning or the perf meaning. Let's allow semantics that will allow perf-like monitoring to eventually work, even if it's not immediately supported.

Thanks, David
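[Editor's sketch] David's precedence example can be written down as a toy resolver. TD, T1, GC1, and GT1 follow the example above; the model itself is illustrative only:

```python
# CAT precedence, per the example above: a task's own group wins over
# the CPU's group, which wins over the default group. Illustrative only.

def resolve_closid(task, cpu, task_groups, cpu_groups, default_closid=0):
    if task in task_groups:      # GT1-style group: task membership
        return task_groups[task]
    if cpu in cpu_groups:        # GC1-style group: CPU membership
        return cpu_groups[cpu]
    return default_closid        # default group

# The setup from the text: GT1 owns T1 with CLOSID2, GC1 owns CPU0
# with CLOSID1, and TD sits in the default group.
task_groups = {"T1": 2}
cpu_groups = {0: 1}
```

Here resolve_closid("T1", 0, ...) yields 2 while resolve_closid("TD", 0, ...) yields 1, so applying the same precedence to monitoring means GC1's RMID sees only TD and is blind to T1, unlike perf's -C filtering, which would count everything on CPU0.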
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Feb 2, 2017 at 3:41 PM, Luck, Tony wrote:
> On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
>> There is no need to change perf(1) to support
>> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>>
>> the PMU can work with resctrl to provide the support through
>> perf_event_open, with the advantage that tools other than perf could
>> also use it.
>
> I agree it would be better to expose the counters through
> a standard perf_event_open() interface ... but we don't seem
> to have had much luck doing that so far.
>
> That would need the requirements to be re-written with the
> focus of what resctrl needs to do to support each of the
> perf(1) command line modes of operation. The fact that these
> counters work rather differently from normal h/w counters
> has resulted in massively complex volumes of code trying
> to map them into what perf_event_open() expects.
>
> The key points of weirdness seem to be:
>
> 1) We need to allocate an RMID for the duration of monitoring. While
>    there are quite a lot of RMIDs, it is easy to envision scenarios
>    where there are not enough.
>
> 2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever
>    a process of interest is running.
>
> 3) An RMID is shared by the llc_occupancy, local_bytes and total_bytes
>    events.
>
> 4) For llc_occupancy the count can change even when none of the processes
>    are running, because cache lines are evicted.
>
> 5) llc_occupancy measures the delta, not the absolute occupancy. To
>    get a good result requires monitoring from process creation (or
>    lots of patience, or the nuclear option "wbinvd").
>
> 6) RMID counters are package scoped.
>
> These result in all sorts of hard-to-resolve situations. E.g. you are
> monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm
> looking at the cache occupancy of PID=234 using RMID=45. The scheduler
> decides to run my process on your CPU. We can only load one RMID, so
> one of us will be disappointed (unless we have some crazy complex code
> where your instance of perf borrows RMID=45 and reads out the local
> byte count on sched_in() and sched_out() to add to the running count
> you were keeping against RMID=22).
>
> How can we document such restrictions for people who haven't been
> digging in this code for over a year?
>
> I think a perf_event_open() interface would make some simple cases
> work, but result in some swearing once people start running multiple
> complex monitors at the same time.

More problems:

7) Time multiplexing of RMIDs is hard because llc_occupancy cannot be reset.

8) Only one RMID per CPU can be loaded at a time into PQR_ASSOC.

Most of the complexity in past attempts was caused by:

A. Task events being defined as system-wide and not package-wide. What you describe in points (4) and (6) made this complicated.

B. The cgroup hierarchy, due to (7) and (8).

A and B caused the bulk of the code by complicating RMID assignment, reading, and rotation. Now that we've learned from past experience, we have defined per-domain monitoring and use flat groups. FWICT, that's enough to allow a simple implementation that can be expressed through perf_event_open.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Feb 02, 2017 at 12:22:42PM -0800, David Carrillo-Cisneros wrote:
> There is no need to change perf(1) to support
> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>
> the PMU can work with resctrl to provide the support through
> perf_event_open, with the advantage that tools other than perf could
> also use it.

I agree it would be better to expose the counters through a standard perf_event_open() interface ... but we don't seem to have had much luck doing that so far.

That would need the requirements to be re-written with the focus of what resctrl needs to do to support each of the perf(1) command line modes of operation. The fact that these counters work rather differently from normal h/w counters has resulted in massively complex volumes of code trying to map them into what perf_event_open() expects.

The key points of weirdness seem to be:

1) We need to allocate an RMID for the duration of monitoring. While there are quite a lot of RMIDs, it is easy to envision scenarios where there are not enough.

2) We need to load that RMID into PQR_ASSOC on a logical CPU whenever a process of interest is running.

3) An RMID is shared by the llc_occupancy, local_bytes and total_bytes events.

4) For llc_occupancy the count can change even when none of the processes are running, because cache lines are evicted.

5) llc_occupancy measures the delta, not the absolute occupancy. To get a good result requires monitoring from process creation (or lots of patience, or the nuclear option "wbinvd").

6) RMID counters are package scoped.

These result in all sorts of hard-to-resolve situations. E.g. you are monitoring local bandwidth coming from logical CPU2 using RMID=22. I'm looking at the cache occupancy of PID=234 using RMID=45. The scheduler decides to run my process on your CPU. We can only load one RMID, so one of us will be disappointed (unless we have some crazy complex code where your instance of perf borrows RMID=45 and reads out the local byte count on sched_in() and sched_out() to add to the running count you were keeping against RMID=22).

How can we document such restrictions for people who haven't been digging in this code for over a year?

I think a perf_event_open() interface would make some simple cases work, but result in some swearing once people start running multiple complex monitors at the same time.

-Tony
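[Editor's sketch] The RMID=22 / RMID=45 collision Tony describes follows from PQR_ASSOC being a single per-CPU register. A minimal toy model of that conflict (illustrative only, not kernel code; the precedence chosen here is one arbitrary possibility):

```python
# One RMID per logical CPU: when a CPU-scoped monitor (RMID=22 on CPU2)
# and a task-scoped monitor (RMID=45 for PID 234) overlap, only one RMID
# can be live, so the other monitor silently misses events.

class LogicalCpu:
    def __init__(self, cpu_monitor_rmid=None):
        self.cpu_monitor_rmid = cpu_monitor_rmid  # CPU-scoped monitor, if any
        self.pqr_assoc = 0                        # the single live RMID

    def sched_in(self, task_rmid):
        # Some precedence must be picked; here the CPU monitor wins, so
        # the task monitor is blind while its task runs on this CPU.
        if self.cpu_monitor_rmid is not None:
            self.pqr_assoc = self.cpu_monitor_rmid
        else:
            self.pqr_assoc = task_rmid
        return self.pqr_assoc

# CPU2 is claimed by a bandwidth monitor with RMID 22.
cpu2 = LogicalCpu(cpu_monitor_rmid=22)
```

Scheduling the RMID=45 task onto cpu2 loads 22, not 45, so the occupancy monitor for PID 234 loses the events generated there, which is the "one of us will be disappointed" outcome above.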
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Feb 2, 2017 at 11:33 AM, Luck, Tony wrote:
>>> Nice to have:
>>> 1) Readout using "perf(1)" [subset of modes that make sense ... tying
>>> monitoring to resctrl file system will make most command line usage of
>>> perf(1) close to impossible.
>>
>> We discussed this offline and I still disagree that it is close to
>> impossible to use perf and perf_event_open. In fact, I think it's very
>> simple:
>
> Maybe s/most/many/ ?
>
> The issue here is that we are going to define which tasks and cpus are being
> monitored *outside* of the perf command. So usage like:
>
> # perf stat -I 1000 -e intel_cqm/llc_occupancy {command}
>
> is completely out of scope ... we aren't planning to change the perf(1)
> command to know about creating a CQM monitor group, assigning the
> PID of {command} to it, and then reporting on llc_occupancy.
>
> So perf(1) usage is only going to support modes where it attaches to some
> monitor group that was previously established. The "-C 2" option to monitor
> CPU 2 is certainly plausible ... assuming you set up a monitor group to track
> what is happening on CPU 2 ... I just don't know how perf(1) would know the
> name of that group.

There is no need to change perf(1) to support

# perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

The PMU can work with resctrl to provide the support through perf_event_open, with the advantage that tools other than perf could also use it. I'd argue it's more stable and has fewer corner cases if the task_mongroups get extra RMIDs for the task events attached to them than having userspace tools create and destroy groups and move tasks behind the scenes. I provided implementation details in the write-up I shared offline on Monday. If "easy monitoring" of a stand-alone task becomes a requirement, we can dig into the pros and cons of implementing it in kernel vs. user space.

> Vikas is pushing for "-R rdtgroup" ... though our offline discussions included
> overloading "-g" and having perf(1) pick appropriately from cgroups or
> rdtgroups depending on event type.

I see it more like generalizing the -G option to represent a task group that can be a cgroup or a PMU-specific one. Currently perf(1) simply translates the argument of the -G option into a file descriptor. My idea doesn't change that, it just makes the perf tool look for a "task_group_root" file in the PMU folder and use it as the base path for the file descriptor. If a PMU doesn't have such a file, then perf(1) uses the perf cgroup mounting point, as it does now. That makes for a very simple implementation on the perf tool side.
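[Editor's sketch] The -G generalization David describes could look roughly like this on the tool side. Note the "task_group_root" file is his hypothetical PMU attribute, not an existing perf or sysfs interface, and the paths are examples:

```python
import os

# Sketch of generalized -G resolution: if the PMU exposes a
# (hypothetical) "task_group_root" file, its contents give the base
# path for group directories; otherwise fall back to the perf cgroup
# mount point, as perf(1) does today.

def group_base_path(task_group_root, cgroup_mount):
    """task_group_root: contents of the PMU's "task_group_root" file,
    or None if the PMU does not provide one (the common case today)."""
    if task_group_root:
        return task_group_root.strip()
    # No PMU-specific root: use the perf cgroup mount, as perf(1) does now.
    return cgroup_mount

def group_fd_path(task_group_root, cgroup_mount, group_name):
    # perf(1) would open this path and pass the fd to perf_event_open().
    return os.path.join(group_base_path(task_group_root, cgroup_mount),
                        group_name)
```

So -G g1 against a PMU advertising /sys/fs/resctrl/mon/tasks would resolve to /sys/fs/resctrl/mon/tasks/g1, while any other PMU keeps today's cgroup resolution unchanged.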
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Hello Peterz/Andi,

On Thu, 2 Feb 2017, Luck, Tony wrote:
> Nice to have:
> 1) Readout using "perf(1)" [subset of modes that make sense ... tying
> monitoring to resctrl file system will make most command line usage of
> perf(1) close to impossible.
> Vikas is pushing for "-R rdtgroup" ... though our offline discussions
> included overloading "-g" and have perf(1) pick appropriately from cgroups
> or rdtgroups depending on event type.

Assume we build support to monitor the existing resctrl CAT groups like Thomas suggested. For the perf interface, would something like the below seem reasonable or a disaster (given that we have a new -R option specific to the PMU, which works only on this PMU)?

# mount -t resctrl resctrl /sys/fs/resctrl
# cd /sys/fs/resctrl
# mkdir p0 p1
# echo "L3:0=3;1=c" > /sys/fs/resctrl/p0/schemata
# echo "L3:0=3;1=3" > /sys/fs/resctrl/p1/schemata

Now monitor the group p1 using perf. perf would have a new option -R to monitor the resctrl groups. perf would still have a cqm event like today's intel_cqm/llc_occupancy, which however supports only one mode, -R, and not any of -C, -t, -G, etc. So pretty much the -R works like a -G ... except that it works on the resctrl fs and not the perf cgroup. The PMU would have a flag to indicate to the perf user mode that only the llc_occupancy event is supported for -R.

# perf stat -e intel_cqm/llc_occupancy -R p1

-Vikas
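[Editor's note] For readers following the example, the schemata lines above are per-L3-cache-id hex capacity bitmasks (CBMs), per the resctrl interface documentation. A small parser showing what the example writes actually configure (the helper itself is illustrative):

```python
# Parse a resctrl L3 schemata line such as "L3:0=3;1=c" into a mapping
# of cache id -> capacity bitmask (CBM). Each set bit in a CBM is one
# cache way, so mask 0x3 spans two ways and 0xc two different ways.

def parse_l3_schemata(line):
    resource, _, rest = line.strip().partition(":")
    assert resource == "L3", "only L3 lines handled in this sketch"
    masks = {}
    for entry in rest.split(";"):
        cache_id, mask = entry.split("=")
        masks[int(cache_id)] = int(mask, 16)
    return masks

def ways(mask):
    # Number of cache ways granted by a capacity bitmask.
    return bin(mask).count("1")
```

So p0's "L3:0=3;1=c" gives it ways {0,1} on cache 0 and ways {2,3} on cache 1, disjoint from p1 on cache 1 ("1=3") but overlapping on cache 0.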
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
>> Nice to have:
>> 1) Readout using "perf(1)" [subset of modes that make sense ... tying
>> monitoring to resctrl file system will make most command line usage of
>> perf(1) close to impossible.
>
> We discussed this offline and I still disagree that it is close to
> impossible to use perf and perf_event_open. In fact, I think it's very
> simple:

Maybe s/most/many/ ?

The issue here is that we are going to define which tasks and cpus are being monitored *outside* of the perf command. So usage like:

# perf stat -I 1000 -e intel_cqm/llc_occupancy {command}

is completely out of scope ... we aren't planning to change the perf(1) command to know about creating a CQM monitor group, assigning the PID of {command} to it, and then reporting on llc_occupancy.

So perf(1) usage is only going to support modes where it attaches to some monitor group that was previously established. The "-C 2" option to monitor CPU 2 is certainly plausible ... assuming you set up a monitor group to track what is happening on CPU 2 ... I just don't know how perf(1) would know the name of that group.

Vikas is pushing for "-R rdtgroup" ... though our offline discussions included overloading "-g" and having perf(1) pick appropriately from cgroups or rdtgroups depending on event type.

-Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, 1 Feb 2017, Yu, Fenghua wrote:
> From: Andi Kleen [mailto:a...@firstfloor.org]
> > "Luck, Tony" writes:
> > > 9) Measure per logical CPU (pick active RMID in same precedence for
> > > task/cpu as CAT picks CLOSID)
> > > 10) Put multiple CPUs into a group
> >
> > I'm not sure this is a real requirement. It's just an optimization, right?
> > If you can assign policies to threads, you can implicitly set it per CPU
> > through affinity (or the other way around). The only benefit would be
> > possibly less context switch overhead, but if all the threads (including
> > idle) assigned to a CPU have the same policy it would have the same results.
> >
> > I suspect dropping this would likely simplify the interface significantly.
>
> Assigning a pid P to a CPU and monitoring P doesn't count all events
> happening on the CPU. Other processes/threads (e.g. kernel threads) than
> the assigned P can run on the CPU. Monitoring P assigned to the CPU is not
> equal to monitoring the CPU in a lot of cases.

This matches the use case where a bunch of real time tasks which have no CLOSID (kernel threads or others in the root group) want to run exclusively on a cpu and are configured so. If any other tasks run there from another class of service we don't want them to pollute the cache - hence they choose their own CLOSID. Now in order to measure this, RMIDs need to match the same policy as CAT.

Thanks,
Vikas

> Thanks.
> -Fenghua
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
>> 7) Must be able to measure based on existing resctrl CAT group
>> 8) Can get measurements for subsets of tasks in a CAT group (to find
>> the guys hogging the resources)
>> 9) Measure per logical CPU (pick active RMID in same precedence for
>> task/cpu as CAT picks CLOSID)
>
> I agree that "Measure per logical CPU" is a requirement, but why is
> "pick active RMID in same precedence for task/cpu as CAT picks CLOSID"
> one as well? Are we set on handling RMIDs the way CLOSIDs are
> handled? There are drawbacks to do so, one is that it would make it
> impossible to do CPU monitoring and CPU filtering the way it is done
> for all other PMUs.

I'm too focused on monitoring existing CAT groups. If we move the parenthetical remark from item 9 to item 7, then I think it is better. When monitoring a CAT group we need to monitor exactly the processes that are controlled by the CAT group. So RMID must match CLOSID, and the precedence rules make that work.

For other monitoring cases we can do things differently - so long as we have a way to express what we want, and we don't pile a ton of code into context switch to figure out which RMID is to be loaded into PQR_ASSOC.

I thought of another requirement this morning:

N+1) When we set up monitoring we must allocate all the resources we need (or fail the setup if we can't get them). Not allowed to error in the middle of monitoring because we can't find a free RMID.

-Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
> From: Andi Kleen [mailto:a...@firstfloor.org]
> > "Luck, Tony" writes:
> > > 9) Measure per logical CPU (pick active RMID in same precedence for
> > > task/cpu as CAT picks CLOSID)
> > > 10) Put multiple CPUs into a group
> >
> > I'm not sure this is a real requirement. It's just an optimization, right?
> > If you can assign policies to threads, you can implicitly set it per CPU
> > through affinity (or the other way around).
> > The only benefit would be possibly less context switch overhead, but if all
> > the threads (including idle) assigned to a CPU have the same policy it would
> > have the same results.
> >
> > I suspect dropping this would likely simplify the interface significantly.

Assigning a pid P to a CPU and monitoring P doesn't count all events happening on the CPU. Other processes/threads (e.g. kernel threads) than the assigned P can run on the CPU. Monitoring P assigned to the CPU is not equal to monitoring the CPU in a lot of cases.

Thanks.
-Fenghua
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
> > I'm not sure this is a real requirement. It's just an optimization,
> > right? If you can assign policies to threads, you can implicitly set it
> > per CPU through affinity (or the other way around).
>
> That's difficult when distinct users/systems do monitoring and system
> management. What if the cluster manager decides to change affinity
> for a task after the monitoring service has initiated monitoring a CPU
> in the way you describe?

Why would you want to monitor a CPU if you don't know what it is running? The results would be meaningless. So you really want to integrate those two services.

> > The only benefit would be possibly less context switch overhead,
> > but if all the threads (including idle) assigned to a CPU have the
> > same policy it would have the same results.
>
> I think another of the reasons for the CPU monitoring requirement is
> to monitor interruptions in CPUs running the idle thread.

In CAT, idle threads are just threads, so they could be just exposed to perf (e.g. a combination of pid 0 + cpu filter).

> Also, if perf-like monitoring is supported, it'd allow something like
>
> perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

This would work without a special API.

-Andi
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, Feb 1, 2017 at 4:35 PM, Andi Kleen wrote:
> "Luck, Tony" writes:
> > 9) Measure per logical CPU (pick active RMID in same precedence for
> > task/cpu as CAT picks CLOSID)
> > 10) Put multiple CPUs into a group
>
> I'm not sure this is a real requirement. It's just an optimization,
> right? If you can assign policies to threads, you can implicitly set it
> per CPU through affinity (or the other way around).

That's difficult when distinct users/systems do monitoring and system management. What if the cluster manager decides to change affinity for a task after the monitoring service has initiated monitoring a CPU in the way you describe?

> The only benefit would be possibly less context switch overhead,
> but if all the threads (including idle) assigned to a CPU have the
> same policy it would have the same results.

I think another of the reasons for the CPU monitoring requirement is to monitor interruptions in CPUs running the idle thread. In CAT, those interruptions use the CPU's CLOSID. Here they'd use the CPU's RMID. Since RMIDs are scarce, CPUs can be aggregated into groups to save many of them.

Also, if perf-like monitoring is supported, it'd allow something like:

perf stat -e LLC-load,LLC-prefetches,intel_cqm/total_bytes -C 2

Thanks,
David
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
"Luck, Tony" writes:
> 9) Measure per logical CPU (pick active RMID in same precedence for
> task/cpu as CAT picks CLOSID)
> 10) Put multiple CPUs into a group

I'm not sure this is a real requirement. It's just an optimization, right? If you can assign policies to threads, you can implicitly set it per CPU through affinity (or the other way around).

The only benefit would be possibly less context switch overhead, but if all the threads (including idle) assigned to a CPU have the same policy it would have the same results.

I suspect dropping this would likely simplify the interface significantly.

-Andi
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, Feb 1, 2017 at 12:08 PM Luck, Tony wrote:
>
> > I was asking for requirements, not a design proposal. In order to make a
> > design you need a requirements specification.
>
> Here's what I came up with ... not a fully baked list, but should allow for
> some useful discussion on whether any of these are not really needed, or if
> there is a glaring hole that misses some use case:
>
> 1) Able to measure using all supported events (currently L3 occupancy,
> Total B/W, Local B/W)
> 2) Measure per thread
> 3) Including kernel threads
> 4) Put multiple threads into a single measurement group (forced by h/w
> shortage of RMIDs, but probably good to have anyway)

Even with infinite hw RMIDs you want to be able to have one RMID per thread group, to avoid reading a potentially large list of RMIDs every time you measure one group's event (with the delay and error associated with measuring many RMIDs whose values fluctuate rapidly).

> 5) New threads created inherit measurement group from parent
> 6) Report separate results per domain (L3)
> 7) Must be able to measure based on existing resctrl CAT group
> 8) Can get measurements for subsets of tasks in a CAT group (to find the
> guys hogging the resources)
> 9) Measure per logical CPU (pick active RMID in same precedence for
> task/cpu as CAT picks CLOSID)

I agree that "Measure per logical CPU" is a requirement, but why is "pick active RMID in same precedence for task/cpu as CAT picks CLOSID" one as well? Are we set on handling RMIDs the way CLOSIDs are handled? There are drawbacks to doing so; one is that it would make it impossible to do CPU monitoring and CPU filtering the way it is done for all other PMUs, i.e. the following commands (or their equivalent in whatever other API you create) won't work:

a) perf stat -e intel_cqm/total_bytes/ -C 2
or
b.1) perf stat -e intel_cqm/total_bytes/ -C 2
or
b.2) perf stat -e intel_cqm/llc_occupancy/ -a

in (a) because many RMIDs may run in the CPU and, in (b's), because the same measurement group's RMID will be used across all CPUs.

I know this is similar to how it is in CAT, but CAT was never intended to do monitoring. We can do the CAT way and the perf way, or not, but if we will drop support for perf-like CPU support, it must be explicitly stated and not an implicit consequence of a design choice leaked into requirements.

> 10) Put multiple CPUs into a group

11) Able to measure across CAT groups. So that a user can:
A) measure a task that runs on CPUs that are in different CAT groups (one of Thomas' use cases FWICT), and
B) measure tasks even if they change their CAT group (my use case).

> Nice to have:
> 1) Readout using "perf(1)" [subset of modes that make sense ... tying
> monitoring to resctrl file system will make most command line usage of
> perf(1) close to impossible.

We discussed this offline and I still disagree that it is close to impossible to use perf and perf_event_open. In fact, I think it's very simple:

a) We stretch the usage of the pid parameter in perf_event_open to also allow a PMU specific task group fd (as of now it's either a PID or a cgroup fd).
b) PMUs that can handle non-cgroup task groups have a special PMU_CAP flag to signal the generic code to not resolve the fd to a cgroup pointer and, instead, save it as is in struct perf_event (a few lines of code).
c) The PMU takes care of resolving the task group's fd.

The above is ONE way to do it; there may be others. But there is a big advantage in leveraging perf_event_open and easing integration with the perf tool and the myriad of tools that use the perf API.

12) Whatever fs or syscall is provided instead of perf syscalls, it should provide total_time_enabled the way perf does, otherwise it is hard to interpret MBM values.

> -Tony
RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
> I was asking for requirements, not a design proposal. In order to make a
> design you need a requirements specification.

Here's what I came up with ... not a fully baked list, but should allow for some useful discussion on whether any of these are not really needed, or if there is a glaring hole that misses some use case:

1) Able to measure using all supported events (currently L3 occupancy, Total B/W, Local B/W)
2) Measure per thread
3) Including kernel threads
4) Put multiple threads into a single measurement group (forced by h/w shortage of RMIDs, but probably good to have anyway)
5) New threads created inherit measurement group from parent
6) Report separate results per domain (L3)
7) Must be able to measure based on existing resctrl CAT group
8) Can get measurements for subsets of tasks in a CAT group (to find the guys hogging the resources)
9) Measure per logical CPU (pick active RMID in same precedence for task/cpu as CAT picks CLOSID)
10) Put multiple CPUs into a group

Nice to have:
1) Readout using "perf(1)" [subset of modes that make sense ... tying monitoring to resctrl file system will make most command line usage of perf(1) close to impossible.]

-Tony
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Mon, Jan 23, 2017 at 10:47:44AM +0100, Thomas Gleixner wrote:
> So again:
>
> Can please everyone involved write up their specific requirements
> for CQM and stop spamming us with half-baked design proposals?
>
> And I mean abstract requirements and not again something which is
> referring to existing crap or some desired crap.
>
> The complete list of requirements has to be agreed on before we talk about
> anything else.

So something along the lines of:

A) need to create a (named) group of tasks
   1) group composition needs to be dynamic; ie. we can add/remove member tasks at any time.
   2) a task can only belong to _one_ group at any one time.
   3) grouping need not be hierarchical?

B) for each group, we need to set a CAT mask
   1) this CAT mask must be dynamic; ie. we can, during the existence of the group, change the mask at any time.

C) for each group, we need to monitor CQM bits
   1) this monitor need not change

Supporting Use-Cases:

A.1: The Job (or VM) can have a dynamic task set
B.1: Dynamic QoS for each Job (or VM) as demand / load changes

Feel free to expand etc..
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner wrote:
> > Can you please write up in an abstract way what the design requirements
> > are that you need. So far we are talking about implementation details and
> > unspecified wishlists, but what we really need is an abstract requirement.
>
> My pleasure:
>
> Design Proposal for Monitoring of RDT Allocation Groups.

I was asking for requirements, not a design proposal. In order to make a design you need a requirements specification.

So again:

Can everyone involved please write up their specific requirements for CQM and stop spamming us with half-baked design proposals?

And I mean abstract requirements, and not again something which is referring to existing crap or some desired crap.

The complete list of requirements has to be agreed on before we talk about anything else.

Thanks,
tglx
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:

On Fri, Jan 20, 2017 at 1:08 PM, Shivappa Vikas wrote:
>
> On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:
>
>> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner wrote:
>>>
>>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>>> If resctrl groups could lift the restriction of one resctrl group per
>>>> CLOSID, then the user can create many resctrl groups in the way perf
>>>> cgroups are created now. The advantage is that there won't be a cgroup
>>>> hierarchy, making things much simpler. Also no need to optimize the
>>>> perf event context switch to make llc_occupancy work.
>>>
>>> So if I understand you correctly, then you want a mechanism to have
>>> groups of entities (tasks, cpus) and associate them to a particular
>>> resource control group.
>>>
>>> So they share the CLOSID of the control group and each entity group
>>> can have its own RMID.
>>>
>>> Now you want to be able to move the entity groups around between
>>> control groups without losing the RMID associated to the entity group.
>>>
>>> So the whole picture would look like this:
>>>
>>>    rdt -> CTRLGRP -> CLOSID
>>>    mon -> MONGRP  -> RMID
>>>
>>> And you want to move MONGRP from one CTRLGRP to another.
>>
>> Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
>> same thing. Details below.
>>
>>> Can you please write up in an abstract way what the design
>>> requirements are that you need. So far we are talking about
>>> implementation details and unspecified wishlists, but what we really
>>> need is an abstract requirement.
>>
>> My pleasure:
>>
>> Design Proposal for Monitoring of RDT Allocation Groups
>> -------------------------------------------------------
>>
>> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
>> cache bitmask (CBM) per resource. Non-unique CBMs are possible although
>> useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
>> CLOSIDs are much more scarce than RMIDs.
>>
>> If we lift the condition of unique CLOSID, then the user can create
>> multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
>> would share the CLOSID, and RDT allocation must maintain the schemata
>> to CLOSID relationship (similarly to what the previous CAT driver used
>> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
>> now: adding an element removes it from its previous CTRLGRP.
>>
>> This change would allow further partitioning the allocation groups
>> into (allocation, monitoring) groups as follows:
>>
>> With allocation only:
>>                CTRLGRP0      CTRLGRP_ALLOC_ONLY
>>   schemata:    L3:0=0xff0    L3:0=0x00f
>>   tasks:       PID0          P0_0,P0_1,P1_0,P1_1
>>   cpus:        0x3           0xC
>
> Not clear what the PID0 and P0_0 mean ?

PID0 and P*_* are arbitrary PIDs. The tasks file works the same as it
does now in RDT. I am not changing that.

> If you have to support something like MONGRP and CTRLGRP overall you
> want to allow for a task to be present in multiple groups ?

I am not proposing to support MONGRP and CTRLGRP. I am proposing to
allow monitoring of CTRLGRPs only.

>> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
>> independently, with the new model we could create:
>>                CTRLGRP0      CTRLGRP1     CTRLGRP2     CTRLGRP3
>>   schemata:    L3:0=0xff0    L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
>>   tasks:       PID0                       P0_0,P0_1    P1_0,P1_1
>>   cpus:        0x3           0xC          0x0          0x0
>>
>> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID
>> for (L3,0).
>>
>> Now we can ask perf to monitor any of the CTRLGRPs independently, once
>> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor.
>> The perf_event will reserve and assign the RMID to the monitored
>> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
>> (CLOSID and RMID), so perf won't have to.
>
> This can be solved by supporting just the -t option in perf and a new
> option in perf to support resctrl group monitoring (something similar
> to -R). That way we provide the flexible granularity to monitor tasks
> independent of whether they are in any resctrl group (and hence also a
> subset).

One of the key points of my proposal is to remove monitoring of PIDs
independently. That simplifies things by letting RDT handle CLOSIDs and
RMIDs together.

>   CTRLGRP     TASKS       MASK
>   CTRLGRP1    PID1,PID2   L3:0=0xf,1=0xf0
>   CTRLGRP2    PID3,PID4   L3:0=0xf0,1=0xf00
>
>   # perf stat -e llc_occupancy -R CTRLGRP1
>   # perf stat -e llc_occupancy -t PID3,PID4
>
> The RMID allocation is independent of resctrl CLOSID allocation and
> hence the RMID is not always married to a CLOSID, which seems like the
> requirement here.

It is not a requirement. Both the CLOSID and the RMID of a CTRLGRP can
change in my proposal.

> OR
>
> We could have CTRLGRPs with control_only, monitor_only or
> control_monitor options. Now a task could be present in both a
> control_only and a monitor_only group, or it could be present only in
> a control_monitor group. The transitions from one state to another are
> guarded by this same principle.
>
>   CTRLGRP
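The PQR_ASSOC context switch David mentions hinges on the CLOSID and RMID living in one MSR. As a rough illustration of why the RDT subsystem can switch both at once (the layout follows the Intel SDM description of IA32_PQR_ASSOC, MSR 0xC8F, with the RMID in bits 9:0 and the CLOSID in bits 63:32; this helper is a sketch, not code from the patch series):

```c
#include <stdint.h>

#define MSR_IA32_PQR_ASSOC 0xc8f  /* RMID in bits 9:0, CLOSID in bits 63:32 */

/* Compose the value the kernel would write on context switch, so that
 * allocation (CLOSID) and monitoring (RMID) are updated by one wrmsr. */
static inline uint64_t pqr_assoc_val(uint32_t closid, uint32_t rmid)
{
    return ((uint64_t)closid << 32) | (rmid & 0x3ffu);
}
```

A single write of this value on task switch is what makes per-CTRLGRP monitoring free of extra perf context-switch work.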
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, 20 Jan 2017, David Carrillo-Cisneros wrote:

> On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner wrote:
>>
>> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>>> If resctrl groups could lift the restriction of one resctrl group per
>>> CLOSID, then the user can create many resctrl groups in the way perf
>>> cgroups are created now. The advantage is that there won't be a cgroup
>>> hierarchy, making things much simpler. Also no need to optimize the
>>> perf event context switch to make llc_occupancy work.
>>
>> So if I understand you correctly, then you want a mechanism to have
>> groups of entities (tasks, cpus) and associate them to a particular
>> resource control group.
>>
>> So they share the CLOSID of the control group and each entity group
>> can have its own RMID.
>>
>> Now you want to be able to move the entity groups around between
>> control groups without losing the RMID associated to the entity group.
>>
>> So the whole picture would look like this:
>>
>>    rdt -> CTRLGRP -> CLOSID
>>    mon -> MONGRP  -> RMID
>>
>> And you want to move MONGRP from one CTRLGRP to another.
>
> Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the
> same thing. Details below.
>
>> Can you please write up in an abstract way what the design
>> requirements are that you need. So far we are talking about
>> implementation details and unspecified wishlists, but what we really
>> need is an abstract requirement.
>
> My pleasure:
>
> Design Proposal for Monitoring of RDT Allocation Groups
> -------------------------------------------------------
>
> Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
> cache bitmask (CBM) per resource. Non-unique CBMs are possible although
> useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs.
> CLOSIDs are much more scarce than RMIDs.
>
> If we lift the condition of unique CLOSID, then the user can create
> multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
> would share the CLOSID, and RDT allocation must maintain the schemata
> to CLOSID relationship (similarly to what the previous CAT driver used
> to do). Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as
> now: adding an element removes it from its previous CTRLGRP.
>
> This change would allow further partitioning the allocation groups
> into (allocation, monitoring) groups as follows:
>
> With allocation only:
>                CTRLGRP0      CTRLGRP_ALLOC_ONLY
>   schemata:    L3:0=0xff0    L3:0=0x00f
>   tasks:       PID0          P0_0,P0_1,P1_0,P1_1
>   cpus:        0x3           0xC

Not clear what the PID0 and P0_0 mean ?

If you have to support something like MONGRP and CTRLGRP overall you
want to allow for a task to be present in multiple groups ?

> If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
> independently, with the new model we could create:
>                CTRLGRP0      CTRLGRP1     CTRLGRP2     CTRLGRP3
>   schemata:    L3:0=0xff0    L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
>   tasks:       PID0                       P0_0,P0_1    P1_0,P1_1
>   cpus:        0x3           0xC          0x0          0x0
>
> Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID
> for (L3,0).
>
> Now we can ask perf to monitor any of the CTRLGRPs independently, once
> we solve how to pass to perf what (CTRLGRP, resource_id) to monitor.
> The perf_event will reserve and assign the RMID to the monitored
> CTRLGRP. The RDT subsystem will context switch the whole PQR_ASSOC MSR
> (CLOSID and RMID), so perf won't have to.

This can be solved by supporting just the -t option in perf and a new
option in perf to support resctrl group monitoring (something similar
to -R). That way we provide the flexible granularity to monitor tasks
independent of whether they are in any resctrl group (and hence also a
subset).

  CTRLGRP     TASKS       MASK
  CTRLGRP1    PID1,PID2   L3:0=0xf,1=0xf0
  CTRLGRP2    PID3,PID4   L3:0=0xf0,1=0xf00

  # perf stat -e llc_occupancy -R CTRLGRP1
  # perf stat -e llc_occupancy -t PID3,PID4

The RMID allocation is independent of resctrl CLOSID allocation and
hence the RMID is not always married to a CLOSID, which seems like the
requirement here.

OR

We could have CTRLGRPs with control_only, monitor_only or
control_monitor options. Now a task could be present in both a
control_only and a monitor_only group, or it could be present only in a
control_monitor group. The transitions from one state to another are
guarded by this same principle.

  CTRLGRP     TASKS       MASK                TYPE
  CTRLGRP1    PID1,PID2   L3:0=0xf,1=0xf0     control_only
  CTRLGRP2    PID3,PID4   L3:0=0xf0,1=0xf00   control_only
  CTRLGRP3    PID2,PID3                       monitor_only
  CTRLGRP4    PID5,PID6   L3:0=0xf0,1=0xf00   control_monitor

CTRLGRP3 allows you to monitor a set of tasks which is not bound to be
in the same CTRLGRP, and you can add or move tasks into it. Adding and
removing tasks is what is easily supported, compared to task
granularity, although such a thing could still be supported with the
task granularity.
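The bookkeeping David's proposal implies, mapping identical schemata onto one shared CLOSID, can be sketched as a small refcounted table. This is an illustrative toy (the table size, names, and string-keyed lookup are invented here), not the resctrl implementation:

```c
#include <string.h>

#define MAX_CLOSIDS 16           /* hypothetical hardware limit */

struct closid_entry {
    char schemata[64];           /* e.g. "L3:0=0x00f" */
    int  refcount;               /* CTRLGRPs sharing this CLOSID */
};

static struct closid_entry table[MAX_CLOSIDS];

/* Return a CLOSID for a schemata, reusing an existing one when the
 * schemata matches; -1 when all CLOSIDs are exhausted. */
static int closid_get(const char *schemata)
{
    int free_slot = -1;

    for (int i = 0; i < MAX_CLOSIDS; i++) {
        if (table[i].refcount && !strcmp(table[i].schemata, schemata)) {
            table[i].refcount++;         /* share the CLOSID */
            return i;
        }
        if (!table[i].refcount && free_slot < 0)
            free_slot = i;
    }
    if (free_slot < 0)
        return -1;                       /* out of CLOSIDs */
    strncpy(table[free_slot].schemata, schemata,
            sizeof(table[free_slot].schemata) - 1);
    table[free_slot].refcount = 1;
    return free_slot;
}
```

With this, two CTRLGRPs created with the same schemata land on one CLOSID, which is exactly what lets the number of groups exceed the number of physical CLOSIDs.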
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:

> On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa wrote:
>> Resending including Thomas, also with some changes. Sorry for the
>> spam.
>>
>> Based on Thomas and Peterz feedback, I can think of two design
>> variants which target:
>>
>> - Support monitoring and allocating using the same resctrl group. The
>>   user can use a resctrl group to allocate resources and also monitor
>>   them (with respect to tasks or cpu).
>>
>> - Also allow monitoring outside of resctrl, so that the user can
>>   monitor subgroups which use the same CLOSID. This mode can be used
>>   when the user wants to monitor more than just the resctrl groups.
>>
>> The first design version uses and modifies perf_cgroup, the second
>> version builds a new interface, resmon.
>
> The second version would require building a whole new set of tools,
> deploying them and maintaining them. Users will have to run perf for
> certain events and resmon (or whatever the new tool is named) for rdt.
> I see it as too complex and much prefer to keep using perf.

This was so that we have the flexibility to align the tools as per the
requirements of the feature, rather than twisting the perf behaviour,
and also have that flexibility in the future when new RDT features are
added (something similar to what we did by introducing resctrl groups
instead of using cgroups for CAT). Sometimes that is a lot simpler, as
we don't need a lot of code given the limited/specific syscalls we need
to support. Just like the resctrl fs, which is specific to RDT.

It looks like your requirement is to be able to monitor a group of
tasks independently, apart from the resctrl groups? The task option
should provide the flexibility to monitor a bunch of tasks
independently of whether they are part of a resctrl group. The
assignment of RMIDs is controlled underneath by the kernel, so we can
optimize the usage of RMIDs, and RMIDs are tied to this group of tasks
whether it is a subset of a resctrl group or not.

>> The first version is close to the patches sent, with some
>> additions/changes. This includes details of the design as per
>> Thomas/Peterz feedback.
>>
>> 1> First design option: without modifying resctrl, using perf
>>
>> In this design everything in the resctrl interface works like before
>> (the info and resource group files like tasks and schemata all remain
>> the same).
>>
>> Monitor cqm using perf
>> ----------------------
>>
>> perf can monitor individual tasks using the -t option just like
>> before:
>>
>>   # perf stat -e llc_occupancy -t PID1,PID2
>>
>> The user can monitor cpu occupancy using the -C option in perf:
>>
>>   # perf stat -e llc_occupancy -C 5
>>
>> Below shows how a user can monitor cgroup occupancy:
>>
>>   # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
>>   # mkdir /sys/fs/cgroup/perf_event/g1
>>   # mkdir /sys/fs/cgroup/perf_event/g2
>>   # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>>   # perf stat -e intel_cqm/llc_occupancy/ -a -G g2
>>
>> To monitor a resctrl group, the user can put the same tasks from the
>> resctrl group into a cgroup. To monitor the tasks in p1 in example 2
>> below, add the tasks in resctrl group p1 to cgroup g1:
>>
>>   # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>>
>> Introducing a new option for resctrl may complicate monitoring,
>> because supporting cgroup 'task groups' and resctrl 'task groups'
>> leads to situations where, if the groups intersect, there is no way
>> to know which l3_allocations contribute to which group. For example:
>>
>>   p1 has tasks t1, t2, t3
>>   g1 has tasks t2, t3, t4
>>
>> The only way to get occupancy for g1 and p1 would be to allocate an
>> RMID for each task, which can as well be done with the -t option.
>
> That's simply recreating the resctrl group as a cgroup. I think that
> the main advantage of doing allocation first is that we could use the
> context switch in rdt allocation and greatly simplify the pmu side of
> it.
>
> If resctrl groups could lift the restriction of one resctrl group per
> CLOSID, then the user can create many resctrl groups in the way perf
> cgroups are created now. The advantage is that there won't be a cgroup
> hierarchy, making things much simpler. Also no need to optimize the
> perf event context switch to make llc_occupancy work.
>
> Then we only need a way to express to the perf_event_open syscall that
> monitoring must happen in a resctrl group. My first thought is to have
> a "rdt_monitor" file per resctrl group. A user passes it to
> perf_event_open in the way cgroups are passed now. We could extend the
> meaning of the flag PERF_FLAG_PID_CGROUP to also cover rdt_monitor
> files. The syscall can figure out whether it is a cgroup or an rdt
> group. The rdt_monitoring PMU would only work with rdt_monitor groups.
>
> The rdt_monitoring PMU will then be pretty dumb, having neither task
> nor CPU contexts, and just providing the pmu->read and pmu->event_init
> functions.
>
> Task monitoring can be done with resctrl as well, by adding the PID to
> a new resctrl group and opening the event on it. And, since we'd allow
> CLOSID to be
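The cgroup-fd convention David wants to extend is the existing perf_event_open one: the pid argument carries an open fd on the cgroup directory and PERF_FLAG_PID_CGROUP marks it as such. A minimal sketch of that call shape, assuming Linux headers; a software event stands in for intel_cqm/llc_occupancy/, whose dynamic PMU type number varies per system:

```c
#include <unistd.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/perf_event.h>

/* Open a counting event over all tasks of a cgroup on one CPU.
 * cgroup_fd is an fd obtained by open()ing the cgroup directory; in
 * David's proposal the same flag would also accept an "rdt_monitor"
 * file from a resctrl group. */
static long open_cgroup_event(int cgroup_fd, int cpu)
{
    struct perf_event_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size   = sizeof(attr);
    attr.type   = PERF_TYPE_SOFTWARE;        /* stand-in for intel_cqm */
    attr.config = PERF_COUNT_SW_TASK_CLOCK;

    /* With PERF_FLAG_PID_CGROUP set, the pid slot carries the fd. */
    return syscall(SYS_perf_event_open, &attr, cgroup_fd, cpu,
                   -1, PERF_FLAG_PID_CGROUP);
}
```

Extending this path to resctrl would mean teaching the syscall to recognize an fd on a resctrl file here, exactly as the email suggests.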
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Jan 20, 2017 at 12:30 AM, Thomas Gleixner wrote:
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner wrote:
>> > Above you are talking about the same CLOSID and different RMIDS and
>> > not about changing both.
>>
>> The scenario I talked about implies changing the CLOSID without
>> affecting monitoring. It happens when the allocation needs for a
>> thread/cgroup/CPU change dynamically. Forcing a change of the RMID
>> together with the CLOSID would give wrong monitoring values unless the
>> old RMID is kept around until it becomes free, which is ugly and would
>> waste a RMID.
>
> When the allocation needs for a resource control group change, then we
> simply update the allocation constraints of that group without changing
> the CLOSID. So everything just stays the same.
>
> If you move entities to a different group then of course the CLOSID
> changes, and then it's a different story how to deal with monitoring.
>
>> > To gather any useful information for both CPU1 and T1 you need TWO
>> > RMIDs. Everything else is voodoo and crystal ball analysis and we
>> > are not going to support that.
>>
>> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
>> just because the CLOSID changed is wasteful.
>
> Again, the CLOSID only changes if you move entities to a different
> resource control group, and in that case the RMID change is the least of
> your worries.
>
>> Correct. But there may not be a fixed CLOSID association if loads
>> exhibit dynamic behavior and/or system load changes dynamically.
>
> So, you really want to move entities around between resource control
> groups dynamically? I'm not seeing why you would want to do that, but
> I'm all ears to get educated on that.

No, I don't want to move entities across resource control groups. I was
confused by the idea of CLOSIDs being married to control groups, but now
it is clear even to me that that was never the intention.

Thanks,
David

>
> Thanks,
>
> tglx
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Fri, Jan 20, 2017 at 5:29 AM Thomas Gleixner wrote:
>
> On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> >
> > If resctrl groups could lift the restriction of one resctl per CLOSID,
> > then the user can create many resctrl in the way perf cgroups are
> > created now. The advantage is that there wont be cgroup hierarchy!
> > making things much simpler. Also no need to optimize perf event
> > context switch to make llc_occupancy work.
>
> So if I understand you correctly, then you want a mechanism to have
> groups of entities (tasks, cpus) and associate them with a particular
> resource control group.
>
> So they share the CLOSID of the control group and each entity group can
> have its own RMID.
>
> Now you want to be able to move the entity groups around between control
> groups without losing the RMID associated with the entity group.
>
> So the whole picture would look like this:
>
>    rdt -> CTRLGRP -> CLOSID
>    mon -> MONGRP  -> RMID
>
> And you want to move MONGRP from one CTRLGRP to another.

Almost, but not quite. My idea is to have MONGRP and CTRLGRP be the same
thing. Details below.

> Can you please write up in an abstract way what the design requirements
> are that you need. So far we are talking about implementation details
> and unspecified wishlists, but what we really need is an abstract
> requirement.

My pleasure:

Design Proposal for Monitoring of RDT Allocation Groups
-------------------------------------------------------

Currently each CTRLGRP has a unique CLOSID and a (most likely) unique
cache bitmask (CBM) per resource. Non-unique CBMs are possible, although
useless. A unique CLOSID forbids more CTRLGRPs than physical CLOSIDs, and
CLOSIDs are much more scarce than RMIDs.

If we lift the condition of unique CLOSIDs, then the user can create
multiple CTRLGRPs with the same schemata. Internally, those CTRLGRPs
would share the CLOSID, and RDT_Allocation must maintain the
schemata-to-CLOSID relationship (similarly to what the previous CAT
driver used to do).

Elements in CTRLGRP.tasks and CTRLGRP.cpus behave the same as now: adding
an element removes it from its previous CTRLGRP.

This change would allow further partitioning the allocation groups into
(allocation, monitoring) groups as follows:

With allocation only:

               CTRLGRP0      CTRLGRP_ALLOC_ONLY
  schemata:  L3:0=0xff0    L3:0=0x00f
  tasks:     PID0          P0_0,P0_1,P1_0,P1_1
  cpus:      0x3           0xC

If we want to monitor (P0_0,P0_1), (P1_0,P1_1) and CPUs 0xC
independently, with the new model we could create:

               CTRLGRP0      CTRLGRP1     CTRLGRP2     CTRLGRP3
  schemata:  L3:0=0xff0    L3:0=0x00f   L3:0=0x00f   L3:0=0x00f
  tasks:     PID0                        P0_0,P0_1    P1_0,P1_1
  cpus:      0x3           0xC          0x0          0x0

Internally, CTRLGRP1, CTRLGRP2, and CTRLGRP3 would share the CLOSID for
(L3,0).

Now we can ask perf to monitor any of the CTRLGRPs independently - once
we solve how to pass to perf which (CTRLGRP, resource_id) to monitor. The
perf_event will reserve and assign the RMID to the monitored CTRLGRP. The
RDT subsystem will context switch the whole PQR_ASSOC MSR (CLOSID and
RMID), so perf won't have to.

If a CTRLGRP's schemata changes, the RDT subsystem will find a new CLOSID
for the new schemata (potentially reusing an existing one) or fail (just
like the old CAT used to). The RMID does not change during schemata
updates.

If a CTRLGRP dies, the monitoring perf_event continues to exist as a
useless wraith, just as happens with cgroup events now.

Since CTRLGRPs have no hierarchy, there is no need to handle that in the
new RDT Monitoring PMU, greatly simplifying it over the previously
proposed versions.

A breaking change in user-observed behavior with respect to the existing
CQM PMU is that there wouldn't be task events. A task must be part of a
CTRLGRP, and events are created per (CTRLGRP, resource_id) pair. If a
user wants to monitor a task across multiple resources (e.g.
l3_occupancy across two packages), she must create one event per
resource_id and add the two counts.

I see this breaking change as an improvement, since hiding the cache
topology from user space introduced lots of ugliness and complexity to
the CQM PMU without improving accuracy over user space adding the events.

Implementation ideas:

The first idea is to expose one monitoring file per resource in a
CTRLGRP, so the list of a CTRLGRP's files would be: schemata, tasks,
cpus, monitor_l3_0, monitor_l3_1, ...

The monitor_* file descriptor is passed to perf_event_open in the way
cgroup file descriptors are passed now. All events for the same
(CTRLGRP, resource_id) pair share a RMID. The RMID allocation part can be
handled either by RDT Allocation or by the RDT Monitoring PMU. Either
way, the existence of the PMU's perf_events allocates/releases the RMID.

Also, since this new design removes hierarchy and task events, it allows
for a simple solution of the RMID rotation problem.
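The schemata-to-CLOSID bookkeeping in the proposal above (CTRLGRPs with the same schemata share one CLOSID; a schemata update finds an existing CLOSID, allocates a fresh one, or fails like the old CAT driver) can be sketched as a small model. This is illustrative only, not kernel code; the names (`RdtAllocation`, `CLOSID_MAX`) are invented for the sketch:

```python
# Illustrative model of CLOSID sharing: groups with identical schemata
# share one CLOSID; a lookup either reuses a CLOSID, grabs a free one,
# or fails when the hardware CLOSIDs are exhausted.
CLOSID_MAX = 4  # hardware CLOSIDs are scarce; small value assumed for the sketch

class RdtAllocation:
    def __init__(self):
        self.closid_of = {}   # schemata string -> CLOSID
        self.refcount = {}    # CLOSID -> number of CTRLGRPs using it

    def get_closid(self, schemata):
        """Find-or-allocate a CLOSID for a schemata; None if exhausted."""
        if schemata in self.closid_of:
            clos = self.closid_of[schemata]
            self.refcount[clos] += 1
            return clos
        for clos in range(CLOSID_MAX):
            if self.refcount.get(clos, 0) == 0:
                # drop any stale schemata mapping for the reclaimed CLOSID
                self.closid_of = {s: c for s, c in self.closid_of.items()
                                  if c != clos}
                self.closid_of[schemata] = clos
                self.refcount[clos] = 1
                return clos
        return None  # fail, just like the old CAT driver

    def put_closid(self, clos):
        self.refcount[clos] -= 1

rdt = RdtAllocation()
c0 = rdt.get_closid("L3:0=0xff0")
c1 = rdt.get_closid("L3:0=0x00f")
c2 = rdt.get_closid("L3:0=0x00f")   # same schemata -> shared CLOSID
```

Note that the RMID never appears here: per the proposal, schemata updates only remap the CLOSID, while the monitoring RMID stays put.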
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappa wrote:
>
> Resending including Thomas, also with some changes. Sorry for the spam.
>
> Based on Thomas and Peterz feedback, I can think of two design variants
> which target:
>
> - Support monitoring and allocating using the same resctrl group.
>   The user can use a resctrl group to allocate resources and also
>   monitor them (with respect to tasks or cpu).
>
> - Also allow monitoring outside of resctrl so that the user can monitor
>   subgroups which use the same closid. This mode can be used when the
>   user wants to monitor more than just the resctrl groups.
>
> The first design version uses and modifies perf_cgroup, the second
> version builds a new interface, resmon. The first version is close to
> the patches sent, with some additions/changes. This includes details of
> the design as per Thomas/Peterz feedback.
>
> 1> First Design option: without modifying resctrl and using perf
>
> In this design everything in the resctrl interface works like before
> (the info and resource group files like tasks and schemata all remain
> the same).
>
> Monitor cqm using perf
> ----------------------
>
> perf can monitor individual tasks using the -t option just like before:
>
> # perf stat -e llc_occupancy -t PID1,PID2
>
> The user can monitor cpu occupancy using the -C option in perf:
>
> # perf stat -e llc_occupancy -C 5
>
> Below shows how the user can monitor cgroup occupancy:
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g2
> # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks
>
> # perf stat -e intel_cqm/llc_occupancy/ -a -G g2

Presented this way, this does not quite address the use case I described
earlier here. We want to be able to monitor the cgroup allocations from
first thread creation. What you have above has a large gap. Many apps do
allocations as their very first steps, so if you do:

$ my_test_prg &
[1456]
$ echo 1456 > /sys/fs/cgroup/perf_event/g2/tasks
$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2

you have a race. But if you allow:

$ perf stat -e intel_cqm/llc_occupancy/ -a -G g2   (i.e., on an empty cgroup)
$ echo $$ > /sys/fs/cgroup/perf_event/g2/tasks     (put the shell in the
                                                    cgroup, so my_test_prg
                                                    runs immediately in it)
$ my_test_prg &

then there is a way to avoid the gap.

> To monitor a resctrl group, the user can put the same tasks in the
> resctrl group into a cgroup.
>
> To monitor the tasks in p1 in example 2 below, add the tasks in resctrl
> group p1 to cgroup g1:
>
> # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks
>
> Introducing a new option for resctrl may complicate monitoring, because
> supporting cgroup 'task groups' and resctrl 'task groups' leads to
> situations where, if the groups intersect, there is no way to know
> which l3_allocations contribute to which group.
>
> ex:
> p1 has tasks t1, t2, t3
> g1 has tasks t2, t3, t4
>
> The only way to get occupancy for g1 and p1 would be to allocate an
> RMID for each task, which can just as well be done with the -t option.
>
> Monitoring cqm cgroups Implementation
> -------------------------------------
>
> When monitoring two different cgroups in the same hierarchy (say g11
> has an ancestor g1 and both are being monitored, as shown below), we
> need the g11 counts to be considered for g1 as well.
>
> # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
> # mkdir /sys/fs/cgroup/perf_event/g1
> # mkdir /sys/fs/cgroup/perf_event/g1/g11
>
> When measuring llc_occupancy for g1, we cannot write two different
> RMIDs during context switch to measure the occupancy for both g1 and
> g11 (because we need to count for g11 as well). Hence the driver
> maintains this information and, during context switch, writes the RMID
> of the lowest member in the ancestry which is being monitored.
>
> The cqm_info is added to the perf_cgroup structure to maintain this
> information. The structure is allocated and destroyed at css_alloc and
> css_free. All the events tied to a cgroup can use the same information
> while reading the counts.
>
> struct perf_cgroup {
> #ifdef CONFIG_INTEL_RDT_M
>         void *cqm_info;
> #endif
>         ...
> };
>
> struct cqm_info {
>         bool mon_enabled;
>         int level;
>         u32 *rmid;
>         struct cgrp_cqm_info *mfa;
>         struct list_head tskmon_rlist;
> };
>
> Due to the hierarchical nature of cgroups, every cgroup just monitors
> for the 'nearest monitored ancestor' at all times. Since the root
> cgroup is always monitored, all descendants at boot time monitor for
> root, and hence all mfa pointers point to root, except for root->mfa
> which is NULL.
>
> 1. RMID setup: when cgroup x starts monitoring:
>    for each descendant y, if y's mfa->level < x->level, then
>    y->mfa = x. (Where the level of the root node = 0...)
> 2. sched_in: during sched_in for x
>
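The 'nearest monitored ancestor' (mfa) bookkeeping quoted above can be sketched as a toy model. This is illustrative only, not the driver's code; `Cgroup`, `start_monitoring`, and `sched_in_rmid_owner` are invented names for the sketch:

```python
# Toy model of the mfa update rule: when cgroup x starts monitoring,
# every descendant whose current mfa is shallower than x is re-pointed
# at x, so sched_in always finds the nearest monitored ancestor.
class Cgroup:
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.level = 0 if parent is None else parent.level + 1
        self.mon_enabled = parent is None   # root cgroup is always monitored
        self.mfa = None                     # nearest monitored ancestor
        if parent is not None:
            parent.children.append(self)
            self.mfa = parent if parent.mon_enabled else parent.mfa

def descendants(x):
    for child in x.children:
        yield child
        yield from descendants(child)

def start_monitoring(x):
    """RMID setup step: x starts monitoring, descendants re-point at it."""
    x.mon_enabled = True
    for y in descendants(x):
        if y.mfa.level < x.level:
            y.mfa = x

def sched_in_rmid_owner(x):
    # during sched_in: use x's own RMID if monitored, else its mfa's
    return x if x.mon_enabled else x.mfa

root = Cgroup("root")
g1 = Cgroup("g1", root)
g11 = Cgroup("g11", g1)
```

Before `start_monitoring(g1)`, g11's occupancy is attributed to root; afterwards it is attributed to g1, which is exactly why g1's count includes g11's in the scheme described above.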
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
>
> If resctrl groups could lift the restriction of one resctl per CLOSID,
> then the user can create many resctrl in the way perf cgroups are
> created now. The advantage is that there wont be cgroup hierarchy!
> making things much simpler. Also no need to optimize perf event
> context switch to make llc_occupancy work.

So if I understand you correctly, then you want a mechanism to have
groups of entities (tasks, cpus) and associate them with a particular
resource control group.

So they share the CLOSID of the control group and each entity group can
have its own RMID.

Now you want to be able to move the entity groups around between control
groups without losing the RMID associated with the entity group.

So the whole picture would look like this:

   rdt -> CTRLGRP -> CLOSID
   mon -> MONGRP  -> RMID

And you want to move MONGRP from one CTRLGRP to another.

Can you please write up in an abstract way what the design requirements
are that you need. So far we are talking about implementation details and
unspecified wishlists, but what we really need is an abstract
requirement.

Thanks,

	tglx
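For reference, the (CLOSID, RMID) pair in the picture above is exactly what the hardware consumes: on x86, both are written to the IA32_PQR_ASSOC MSR on context switch, with the RMID in the low 32 bits and the COS (CLOSID) in bits 63:32 (per the Intel SDM). A minimal sketch of the packing, with hypothetical helper names:

```python
# Composition of the IA32_PQR_ASSOC MSR value: RMID in bits 31:0
# (effective width is implementation-dependent), CLOSID/COS in bits 63:32.
def pqr_assoc(closid, rmid):
    return (closid << 32) | rmid

def pqr_closid(val):
    return val >> 32

def pqr_rmid(val):
    return val & 0xFFFFFFFF

# e.g. a task in a control group with CLOSID 2, monitored under RMID 7:
msr_val = pqr_assoc(closid=2, rmid=7)
```

One MSR write per context switch thus covers both the allocation (CTRLGRP -> CLOSID) and monitoring (MONGRP -> RMID) halves of the picture.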
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote:
> On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner wrote:
> > Above you are talking about the same CLOSID and different RMIDS and
> > not about changing both.
>
> The scenario I talked about implies changing the CLOSID without
> affecting monitoring. It happens when the allocation needs for a
> thread/cgroup/CPU change dynamically. Forcing a change of the RMID
> together with the CLOSID would give wrong monitoring values unless the
> old RMID is kept around until it becomes free, which is ugly and would
> waste a RMID.

When the allocation needs for a resource control group change, then we
simply update the allocation constraints of that group without changing
the CLOSID. So everything just stays the same.

If you move entities to a different group then of course the CLOSID
changes, and then it's a different story how to deal with monitoring.

> > To gather any useful information for both CPU1 and T1 you need TWO
> > RMIDs. Everything else is voodoo and crystal ball analysis and we
> > are not going to support that.
>
> Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU
> just because the CLOSID changed is wasteful.

Again, the CLOSID only changes if you move entities to a different
resource control group, and in that case the RMID change is the least of
your worries.

> Correct. But there may not be a fixed CLOSID association if loads
> exhibit dynamic behavior and/or system load changes dynamically.

So, you really want to move entities around between resource control
groups dynamically? I'm not seeing why you would want to do that, but
I'm all ears to get educated on that.

Thanks,

	tglx
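The "old RMID kept around until it becomes free" situation David mentions above is the core of RMID recycling: cache lines stay tagged with a RMID after its last user is gone, so a freed RMID cannot be handed out again until its measured occupancy drains. A toy model of such a limbo list (illustrative only; `RmidPool` and the threshold are invented for the sketch, and `occupancy` stands in for reading the llc_occupancy count):

```python
# Toy model of RMID recycling: released RMIDs park on a "limbo" list and
# are only returned to the free pool once their residual occupancy has
# drained below a threshold.
DIRTY_THRESHOLD = 0  # bytes of residual occupancy tolerated (assumed)

class RmidPool:
    def __init__(self, nr_rmids):
        self.free = list(range(1, nr_rmids))  # RMID 0 reserved as default
        self.limbo = []                       # freed, but still dirty

    def alloc(self):
        return self.free.pop(0) if self.free else None

    def release(self, rmid):
        self.limbo.append(rmid)               # not immediately reusable

    def scan_limbo(self, occupancy):
        """occupancy: rmid -> bytes still tagged; recycle drained RMIDs."""
        still_dirty = []
        for rmid in self.limbo:
            if occupancy(rmid) <= DIRTY_THRESHOLD:
                self.free.append(rmid)
            else:
                still_dirty.append(rmid)
        self.limbo = still_dirty
```

This illustrates both sides of the argument: holding the old RMID keeps the counts honest, but each parked RMID shrinks the pool until its occupancy drains.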
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Jan 19, 2017 at 6:32 PM, Vikas Shivappawrote: > Resending including Thomas , also with some changes. Sorry for the spam > > Based on Thomas and Peterz feedback Can think of two design > variants which target: > > -Support monitoring and allocating using the same resctrl group. > user can use a resctrl group to allocate resources and also monitor > them (with respect to tasks or cpu) > > -Also allows monitoring outside of resctrl so that user can > monitor subgroups who use the same closid. This mode can be used > when user wants to monitor more than just the resctrl groups. > > The first design version uses and modifies perf_cgroup, second version > builds a new interface resmon. The second version would require to build a whole new set of tools, deploy them and maintain them. Users will have to run perf for certain events and resmon (or whatever is named the new tool) for rdt. I see it as too complex and much prefer to keep using perf. > The first version is close to the patches > sent with some additions/changes. This includes details of the design as > per Thomas/Peterz feedback. > > 1> First Design option: without modifying the resctrl and using perf > > > > In this design everything in resctrl interface works like > before (the info, resource group files like task schemata all remain the > same) > > > Monitor cqm using perf > -- > > perf can monitor individual tasks using the -t > option just like before. 
> > # perf stat -e llc_occupancy -t PID1,PID2 > > user can monitor the cpu occupancy using the -C option in perf: > > # perf stat -e llc_occupancy -C 5 > > Below shows how user can monitor cgroup occupancy: > > # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/ > # mkdir /sys/fs/cgroup/perf_event/g1 > # mkdir /sys/fs/cgroup/perf_event/g2 > # echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks > > # perf stat -e intel_cqm/llc_occupancy/ -a -G g2 > > To monitor a resctrl group, user can group the same tasks in resctrl > group into the cgroup. > > To monitor the tasks in p1 in example 2 below, add the tasks in resctrl > group p1 to cgroup g1 > > # echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks > > Introducing a new option for resctrl may complicate monitoring because > supporting cgroup 'task groups' and resctrl 'task groups' leads to > situations where: > if the groups intersect, then there is no way to know what > l3_allocations contribute to which group. > > ex: > p1 has tasks t1, t2, t3 > g1 has tasks t2, t3, t4 > > The only way to get occupancy for g1 and p1 would be to allocate an RMID > for each task which can as well be done with the -t option. That's simply recreating the resctrl group as a cgroup. I think that the main advantage of doing allocation first is that we could use the context switch in rdt allocation and greatly simplify the pmu side of it. If resctrl groups could lift the restriction of one resctl per CLOSID, then the user can create many resctrl in the way perf cgroups are created now. The advantage is that there wont be cgroup hierarchy! making things much simpler. Also no need to optimize perf event context switch to make llc_occupancy work. Then we only need a way to express that monitoring must happen in a resctl to the perf_event_open syscall. My first thought is to have a "rdt_monitor" file per resctl group. A user passes it to perf_event_open in the way cgroups are passed now. 
We could extend the meaning of the flag PERF_FLAG_PID_CGROUP to also cover rdt_monitor files. The syscall can figure out whether it's a cgroup or an rdt_group. The rdt_monitoring PMU would only work with rdt_monitor groups. The rdt_monitoring PMU would then be pretty dumb, having neither task nor CPU contexts, just providing the pmu->read and pmu->event_init functions. Task monitoring can be done with resctrl as well by adding the PID to a new resctrl group and opening the event on it. And, since we'd allow a CLOSID to be shared between resctrl groups, allocation wouldn't break. It's a first idea, so please don't hate too hard ;) . David > > Monitoring cqm cgroups Implementation > - > > When monitoring two different cgroups in the same hierarchy (ex say g11 > has an ancestor g1 which are both being monitored as shown below) we > need the g11 counts to be considered for g1 as well. > > # mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/ > # mkdir /sys/fs/cgroup/perf_event/g1 > # mkdir /sys/fs/cgroup/perf_event/g1/g11 > > When measuring llc_occupancy for g1 we cannot write two different RMIDs > (because we need to count for g11 as well) > during context switch to measure the occupancy for both g1 and g11. > Hence the driver maintains this information and writes the RMID of the > lowest member in the ancestry which is being monitored during ctx > switch.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, Jan 19, 2017 at 9:41 AM, Thomas Gleixner wrote: > On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote: >> On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner wrote: >> There are use cases where the RMID to CLOSID mapping is not that simple. >> Some of them are: >> >> 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread >> during phases that initialize relevant data, while changing it to another >> during phases that pollute cache. Yet, we want the RMID to remain the same. > > That's fine. I did not say that you need fixed RMID <-> CLOSID mappings. The > point is that monitoring across different CLOSID domains is pointless. > > I have no idea how you want to do that with the proposed implementation to > switch the RMID of the thread on the fly, but that's a different story. > >> A different variation is to change CLOSID to increase/decrease the size of >> the allocated cache when high/low contention is detected. > >> 2. Contention detection. I start with: >> - T1 has RMID 1. >> - T1 changes RMID to 2. >> We will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases. > > Of course RMID1 decreases because it's no longer in use. Oh well. > >> The rate of change will be relative to the level of cache contention present >> at the time. This all happens without changing the CLOSID. > > See above. > >> > So when I monitor CPU4, i.e. CLOSID 1, and T1 runs on CPU4, then I do not >> > care at all about the occupancy of T1 simply because that is running on a >> > separate reservation. >> >> It is not useless for scenarios where CLOSID and RMIDs change dynamically. >> See above. > > Above you are talking about the same CLOSID and different RMIDs and not > about changing both. The scenario I talked about implies changing CLOSID without affecting monitoring. It happens when the allocation needs of a thread/cgroup/CPU change dynamically. 
Forcing a change of the RMID together with the CLOSID would give wrong monitoring values unless the old RMID is kept around until it becomes free, which is ugly and would waste an RMID. > >> > Trying to make that an aggregated value in the first >> > place is completely wrong. If you want an aggregate, which is pretty much >> > useless, then user space tools can generate it easily. >> >> Not useless, see above. > > It is pretty useless, because CPU4 has CLOSID1 while T1 has CLOSID4 and > making an aggregate over those two has absolutely nothing to do with your > scenario above. That's true. It is useless in the case you mentioned. I erroneously interpreted the "useless" in your comment as a general statement about aggregating RMID occupancies. > > If you want the aggregate value, then create it in user space and oracle > (or should I say google) out of it whatever you want, but do not impose > that on the kernel. > >> Having user space tools aggregate implies wasting some of the already >> scarce RMIDs. > > Oh well. Can you please explain how you want to monitor the scenario I > explained above: > > CPU4 CLOSID 1 > T1 CLOSID 4 > > So if T1 runs on CPU4 then it uses CLOSID 4 which does not at all affect > the cache occupancy of CLOSID 1. So if you use the same RMID then you > pollute either the information of CPU4 (CLOSID1) or the information of T1 > (CLOSID4). > > To gather any useful information for both CPU4 and T1 you need TWO > RMIDs. Everything else is voodoo and crystal ball analysis and we are not > going to support that. > Correct. Yet, having two RMIDs to monitor the same task/cgroup/CPU just because the CLOSID changed is wasteful. >> > The whole approach you and David have taken is to whack some desired cgroup >> > functionality and whatever into CQM without rethinking the overall >> > design. And that's fundamentally broken because it does not take cache (and >> > memory bandwidth) allocation into account. 
>> >> Monitoring and allocation are closely related yet independent. > > Independent to some degree. Sure you can claim they are completely > independent, but lots of the resulting combinations make absolutely no > sense at all. And we really don't want to support nonsensical measurements > just because we can. The outcome of this is complexity, inaccuracy and code > which is too horrible to look at. > >> I see the advantages of allowing a per-cpu RMID as you describe in the >> example. >> >> Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond >> simply monitoring occupancy per allocation. > > I agree there are use cases where you want to monitor across allocations, > like monitoring a task which has no CLOSID assigned and runs on different > CPUs and therefore potentially on different CLOSIDs which are assigned to > the different CPUs. > > That's fine and you want a separate RMID for this. > > But once you have a fixed CLOSID association then reusing and aggregating > across CLOSID domains is more than useless. > Correct. But there may not be a fixed CLOSID
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Resending including Thomas, also with some changes. Sorry for the spam.

Based on Thomas and Peterz's feedback, I can think of two design variants which target:

- Support monitoring and allocating using the same resctrl group: the user can use a resctrl group to allocate resources and also monitor them (with respect to tasks or cpu).

- Also allow monitoring outside of resctrl so that the user can monitor subgroups which use the same closid. This mode can be used when the user wants to monitor more than just the resctrl groups.

The first design version uses and modifies perf_cgroup, the second version builds a new interface resmon. The first version is close to the patches sent, with some additions/changes. This includes details of the design as per Thomas/Peterz feedback.

1> First Design option: without modifying the resctrl and using perf

In this design everything in the resctrl interface works like before (the info and resource group files like task schemata all remain the same).

Monitor cqm using perf
--

perf can monitor individual tasks using the -t option just like before.

# perf stat -e llc_occupancy -t PID1,PID2

The user can monitor the cpu occupancy using the -C option in perf:

# perf stat -e llc_occupancy -C 5

Below shows how the user can monitor cgroup occupancy:

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g2
# echo PID1 > /sys/fs/cgroup/perf_event/g2/tasks

# perf stat -e intel_cqm/llc_occupancy/ -a -G g2

To monitor a resctrl group, the user can group the same tasks in the resctrl group into the cgroup.

To monitor the tasks in p1 in example 2 below, add the tasks in resctrl group p1 to cgroup g1:

# echo 5678 > /sys/fs/cgroup/perf_event/g1/tasks

Introducing a new option for resctrl may complicate monitoring because supporting cgroup 'task groups' and resctrl 'task groups' leads to situations where, if the groups intersect, there is no way to know what l3_allocations contribute to which group. 
ex:
p1 has tasks t1, t2, t3
g1 has tasks t2, t3, t4

The only way to get occupancy for g1 and p1 would be to allocate an RMID for each task, which can just as well be done with the -t option.

Monitoring cqm cgroups: Implementation
-

When monitoring two different cgroups in the same hierarchy (say g11 has an ancestor g1 and both are being monitored as shown below) we need the g11 counts to be considered for g1 as well.

# mount -t cgroup -o perf_event perf_event /sys/fs/cgroup/perf_event/
# mkdir /sys/fs/cgroup/perf_event/g1
# mkdir /sys/fs/cgroup/perf_event/g1/g11

When measuring llc_occupancy for g1 we cannot write two different RMIDs during context switch to measure the occupancy for both g1 and g11 (because we need to count for g11 as well). Hence the driver maintains this information and, during ctx switch, writes the RMID of the lowest member in the ancestry which is being monitored.

The cqm_info is added to the perf_cgroup structure to maintain this information. The structure is allocated and destroyed at css_alloc and css_free. All the events tied to a cgroup can use the same information while reading the counts.

struct perf_cgroup {
#ifdef CONFIG_INTEL_RDT_M
	void *cqm_info;
#endif
	...
}

struct cqm_info {
	bool mon_enabled;
	int level;
	u32 *rmid;
	struct cgrp_cqm_info *mfa;
	struct list_head tskmon_rlist;
};

Due to the hierarchical nature of cgroups, every cgroup just monitors for the 'nearest monitored ancestor' (mfa) at all times. Since the root cgroup is always monitored, all descendants monitor for root at boot time, and hence every mfa points to root, except for root->mfa which is NULL.

1. RMID setup: when cgroup x starts monitoring: for each descendant y, if y's mfa->level < x->level, then y->mfa = x. (Where the level of the root node = 0...)

2. sched_in: during sched_in for x: if (x->mon_enabled) choose x->rmid, else choose x->mfa->rmid.

3. read: for each descendant of cgroup x: if (x->monitored) count += rmid_read(x->rmid).

4. evt_destroy: for each descendant y of x, if (y->mfa == x) then y->mfa = x->mfa. Meaning, if any descendant was monitoring for x, set that descendant to monitor for the cgroup which x was monitoring for.

To monitor a task in a cgroup x along with monitoring cgroup x itself, cqm_info maintains a list of tasks that are being monitored in the cgroup. When a task which belongs to a cgroup x is being monitored, it always uses its own task->rmid even if cgroup x is monitored during sched_in. To account for the counts of such tasks, the cgroup keeps this list and parses it during read. taskmon_rlist is used to maintain the list. The list is modified when a task is attached to the cgroup or removed from the group.

Example 1 (Some examples modeled from resctrl ui documentation)
-

A single
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
Hello Peterz,

On Wed, 18 Jan 2017, Peter Zijlstra wrote:
> On Wed, Jan 18, 2017 at 09:53:02AM +0100, Thomas Gleixner wrote:
>> The whole approach you and David have taken is to whack some desired
>> cgroup functionality and whatever into CQM without rethinking the overall
>> design. And that's fundamentally broken because it does not take cache
>> (and memory bandwidth) allocation into account.
>>
>> I seriously doubt, that the existing CQM/MBM code can be refactored in
>> any useful way. As Peter Zijlstra said before: Remove the existing cruft
>> completely and start with a completely new design from scratch.
>>
>> And this new design should start from the allocation angle and then add
>> the whole other muck on top so far as it's possible. Allocation related
>> monitoring must be the primary focus, everything else is just tinkering.
>
> Agreed, the little I have seen of these patches is quite horrible. And
> there seems to be a definite lack of design; or at the very least an
> utter lack of communication of it.

The 1/12 Documentation patch describes the interface. Basically we are just trying to support the task and cgroup monitoring. By the design document, do you want a document describing how we enable the cgroup for cqm, since it's a special case? (which would include all the arch_info in the perf_cgroup we add to keep track of hierarchy in the driver, etc.)

Thanks,
Vikas

> The approach, in so far that I could make sense of it, seems to utterly
> rape perf-cgroup. I think Thomas makes a sensible point in trying to
> match it to the CAT stuffs.
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote: > A 1:1 mapping between CLOSID/"Resource group" and RMID, as Fenghua suggested, > is very problematic because the number of CLOSIDs is much smaller than the > number of RMIDs, and, as Stephane mentioned, it's a common use case to want to > independently monitor many tasks/cgroups inside an allocation partition. Again, that was not my intention. I just want to limit the combinations. > A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement for > cgroup monitoring but the case where the CLOSID changes would be messy. CLOSIDs of RDT groups do not change. They are allocated when the group is created. Thanks, tglx
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, 18 Jan 2017, Stephane Eranian wrote: > On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner wrote: > > Your use case is specific to HPC and not the Web workloads we run. Jobs run > in cgroups which may span all the CPUs of the machine. CAT may be used > to partition the cache. Cgroups would run inside a partition. There may > be multiple cgroups running in the same partition. I can understand the > value of tracking occupancy per CLOSID, however that granularity is not > enough for our use case. Inside a partition, we want to know the > occupancy of each cgroup to be able to assign blame to the top > consumer. Thus, there needs to be a way to monitor occupancy per > cgroup. I'd like to understand how your proposal would cover this use > case. The point I'm making, as I explained to David, is that we need to start from the allocation angle. Of course you can monitor different tasks or task groups inside an allocation. > Another important aspect is that CQM measures new allocations, thus to > get total occupancy you need to be able to monitor the thread, CPU, > CLOSID or cgroup from the beginning of execution. In the case of a cgroup, > from the moment the first thread is scheduled into the cgroup. To > do this an RMID needs to be assigned from the beginning to the entity to > be monitored. It could be done by creating a CQM event just to cause an RMID > to be assigned, as discussed earlier in this thread. And then if a perf > stat is launched later it will get the same RMID and report full > occupancy. But that requires the first event to remain alive, i.e., some > process must keep the file descriptor open, i.e., we need some daemon or a > perf stat running in the background. That's fine, but there must be a less convoluted way to do that. The currently proposed stuff is simply horrible because it lacks any form of design and is just hacked into submission. 
> There are also use cases where you want CQM without necessarily enabling > CAT, for instance, if you want to know the cache footprint of a workload > to estimate whether it could be co-located with others. That's a subset of the other stuff because it's all bound to CLOSID 0. So you can again monitor tasks or task groups separately. Thanks, tglx
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Thu, 19 Jan 2017, David Carrillo-Cisneros wrote: > A 1:1 mapping between CLOSID/"Resource group" to RMID, as Fenghua suggested > is very problematic because the number of CLOSIDs is much much smaller than > the number of RMIDs, and, as Stephane mentioned it's a common use case to want to > independently monitor many task/cgroups inside an allocation partition. Again, that was not my intention. I just want to limit the combinations. > A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement of > cgroup monitoring but the case where CLOSID changes would be messy. CLOSIDs of RDT groups do not change. They are allocated when the group is created. Thanks, tglx
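The cardinality argument above can be made concrete with a toy pool: CLOSIDs are scarce compared to RMIDs, so a 1:many CLOSID-to-RMID binding is what lets several tasks/cgroups be monitored independently inside one allocation partition. Pool sizes and names below are illustrative only, not real hardware limits:

```c
#include <assert.h>

#define NUM_CLOSIDS 4        /* illustrative; real parts have few CLOSIDs */
#define NUM_RMIDS   64       /* ... and considerably more RMIDs */

static int rmid_owner[NUM_RMIDS];   /* CLOSID an RMID monitors under; -1 = free */

static void rmid_pool_init(void)
{
	for (int i = 0; i < NUM_RMIDS; i++)
		rmid_owner[i] = -1;
}

/* Bind a free RMID to an allocation partition. Many RMIDs may share one
 * CLOSID -- the 1:many shape per-cgroup monitoring inside a partition needs. */
static int rmid_alloc_for(int closid)
{
	for (int i = 0; i < NUM_RMIDS; i++) {
		if (rmid_owner[i] == -1) {
			rmid_owner[i] = closid;
			return i;
		}
	}
	return -1;                      /* pool exhausted */
}
```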
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, 18 Jan 2017, David Carrillo-Cisneros wrote: > On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner wrote: > There are use cases where the RMID to CLOSID mapping is not that simple. > Some of them are: > > 1. Fine-tuning of cache allocation. We may want to have a CLOSID for a thread > during phases that initialize relevant data, while changing it to another > during > phases that pollute cache. Yet, we want the RMID to remain the same. That's fine. I did not say that you need fixed RMID <-> CLOSID mappings. The point is that monitoring across different CLOSID domains is pointless. I have no idea how you want to do that with the proposed implementation to switch the RMID of the thread on the fly, but that's a different story. > A different variation is to change CLOSID to increase/decrease the size of the > allocated cache when high/low contention is detected. > > 2. Contention detection. I start with: >- T1 has RMID 1. >- T1 changes RMID to 2. > will expect llc_occupancy(1) to decrease while llc_occupancy(2) increases. Of course RMID1 decreases, because it is no longer in use. Oh well. > The rate of change will be relative to the level of cache contention present > at the time. This all happens without changing the CLOSID. See above. > > > > So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not > > care at all about the occupancy of T1 simply because that is running on a > > separate reservation. > > It is not useless for scenarios where CLOSID and RMIDs change dynamically > See above. Above you are talking about the same CLOSID and different RMIDs and not about changing both. > > Trying to make that an aggregated value in the first > > place is completely wrong. If you want an aggregate, which is pretty much > > useless, then user space tools can generate it easily. > > Not useless, see above. 
It is pretty useless, because CPU4 has CLOSID1 while T1 has CLOSID4 and making an aggregate over those two has absolutely nothing to do with your scenario above. If you want the aggregate value, then create it in user space and oracle (or should I say google) out of it whatever you want, but do not impose that on the kernel. > Having user space tools to aggregate implies wasting some of the already > scarce RMIDs. Oh well. Can you please explain how you want to monitor the scenario I explained above: CPU4 CLOSID 1 T1 CLOSID 4 So if T1 runs on CPU4 then it uses CLOSID 4 which does not at all affect the cache occupancy of CLOSID 1. So if you use the same RMID then you pollute either the information of CPU4 (CLOSID1) or the information of T1 (CLOSID4). To gather any useful information for both CPU4 and T1 you need TWO RMIDs. Everything else is voodoo and crystal ball analysis and we are not going to support that. > > The whole approach you and David have taken is to whack some desired cgroup > > functionality and whatever into CQM without rethinking the overall > > design. And that's fundamentally broken because it does not take cache (and > > memory bandwidth) allocation into account. > > Monitoring and allocation are closely related yet independent. Independent to some degree. Sure you can claim they are completely independent, but lots of the resulting combinations make absolutely no sense at all. And we really don't want to support non-sensical measurements just because we can. The outcome of this is complexity, inaccuracy and code which is too horrible to look at. > I see the advantages of allowing a per-cpu RMID as you describe in the > example. > > Yet, RMIDs and CLOSIDs should remain independent to allow use cases beyond > one simply monitoring occupancy per allocation. 
I agree there are use cases where you want to monitor across allocations, like monitoring a task which has no CLOSID assigned and runs on different CPUs and therefore potentially on different CLOSIDs which are assigned to the different CPUs. That's fine and you want a separate RMID for this. But once you have a fixed CLOSID association then reusing and aggregating across CLOSID domains is more than useless. > > I seriously doubt, that the existing CQM/MBM code can be refactored in any > useful way. As Peter Zijlstra said before: Remove the existing cruft > completely and start with completely new design from scratch. > > And this new design should start from the allocation angle and then add the > whole other muck on top so far it's possible. Allocation related monitoring > must be the primary focus, everything else is just tinkering. > > Assuming that my stated need for more than one RMID per CLOSID or more > than one CLOSID per RMID is recognized, what would be the advantage of > starting the design of monitoring from the allocation perspective? > > It's quite doable to create a new version of CQM/CMT without all the > cgroup murk. > > We can also create an easy way to open events to monitor CLOSIDs. Yet, I > don't see the advantage of
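The two-RMID argument in this exchange can be shown with toy bookkeeping: hardware tags each cache line with the RMID active at allocation time, so occupancy charged to one RMID can never be split apart afterwards. The code below is an invented illustration of that property, not the kernel's accounting:

```c
#include <assert.h>

#define NUM_RMIDS 8

static long llc_occupancy[NUM_RMIDS];   /* bytes attributed to each RMID */

/* An allocation is charged to whichever RMID is active on the CPU at that
 * moment. If CPU4's reservation (CLOSID 1) and T1 (CLOSID 4) shared one
 * RMID, both would land in the same counter, inseparably. */
static void charge_alloc(int active_rmid, long bytes)
{
	llc_occupancy[active_rmid] += bytes;
}
```

Giving the CPU4 reservation and T1 distinct RMIDs keeps the two counters independent; with a single shared RMID only their sum would ever be readable, which is the "crystal ball analysis" objected to above.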
Re: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes
On Wed, Jan 18, 2017 at 6:09 PM, David Carrillo-Cisneros wrote: > On Wed, Jan 18, 2017 at 12:53 AM, Thomas Gleixner wrote: >> On Tue, 17 Jan 2017, Shivappa Vikas wrote: >>> On Tue, 17 Jan 2017, Thomas Gleixner wrote: >>> > On Fri, 6 Jan 2017, Vikas Shivappa wrote: >>> > > - Issue(1): Inaccurate data for per package data, systemwide. Just >>> > > prints >>> > > zeros or arbitrary numbers. >>> > > >>> > > Fix: Patches fix this by just throwing an error if the mode is not >>> > > supported. >>> > > The modes supported are task monitoring and cgroup monitoring. >>> > > Also the per package >>> > > data for say socket x is returned with the -C -G cgrpy >>> > > option. >>> > > The systemwide data can be looked up by monitoring root cgroup. >>> > >>> > Fine. That just lacks any comment in the implementation. Otherwise I would >>> > not have asked the question about cpu monitoring. Though I fundamentally >>> > hate the idea of requiring cgroups for this to work. >>> > >>> > If I just want to look at CPU X why on earth do I have to set up all that >>> > cgroup muck? Just because your main focus is cgroups? >>> >>> The upstream per cpu data is broken because it's not overriding the other >>> task >>> event RMIDs on that cpu with the cpu event RMID. >>> >>> Can be fixed by adding a percpu struct to hold the RMID that's affinitized >>> to the cpu, however then we miss all the task llc_occupancy in that - still >>> evaluating it. >> >> The point here is that CQM is closely connected to the cache allocation >> technology. After a lengthy discussion we ended up having >> >> - per cpu CLOSID >> - per task CLOSID >> >> where all tasks which do not have a CLOSID assigned use the CLOSID which is >> assigned to the CPU they are running on. 
>> >> So if I configure a system by simply partitioning the cache per cpu, which >> is the proper way to do it for HPC and RT use cases where workloads are >> partitioned on CPUs as well, then I really want to have an equally simple >> way to monitor the occupancy for that reservation. >> >> And looking at that from the CAT point of view, which is the proper way to >> do it, makes it obvious that CQM should be modeled to match CAT. >> >> So let's assume the following: >>
>> CPU 0-3  default CLOSID 0
>> CPU 4    CLOSID 1
>> CPU 5    CLOSID 2
>> CPU 6    CLOSID 3
>> CPU 7    CLOSID 3
>>
>> T1       CLOSID 4
>> T2       CLOSID 5
>> T3       CLOSID 6
>> T4       CLOSID 6
>>
>> All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU >> they run on. >> >> then the obvious basic monitoring requirement is to have a RMID for each >> CLOSID. >> >> So when I monitor CPU4, i.e. CLOSID 1 and T1 runs on CPU4, then I do not >> care at all about the occupancy of T1 simply because that is running on a >> separate reservation. Trying to make that an aggregated value in the first >> place is completely wrong. If you want an aggregate, which is pretty much >> useless, then user space tools can generate it easily. >> >> The whole approach you and David have taken is to whack some desired cgroup >> functionality and whatever into CQM without rethinking the overall >> design. And that's fundamentally broken because it does not take cache (and >> memory bandwidth) allocation into account. >> >> I seriously doubt, that the existing CQM/MBM code can be refactored in any >> useful way. As Peter Zijlstra said before: Remove the existing cruft >> completely and start with completely new design from scratch. >> >> And this new design should start from the allocation angle and then add the >> whole other muck on top so far it's possible. Allocation related monitoring >> must be the primary focus, everything else is just tinkering. 
>> > > If in this email you meant "Resource group" where you wrote "CLOSID", then > please disregard my previous email. It seems like a good idea to me to have > a 1:1 mapping between RMIDs and "Resource groups". > > The distinction matters because changing the schemata in the resource group > would likely trigger a change of CLOSID, which is useful. > Just realized that the sharing of CLOSIDs is not part of the accepted version of RDT. My mental model was still on the old CAT driver that did allow sharing of CLOSIDs between cgroups. Now I understand why CLOSID was assumed to be equal with "Resource groups". Sorry for the noise. Then the comments in my previous email hold. In summary and addition to latest emails: A 1:1 mapping between CLOSID/"Resource group" to RMID, as Fenghua suggested is very problematic because the number of CLOSIDs is much much smaller than the number of RMIDs, and, as Stephane mentioned it's a common use case to want to independently monitor many task/cgroups inside an allocation partition. A 1:many mapping of CLOSID to RMIDs may work as a cheap replacement of cgroup monitoring but the case where CLOSID changes would be messy.
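The per-cpu/per-task CLOSID model quoted in this thread (tasks without their own CLOSID fall back to the CLOSID of the CPU they run on) can be sketched directly from the CPU 0-7 / T1-T4 example. This is illustrative code, not the rdtgroup implementation:

```c
#include <assert.h>

#define NO_CLOSID (-1)

/* Per-cpu default CLOSIDs from the example:
 * CPU 0-3 -> 0, CPU 4 -> 1, CPU 5 -> 2, CPU 6/7 -> 3. */
static const int cpu_closid[8] = { 0, 0, 0, 0, 1, 2, 3, 3 };

struct task {
	int closid;     /* NO_CLOSID if none assigned */
};

/* A task with its own CLOSID uses it; all other tasks use the
 * CLOSID assigned to the CPU they are running on. */
static int effective_closid(const struct task *t, int cpu)
{
	return t->closid != NO_CLOSID ? t->closid : cpu_closid[cpu];
}
```

This makes the earlier point mechanical: when T1 (CLOSID 4) runs on CPU4 (CLOSID 1), it allocates under its own reservation, so its occupancy tells you nothing about CPU4's reservation.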