Re: [RFD] CAT user space interface revisited

2016-01-06 Thread Tejun Heo
Hello, Marcelo.

On Wed, Jan 06, 2016 at 10:46:15AM -0200, Marcelo Tosatti wrote:
> Well, i suppose cgroups has facilities to handle this? That is, what is
> required is:

No, it doesn't.

> On task creation, move the new task to a particular cgroup, based on
> some visible characteristic of the task: (process name matching OR explicit
> kernel thread creator specification OR ...).

cgroup's primary goal is resource tracking and control.  For userland
processes, following fork / clone is enough; however, for a lot of
kthread work, a task isn't even the right unit.  Think of CPU cycles
spent on packet reception, for example: spawning per-cgroup kthreads
to handle packet rx separately isn't a realistic option, so the
accounting has to happen at a coarser granularity than individual
kthreads.  Except for a handful of cases, this pattern holds.  Another
example is the IO resources spent during journal writes.  Most
in-kernel resource tracking can't be split per-kthread.

While assigning kthreads to specific cgroups can be useful for a few
specific use cases, in terms of in-kernel resource tracking, it's more
of a distraction.

Please stop using cgroup for random task grouping.  Supporting the
level of flexibility needed for arbitrary grouping gets in the way of
implementing proper resource control.  You won't be happy because
cgroup's rules get in the way, and cgroup won't be happy because your
random stuff gets in the way of proper resource control.

Thomas's proposal obviously works better for the task at hand.  Maybe
there's something which can be extracted out of cgroup and shared for
task group tracking, if nothing else, hooks and synchronization, but
please don't tack it on top of cgroup when it doesn't really fit the
hierarchical resource distribution model.

Thanks.

-- 
tejun


Re: [RFD] CAT user space interface revisited

2016-01-06 Thread Marcelo Tosatti
On Wed, Jan 06, 2016 at 12:09:50AM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> > On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > > I don't have an idea how that would look like. The current structure is a
> > > cgroups based hierarchy oriented approach, which does not allow simple 
> > > things
> > > like
> > > 
> > > T1
> > > T20000
> > > 
> > > at least not in a way which is natural to the problem at hand.
> > 
> > 
> > 
> > cgroupA/
> > 
> > cbm_mask  (if set, set for all CPUs)
> 
> You mean sockets, right?
> 
> > 
> > socket1/cbm_mask
> > socket2/cbm_mask
> > ...
> > socketN/cbm_mask (if set, overrides global
> > cbm_mask).
> > 
> > Something along those lines.
> > 
> > Do you see any problem with it?
> 
> So for that case:
> 
> task1: cbm_mask 
> task2: cbm_mask 0000
> 
> i.e. task1 and task2 share bit 2/3 of the mask. 
> 
> I need to have two cgroups: cgroup1 and cgroup2, task1 is member of cgroup1
> and task2 is member of cgroup2, right?
> 
> So now add some more of this and then figure out, which cbm_masks are in use
> on which socket. That means I need to go through all cgroups and find the
> cbm_masks there.

Yes.

> With my proposed directory structure you get a very clear view about the
> in-use closids and the associated cbm_masks. That view represents the hardware
> in the best way. With the cgroups stuff we get an artificial representation
> which does not tell us anything about the in-use closids and the associated
> cbm_masks.

Because you expose the cos-ID -> cbm / cdp mask association.

Fine, I agree that's nice.

> > > I cannot imagine how that modification to the current interface would 
> > > solve
> > > that. Not to talk about per CPU associations which are not related to 
> > > tasks at
> > > all.
> > 
> > Not sure what you mean by per CPU associations.
> 
> As I wrote before:
> 
>  "It would even be sufficient for particular use cases to just associate
>   a piece of cache to a given CPU and do not bother with tasks at all."
> 
> > If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> > run on that pCPU, then you control the cbmmask for all tasks (say
> > tasklist-1) on that CPU, fine.
> > 
> > Can achieve the same by putting all tasks from tasklist-1 into a
> > cgroup.
> 
> Which means, that I need to go and find everything including kernel threads
> and put them into a particular cgroup. That's really not useful and it simply
> does not work:
> 
> To which cgroup belongs a dynamically created per cpu worker thread? To the
> cgroup of the parent. But is the parent necessarily in the proper cgroup? No,
> there is no guarantee. So it ends up in some random cgroup unless I start
> chasing every new thread, instead of letting it use the default cosid of the
> CPU.

Well, I suppose cgroups has facilities to handle this? That is, what is
required is:

On task creation, move the new task to a particular cgroup, based on
some visible characteristic of the task (process name matching, or an
explicit specification by the kernel thread's creator, or ...).

Because there are two cases. Consider a kernel thread T which contains
code that is timing sensitive and therefore needs to use a COSID (that
is, a reserved portion of cache).

Case 1) kernel thread T starts kernel thread R, which is also timing
sensitive (and wants to use the same COSID as kernel thread T).
In that case, the cgroup default (inherit the cgroup from the parent)
behaviour is correct.

Case 2) kernel thread T starts kernel thread X, which is not timing
sensitive, so kernel thread X should use the "default cosid".
In the case of cgroups, in the example used elsewhere in this thread,
kernel thread X should be moved to "cgroupALL".

Strictly speaking there is a third case:

Case 3) kernel thread T starts kernel thread Z, which wants to be
moved to a COSID different from kernel thread T's COSID.

So using the default COSID is not necessarily the correct thing to do;
this should be configurable on a per-case basis.
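
To make the three cases concrete, here is a minimal user-space sketch;
the policy enum, helper name and values are hypothetical, purely for
illustration, not an existing or proposed kernel API:

/*
 * Illustrative sketch only; the enum, helper and values are hypothetical,
 * not an existing or proposed kernel API.
 */
#include <stdio.h>

enum cos_policy { COS_INHERIT, COS_DEFAULT, COS_EXPLICIT };

static int pick_cosid(enum cos_policy policy, int parent_cosid,
		      int cpu_default_cosid, int explicit_cosid)
{
	switch (policy) {
	case COS_INHERIT:  return parent_cosid;      /* case 1: thread R */
	case COS_DEFAULT:  return cpu_default_cosid; /* case 2: thread X */
	case COS_EXPLICIT: return explicit_cosid;    /* case 3: thread Z */
	}
	return cpu_default_cosid;
}

int main(void)
{
	int t_cosid = 5, cpu_default = 0;

	printf("R: %d\n", pick_cosid(COS_INHERIT, t_cosid, cpu_default, -1));
	printf("X: %d\n", pick_cosid(COS_DEFAULT, t_cosid, cpu_default, -1));
	printf("Z: %d\n", pick_cosid(COS_EXPLICIT, t_cosid, cpu_default, 7));
	return 0;
}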

> Having a per cpu default cos-id which is used when the task does not have a
> cos-id associated makes a lot of sense and makes it simpler to utilize that
> facility.

You would need a facility to switch to "inherit cgroup from parent"
mode, and also to handle case 3 (which I supposed cgroups handled,
because the same problem exists for other cgroup controllers).

> > > >> Per cpu default cos id for the cpus on that socket:
> > > >> 
> > > >>  xxx/cat/socket-N/cpu-x/default_cosid
> > > >>  ...
> > > >>  xxx/cat/socket-N/cpu-N/default_cosid
> > > >>
> > > >> The above allows a simple cpu based partitioning. All tasks which do
> > > >> not have a cache partition assigned on a particular socket use the
> > > >> default one of the cpu they are running on.
> > > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.

Re: [RFD] CAT user space interface revisited

2016-01-05 Thread Thomas Gleixner
Marcelo,

On Mon, 4 Jan 2016, Marcelo Tosatti wrote:
> On Thu, Dec 31, 2015 at 11:30:57PM +0100, Thomas Gleixner wrote:
> > I don't have an idea how that would look like. The current structure is a
> > cgroups based hierarchy oriented approach, which does not allow simple 
> > things
> > like
> > 
> > T1  
> > T2  0000
> > 
> > at least not in a way which is natural to the problem at hand.
> 
> 
> 
>   cgroupA/
> 
>   cbm_mask  (if set, set for all CPUs)

You mean sockets, right?

> 
>   socket1/cbm_mask
>   socket2/cbm_mask
>   ...
>   socketN/cbm_mask (if set, overrides global
>   cbm_mask).
> 
> Something along those lines.
> 
> Do you see any problem with it?

So for that case:

task1:   cbm_mask 
task2:   cbm_mask 0000

i.e. task1 and task2 share bit 2/3 of the mask. 

I need to have two cgroups: cgroup1 and cgroup2, task1 is member of cgroup1
and task2 is member of cgroup2, right?

So now add some more of this and then figure out, which cbm_masks are in use
on which socket. That means I need to go through all cgroups and find the
cbm_masks there.

With my proposed directory structure you get a very clear view about the
in-use closids and the associated cbm_masks. That view represents the hardware
in the best way. With the cgroups stuff we get an artificial representation
which does not tell us anything about the in-use closids and the associated
cbm_masks.
 
> > I cannot imagine how that modification to the current interface would solve
> > that. Not to talk about per CPU associations which are not related to tasks 
> > at
> > all.
> 
> Not sure what you mean by per CPU associations.

As I wrote before:

 "It would even be sufficient for particular use cases to just associate
  a piece of cache to a given CPU and do not bother with tasks at all."

> If you fix a cbmmask on a given pCPU, say CPU1, and control which tasks
> run on that pCPU, then you control the cbmmask for all tasks (say
> tasklist-1) on that CPU, fine.
> 
> Can achieve the same by putting all tasks from tasklist-1 into a
> cgroup.

Which means, that I need to go and find everything including kernel threads
and put them into a particular cgroup. That's really not useful and it simply
does not work:

To which cgroup belongs a dynamically created per cpu worker thread? To the
cgroup of the parent. But is the parent necessarily in the proper cgroup? No,
there is no guarantee. So it ends up in some random cgroup unless I start
chasing every new thread, instead of letting it use the default cosid of the
CPU.

Having a per cpu default cos-id which is used when the task does not have a
cos-id associated makes a lot of sense and makes it simpler to utilize that
facility.

> > >> Per cpu default cos id for the cpus on that socket:
> > >> 
> > >>  xxx/cat/socket-N/cpu-x/default_cosid
> > >>  ...
> > >>  xxx/cat/socket-N/cpu-N/default_cosid
> > >>
> > >> The above allows a simple cpu based partitioning. All tasks which do
> > >> not have a cache partition assigned on a particular socket use the
> > >> default one of the cpu they are running on.
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> 
> Not required because with current Intel patchset you'd do:


...


> # cat intel_rdt.l3_cbm
> 0000
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 00ff
> 
> Bits f0 are shared between cgroupRSVD and cgroupALL. Lets change:
> # echo 0xf > ../cgroupALL/intel_rdt.l3_cbm
> # cat ../cgroupALL/intel_rdt.l3_cbm
> 000f
> 
> Now they share none.

Well, you changed ALL and everything, but you still did not assign a
particular cos-id to a particular CPU as its default.
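
For illustration, with the proposed layout the per-cpu default would be
set from userspace roughly like this; "/sys/fs/cat" is only a stand-in
for the proposal's "xxx/cat" prefix, and the whole interface is
hypothetical:

/*
 * Hypothetical illustration only: set a per-cpu default COSid through the
 * proposed directory layout.  "/sys/fs/cat" stands in for the proposal's
 * "xxx/cat" prefix; no such interface exists today.
 */
#include <stdio.h>

static int set_default_cosid(int socket, int cpu, int cosid)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/fs/cat/socket-%d/cpu-%d/default_cosid", socket, cpu);
	f = fopen(path, "w");
	if (!f)
		return -1;
	fprintf(f, "%d\n", cosid);
	return fclose(f);
}

int main(void)
{
	/* e.g. make COSid 2 the default for CPU 3 on socket 0 */
	return set_default_cosid(0, 3, 2) ? 1 : 0;
}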
 
> > >> Now for the task(s) partitioning:
> > >>
> > >>  xxx/cat/partitions/
> > >>
> > >> Under that directory one can create partitions
> > >> 
> > >>  xxx/cat/partitions/p1/tasks
> > >>   /socket-0/cosid
> > >>   ...
> > >>   /socket-n/cosid
> > >> 
> > >> The default value for the per socket cosid is COSID_DEFAULT, which
> > >> causes the task(s) to use the per cpu default id. 
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> > 
> > Yes. I ask the same question several times and I really want to see the
> > directory/interface structure which solves all of the above before anyone
> > starts to implement it. 
> 
> I don't see the problem, have a sequence of commands above which shows
> to set a directory structure which is useful and does what the HW 
> interface is supposed to do.

Well, you have a sequence of commands, which gives you the result which you
need for your particular problem.

> > We already have a completely useless interface (*1)
> > and there is no point to implement another one based on it (*2) just because
> > it so

Re: [RFD] CAT user space interface revisited

2016-01-04 Thread Marcelo Tosatti
CPU and do not bother with tasks at all.
> > > 
> > > We really need to make this as configurable as possible from userspace
> > > without imposing random restrictions to it. I played around with it on
> > > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > > enabled) makes it really useless if we force the ids to have the same
> > > meaning on all sockets and restrict it to per task partitioning."
> > > 
> > > Yes, thats the issue we hit, that is the modification that was agreed
> > > with Intel, and thats what we are waiting for them to post.
> > 
> > How do you implement the above - especially that part:
> > 
> >  "It would even be sufficient for particular use cases to just associate a
> >   piece of cache to a given CPU and do not bother with tasks at all."
> > 
> > as a "simple" modification to (*1) ?
> 
> As noted above.
> >  
> > > > I described a directory structure for that qos/cat stuff in my proposal 
> > > > and
> > > > that's complete AFAICT.
> > > 
> > > Ok, lets make the job for the submitter easier. You are the maintainer,
> > > so you decide.
> > > 
> > > Is it enough for you to have (*2) (which was agreed with Intel), or 
> > > would you rather prefer to integrate the directory structure at 
> > > "[RFD] CAT user space interface revisited" ?
> > 
> > The only thing I care about as a maintainer is, that we merge something 
> > which
> > actually reflects the properties of the hardware and gives the admin the
> > required flexibility to utilize it fully. I don't care at all if it's my
> > proposal or something else which allows to do the same.
> > 
> > Let me copy the relevant bits from my proposal here once more and let me ask
> > questions to the various points so you can tell me how that modification to
> > (*1) is going to deal with that.
> > 
> > >> At top level:
> > >>
> > >>  xxx/cat/max_cosids <- Assume that all CPUs are the same
> > >>  xxx/cat/max_maskbits <- Assume that all CPUs are the same
> 
> This can be exposed to userspace via a file.
> 
> > >>  xxx/cat/cdp_enable <- Depends on CDP availability
> > 
> >  Where is that information in (*2) and how is that related to (*1)? If you
> >  think it's not required, please explain why.
> 
> Intel has come up with a scheme to implement CDP. I'll go read 
> that and reply to this email afterwards.

Pasting relevant parts of the patchset submission.
Looks fine to me: two files, one for the data cache cbm mask, another
for the instruction cache cbm mask.
Those two files would be moved to the "socket-N" directories.

(will review the CDP patchset...).

Subject: [PATCH V2 0/5] x86: Intel Code Data Prioritization Support

This patch set supports Intel Code Data Prioritization, which is an
extension of cache allocation and allows code and data cache to be
allocated separately. It also includes the cgroup interface for the user
as separate patches. The cgroup interface for cache alloc is also resent.

This patch adds enumeration support for the Code Data Prioritization
(CDP) feature found in future Intel Xeon processors. It includes CPUID
enumeration routines for CDP.

CDP is an extension to Cache Allocation and lets threads allocate a
subset of the L3 cache for code and data separately. The allocation is
represented by the code or data cache capacity bit mask (cbm) MSRs
IA32_L3_QOS_MASK_n. Each class of service is associated with one
dcache_cbm and one icache_cbm MSR, and hence the number of available
CLOSids is halved with CDP. The association for a CLOSid 'n' is shown
below:

    data_cbm_address(n) = base + (n << 1)
    code_cbm_address(n) = base + (n << 1) + 1

During scheduling the kernel writes the CLOSid of the thread to the
IA32_PQR_ASSOC MSR.
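
As a rough illustration of that mapping, assuming the IA32_L3_QOS_MASK_0
base address of 0xc90 (an assumption on my part, not stated in the patch
description); this is just the arithmetic, not kernel code:

/*
 * Illustrative sketch of the CLOSid -> CBM MSR mapping described above.
 * The 0xc90 base for IA32_L3_QOS_MASK_0 is an assumption.
 */
#include <stdio.h>

#define L3_QOS_MASK_BASE 0xc90u

static unsigned int data_cbm_address(unsigned int closid)
{
	return L3_QOS_MASK_BASE + (closid << 1);
}

static unsigned int code_cbm_address(unsigned int closid)
{
	return L3_QOS_MASK_BASE + (closid << 1) + 1;
}

int main(void)
{
	for (unsigned int closid = 0; closid < 4; closid++)
		printf("CLOSid %u: dcache MSR 0x%x, icache MSR 0x%x\n",
		       closid, data_cbm_address(closid),
		       code_cbm_address(closid));
	return 0;
}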

Adds two files to the intel_rdt cgroup, 'dcache_cbm' and 'icache_cbm',
when code data prioritization (cdp) support is present. The files
represent the data capacity bit mask (cbm) and the instruction cbm for
the L3 cache. The user can specify the data and code cbm, and the
threads belonging to the cgroup get to fill the L3 cache represented by
the cbm with data or code.

For example, consider a scenario where the max cbm is 10 bits and the
L3 cache size is 10MB: specifying dcache_cbm = 0x3 and icache_cbm = 0xc
would give 2MB of exclusive cache each for data and code for the tasks
to fill.
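
The size arithmetic behind that example, as an illustrative sketch (each
set cbm bit covers l3_size / max_cbm_bits of the cache):

/*
 * Illustrative only: cache size covered by a cbm, given the number of cbm
 * bits and the L3 size, matching the 10-bit / 10MB example above.
 */
#include <stdio.h>

static unsigned int cbm_size_mb(unsigned long cbm, unsigned int max_cbm_bits,
				unsigned int l3_size_mb)
{
	return __builtin_popcountl(cbm) * l3_size_mb / max_cbm_bits;
}

int main(void)
{
	printf("dcache_cbm 0x3: %u MB\n", cbm_size_mb(0x3, 10, 10));
	printf("icache_cbm 0xc: %u MB\n", cbm_size_mb(0xc, 10, 10));
	return 0;
}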

This feature is an extension to cache allocation and lets the user
specify a capacity for code and data separately. Initially these cbms
have the same value as the l3_cbm (which represents the common cbm for
code and data). Once the user tries to write to either the dcache_cbm
or the icache_cbm, the kernel tries to enable the cdp mode in hardware
by writing to the IA32_PQOS_CFG MSR. The switch is only possible if the
number of class of service IDs (CLOSids) in use is less than half of the
total CLOSids available at the time of the switch. This is because the
CLOSids are halved once CDP is enabled, and each CLOSid now maps to a
data IA32_L3_QOS_n MSR and a code IA32_L3_QOS_n MSR.
Once CDP is enabled the user can use the dcache_cbm and icache_cbm just
like the l3_cbm. The CLOSids are not exposed to the user and are
maintained internally by the kernel.



Re: [RFD] CAT user space interface revisited

2016-01-04 Thread Marcelo Tosatti
or them to post.
> 
> How do you implement the above - especially that part:
> 
>  "It would even be sufficient for particular use cases to just associate a
>   piece of cache to a given CPU and do not bother with tasks at all."
> 
> as a "simple" modification to (*1) ?

As noted above.
>  
> > > I described a directory structure for that qos/cat stuff in my proposal 
> > > and
> > > that's complete AFAICT.
> > 
> > Ok, lets make the job for the submitter easier. You are the maintainer,
> > so you decide.
> > 
> > Is it enough for you to have (*2) (which was agreed with Intel), or 
> > would you rather prefer to integrate the directory structure at 
> > "[RFD] CAT user space interface revisited" ?
> 
> The only thing I care about as a maintainer is, that we merge something which
> actually reflects the properties of the hardware and gives the admin the
> required flexibility to utilize it fully. I don't care at all if it's my
> proposal or something else which allows to do the same.
> 
> Let me copy the relevant bits from my proposal here once more and let me ask
> questions to the various points so you can tell me how that modification to
> (*1) is going to deal with that.
> 
> >> At top level:
> >>
> >>  xxx/cat/max_cosids <- Assume that all CPUs are the same
> >>  xxx/cat/max_maskbits <- Assume that all CPUs are the same

This can be exposed to userspace via a file.

> >>  xxx/cat/cdp_enable <- Depends on CDP availability
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Intel has come up with a scheme to implement CDP. I'll go read 
that and reply to this email afterwards.

> >> Per socket data:
> >>
> >>  xxx/cat/socket-0/
> >>  ...
> >>  xxx/cat/socket-N/l3_size
> >>  xxx/cat/socket-N/hwsharedbits
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

l3_size: userspace can figure that by itself (exposed somewhere in
sysfs).

hwsharedbits: all userspace needs to know is
which bits are shared with HW, to decide whether or not to use that
region of a given socket for a given cbm mask.

So expose that to userspace, fine. Can do that in cgroups.

> >> Per socket mask data:
> >>
> >>  xxx/cat/socket-N/cos-id-0/
> >>  ...
> >>  xxx/cat/socket-N/cos-id-N/inuse
> >>   /cat_mask
> >>   /cdp_mask <- Data mask if CDP enabled
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Unsure - will reply in next email (but per-socket information seems
independent of that).

> 
> >> Per cpu default cos id for the cpus on that socket:
> >> 
> >>  xxx/cat/socket-N/cpu-x/default_cosid
> >>  ...
> >>  xxx/cat/socket-N/cpu-N/default_cosid
> >>
> >> The above allows a simple cpu based partitioning. All tasks which do
> >> not have a cache partition assigned on a particular socket use the
> >> default one of the cpu they are running on.
> 
>  Where is that information in (*2) and how is that related to (*1)? If you
>  think it's not required, please explain why.

Not required because with current Intel patchset you'd do:


# mount | grep rdt
cgroup on /sys/fs/cgroup/intel_rdt type cgroup
(rw,nosuid,nodev,noexec,relatime,intel_rdt)
# cd /sys/fs/cgroup/intel_rdt
# ls
cgroupALL              cgroup.procs  cgroup.sane_behavior  notify_on_release  tasks
cgroup.clone_children  cgroupRSVD    intel_rdt.l3_cbm      release_agent
# cat tasks
1042
1066
1067
1069
...
# cd cgroupALL/
# ps auxw | while read i; do echo $i ; done | cut -f 2 -d " " | grep -v PID | while read x ; do echo $x > tasks; done
-bash: echo: write error: No such process
-bash: echo: write error: No such process
-bash: echo: write error: No such process
-bash: echo: write error: No such process

# cat ../tasks | while read i; do echo $i > tasks; done
# cat ../tasks  | wc -l
0
(no tasks on root cgroup)

# cd ../cgroupRSVD
# cat tasks
# ps auxw | grep postfix
root   1942  0.0  0.0  91136  4860 ?Ss   Nov25   0:00
/usr/libexec/postfix/master -w
postfix1981  0.0  0.0  91308  6520 ?SNov25   0:00 qmgr
-l -t unix -u
postfix4416  0.0  0.0  91240  6296 ?S17:05   0:00 pickup
-l -t unix -u
root   4486  0.0  0.0 112652  2304 pts/0S+   17:31   0:00 grep
--color=auto postfix
# echo 4416 > tasks
# cat

Re: [RFD] CAT user space interface revisited

2015-12-31 Thread Thomas Gleixner
Marcelo,

On Thu, 31 Dec 2015, Marcelo Tosatti wrote:

First of all thanks for the explanation.

> There is one directory structure in this topic, CAT. That is the
> directory structure which is exposed to userspace to control the 
> CAT HW. 
> 
> With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
> x86: Intel Cache Allocation Technology Support"), the directory
> structure there (the files and directories exposed by that patchset)
> (*1) does not allow one to configure different CBM masks on each socket
> (that is, it forces the user to configure the same mask CBM on every
> socket). This is a blocker for us, and it is one of the points in your
> proposal.
> 
> There was a call between Red Hat and Intel where it was communicated
> to Intel, and Intel agreed, that it was necessary to fix this (fix this
> == allow different CBM masks on different sockets).
> 
> Now, that is one change to the current directory structure (*1).

I don't have an idea how that would look like. The current structure is a
cgroups based hierarchy oriented approach, which does not allow simple things
like

T1  
T2  0000

at least not in a way which is natural to the problem at hand.

> (*1) modified to allow for different CBM masks on different sockets, 
> lets say (*2), is what we have been waiting for Intel to post. 
> It would handle our usecase, and all use-cases which the current
> patchset from Intel already handles (Vikas posted emails mentioning
> there are happy users of the current interface, feel free to ask 
> him for more details).

I cannot imagine how that modification to the current interface would solve
that. Not to talk about per CPU associations which are not related to tasks at
all.

> What i have asked you, and you replied "to go Google read my previous
> post" is this:
> What are the advantages over you proposal (which is a completely
> different directory structure, requiring a complete rewrite),
> over (*2) ?
> 
> (what is my reason behind this: the reason is that if you, with
> maintainer veto power, forces your proposal to be accepted, it will be
> necessary to wait for another rewrite (a new set of problems, fully
> think through your proposal, test it, ...) rather than simply modify an
> already known, reviewed, already used directory structure.
> 
> And functionally, your proposal adds nothing to (*2) (other than, well,
> being a different directory structure).

Sorry. I cannot see at all how a modification to the existing interface would
cover all the sensible use cases I described in a coherent way. I really want
to see a proper description of the interface before people start hacking on it
in a frenzy. What you described is a "let's say (*2)" modification. That's
pretty meager.

> If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
> i am fine with that. But i since i doubt that will be the case, i am 
> pushing for the interface which requires the least amount of changes
> (and therefore the least amount of time) to be integrated.
> 
> From your email:
> 
> "It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning."
> 
> Yes, thats the issue we hit, that is the modification that was agreed
> with Intel, and thats what we are waiting for them to post.

How do you implement the above - especially that part:

 "It would even be sufficient for particular use cases to just associate a
  piece of cache to a given CPU and do not bother with tasks at all."

as a "simple" modification to (*1) ?
 
> > I described a directory structure for that qos/cat stuff in my proposal and
> > that's complete AFAICT.
> 
> Ok, lets make the job for the submitter easier. You are the maintainer,
> so you decide.
> 
> Is it enough for you to have (*2) (which was agreed with Intel), or 
> would you rather prefer to integrate the directory structure at 
> "[RFD] CAT user space interface revisited" ?

The only thing I care about as a maintainer is, that we merge something which
actually reflects the properties of the hardware and gives the admin the
required flexibility to utilize it fully. I don't care at all if it's my
proposal or something else which allows to do the same.

Let me copy the relevant bits from my proposal here once more and let me ask
questions to the various points so you can tell me how that modification to
(*1) is going to deal with that.

Re: [RFD] CAT user space interface revisited

2015-12-31 Thread Marcelo Tosatti
On Tue, Dec 29, 2015 at 01:44:16PM +0100, Thomas Gleixner wrote:
> Marcelo,
> 
> On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> > On Tue, Dec 22, 2015 at 06:12:05PM +, Yu, Fenghua wrote:
> > > > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > > >
> > > > I was not able to identify any existing infrastructure where this 
> > > > really fits in. I
> > > > chose a directory/file based representation. We certainly could do the 
> > > > same
> > > 
> > > Is this be /sys/devices/system/?
> > > Then create qos/cat directory. In the future, other directories may be 
> > > created
> > > e.g. qos/mbm?
> > 
> > I suppose Thomas is talking about the socketmask only, as discussed in
> > the call with Intel.
> 
> I have no idea about what you talked in a RH/Intel call.
>  
> > Thomas, is that correct? (if you want a change in directory structure,
> > please explain the whys, because we don't need that change in directory 
> > structure).
> 
> Can you please start to write coherent and understandable mails? I have no
> idea of which directory structure, which does not need to be changed, you are
> talking.

Thomas,

There is one directory structure in this topic, CAT. That is the
directory structure which is exposed to userspace to control the 
CAT HW. 

With the current patchset posted by Intel ("Subject: [PATCH V16 00/11]
x86: Intel Cache Allocation Technology Support"), the directory
structure there (the files and directories exposed by that patchset)
(*1) does not allow one to configure different CBM masks on each socket
> (that is, it forces the user to configure the same CBM mask on every
socket). This is a blocker for us, and it is one of the points in your
proposal.

There was a call between Red Hat and Intel where it was communicated
to Intel, and Intel agreed, that it was necessary to fix this (fix this
== allow different CBM masks on different sockets).

Now, that is one change to the current directory structure (*1).

(*1) modified to allow for different CBM masks on different sockets, 
lets say (*2), is what we have been waiting for Intel to post. 
It would handle our usecase, and all use-cases which the current
patchset from Intel already handles (Vikas posted emails mentioning
there are happy users of the current interface, feel free to ask 
him for more details).

What I have asked you, and you replied "to go Google read my previous
post", is this:
What are the advantages of your proposal (which is a completely
different directory structure, requiring a complete rewrite)
over (*2)?

(What is my reason behind this: if you, with maintainer veto power,
force your proposal to be accepted, it will be necessary to wait for
another rewrite (a new set of problems, fully thinking through your
proposal, testing it, ...) rather than simply modifying an already
known, reviewed, already used directory structure.)

And functionally, your proposal adds nothing to (*2) (other than, well,
being a different directory structure).

If Fenghua or you post a patchset, say in 2 weeks, with your proposal,
I am fine with that. But since I doubt that will be the case, I am
pushing for the interface which requires the least amount of changes
(and therefore the least amount of time) to be integrated.

From your email:

"It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and do not bother with tasks at all.

We really need to make this as configurable as possible from userspace
without imposing random restrictions to it. I played around with it on
my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning."

Yes, that's the issue we hit, that is the modification that was agreed
with Intel, and that's what we are waiting for them to post.

> I described a directory structure for that qos/cat stuff in my proposal and
> that's complete AFAICT.

Ok, let's make the job for the submitter easier. You are the maintainer,
so you decide.

Is it enough for you to have (*2) (which was agreed with Intel), or
would you rather integrate the directory structure from
"[RFD] CAT user space interface revisited"?

Thanks.




Re: [RFD] CAT user space interface revisited

2015-12-29 Thread Thomas Gleixner
Marcelo,

On Wed, 23 Dec 2015, Marcelo Tosatti wrote:
> On Tue, Dec 22, 2015 at 06:12:05PM +, Yu, Fenghua wrote:
> > > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > >
> > > I was not able to identify any existing infrastructure where this really 
> > > fits in. I
> > > chose a directory/file based representation. We certainly could do the 
> > > same
> > 
> > Is this be /sys/devices/system/?
> > Then create qos/cat directory. In the future, other directories may be 
> > created
> > e.g. qos/mbm?
> 
> I suppose Thomas is talking about the socketmask only, as discussed in
> the call with Intel.

I have no idea about what you talked in a RH/Intel call.
 
> Thomas, is that correct? (if you want a change in directory structure,
> please explain the whys, because we don't need that change in directory 
> structure).

Can you please start to write coherent and understandable mails? I have no
idea which directory structure, which does not need to be changed, you are
talking about.

I described a directory structure for that qos/cat stuff in my proposal and
that's complete AFAICT.

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-12-23 Thread Marcelo Tosatti
On Tue, Dec 22, 2015 at 06:12:05PM +, Yu, Fenghua wrote:
> > From: Thomas Gleixner [mailto:t...@linutronix.de]
> > Sent: Wednesday, November 18, 2015 10:25 AM
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a while, 
> > I
> > think we all should sit back and look at it from scratch again w/o our
> > preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >- It's a per socket facility
> > 
> >- CAT slots can be associated to external hardware. This
> >  association is per socket as well, so different sockets can have
> >  different behaviour. I missed that detail when staring the first
> >  time, thanks for the pointer!
> > 
> >- The association itself is per cpu. The COS selection happens on a
> >  CPU while the set of masks which are selected via COS are shared
> >  by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >- The bits which select a cache partition need to be consecutive
> > 
> >- The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> > 
> > Shared:1 1 1 1 1 1 1 1
> >0 0 1 1 1 1 1 1
> >0 0 0 0 1 1 1 1
> >0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> >0 0 0 0 1 1 0 0
> >0 0 0 0 0 0 1 0
> >0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity of a
> > sysadmin. The worst outcome might be L3 disabled for everything, so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we really
> > need to look at CDP right now. We might end up with something which looks
> > like this:
> > 
> >1 1 1 1 0 0 0 0  Code
> >1 1 1 1 0 0 0 0  Data
> >0 0 0 0 0 0 1 0  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> > or
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 0 1 1 0  Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >1) Per task partitioning
> > 
> >2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as well. 
> > Let me
> > give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on it. You
> > give that task a slice of cache. Now that task needs kernel services which 
> > run
> > in kernel threads on that CPU. We really don't want to (and cannot) hunt
> > down random kernel threads (think cpu bound worker threads, softirq
> > threads ) and give them another slice of cache. What we really want is:
> > 
> >  1 1 1 1 0 0 0 0<- Default cache
> >  0 0 0 0 1 1 1 0<- Cache for important task
> >  0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate a 
> > piece of
> > cache to a given CPU and do not bother with tasks at all.
> > 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on my 
> > new
> > intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same 
> > meaning
> > on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available, there are
> > not going to be enough to have a system wide consistent view unless we
> > have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is useful at 
> > all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >Real sensitive applications will simply pin tasks on a socket to
> >avoid that in the first place. If we make the whole thing
> >configurable enough then the sysadmin can set it up to support
> >even the nonsensical case of identical cache partitions on all
> >sockets and let tasks use the corresponding partitions when
> >migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >the configuration problem any simpler as it remains per socket.
> > 
> > It's hard. Policies are hard by definition, but this one is harder than most
> > other policies due to the inherent limitations.
> > 
> > So now to the interface part. Unfortunately we need to expose this very
> > close to the hardware implementation as there are really no abstractions
> > which allow us to express the various bitmap combinations.

RE: [RFD] CAT user space interface revisited

2015-12-22 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, November 18, 2015 10:25 AM
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a while, I
> think we all should sit back and look at it from scratch again w/o our
> preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>- It's a per socket facility
> 
>- CAT slots can be associated to external hardware. This
>  association is per socket as well, so different sockets can have
>  different behaviour. I missed that detail when staring the first
>  time, thanks for the pointer!
> 
>- The association itself is per cpu. The COS selection happens on a
>  CPU while the set of masks which are selected via COS are shared
>  by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>- The bits which select a cache partition need to be consecutive
> 
>- The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
> 
> Shared:  1 1 1 1 1 1 1 1
>  0 0 1 1 1 1 1 1
>  0 0 0 0 1 1 1 1
>  0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
>  0 0 0 0 1 1 0 0
>  0 0 0 0 0 0 1 0
>  0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity of a
> sysadmin. The worst outcome might be L3 disabled for everything, so what?
> 
> Now that gets even more convoluted if CDP comes into play and we really
> need to look at CDP right now. We might end up with something which looks
> like this:
> 
>  1 1 1 1 0 0 0 0  Code
>  1 1 1 1 0 0 0 0  Data
>  0 0 0 0 0 0 1 0  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
> or
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 0 1 1 0  Data
> 
> Let's look at partitioning itself. We have two options:
> 
>1) Per task partitioning
> 
>2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as well. Let 
> me
> give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on it. You
> give that task a slice of cache. Now that task needs kernel services which run
> in kernel threads on that CPU. We really don't want to (and cannot) hunt
> down random kernel threads (think cpu bound worker threads, softirq
> threads ) and give them another slice of cache. What we really want is:
> 
>1 1 1 1 0 0 0 0<- Default cache
>0 0 0 0 1 1 1 0<- Cache for important task
>0 0 0 0 0 0 0 1<- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate a 
> piece of
> cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on my new
> intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same meaning
> on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available, there are
> not going to be enough to have a system wide consistent view unless we
> have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is useful at 
> all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>Real sensitive applications will simply pin tasks on a socket to
>avoid that in the first place. If we make the whole thing
>configurable enough then the sysadmin can set it up to support
>even the nonsensical case of identical cache partitions on all
>sockets and let tasks use the corresponding partitions when
>migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder than most
> other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this very
> close to the hardware implementation as there are really no abstractions
> which allow us to express the various bitmap combinations. Any abstraction I
> tried to come up with renders that thing completely useless.
> 
> I was not able to identify any existing infrastructure where this really
> fits in. I chose a directory/file based representation.

Re: [RFD] CAT user space interface revisited

2015-11-25 Thread Marcelo Tosatti
On Tue, Nov 24, 2015 at 03:31:24PM +0800, Chao Peng wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >1) Per task partitioning
> > 
> >2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> 
> I would second this. In practice per CPU partitioning is useful for
> realtime as well. And I can see three possible solutions:
> 
>  1) What you suggested below, to address both problems in one
> framework. But I wonder if it would end with too complex.
> 
>  2) Achieve per CPU partitioning with per task partitioning. For
> example, if current CAT patch can solve the kernel threads
>   problem, together with CPU pinning, we then can set a same CBM
>   for all the tasks/kernel threads run on an isolated CPU. 

As for the kernel threads problem, it seems it's a silly limitation of
the code which handles writes to cgroups:

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index f89d929..0603652 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2466,16 +2466,6 @@ static ssize_t __cgroup_procs_write(struct kernfs_open_file *of, char *buf,
 	if (threadgroup)
 		tsk = tsk->group_leader;
 
-	/*
-	 * Workqueue threads may acquire PF_NO_SETAFFINITY and become
-	 * trapped in a cpuset, or RT worker may be born in a cgroup
-	 * with no rt_runtime allocated.  Just say no.
-	 */
-	if (tsk == kthreadd_task || (tsk->flags & PF_NO_SETAFFINITY)) {
-		ret = -EINVAL;
-		goto out_unlock_rcu;
-	}
-
 	get_task_struct(tsk);
 	rcu_read_unlock();

For a cgroup hierarchy with no cpusets (such as CAT only) this
limitation makes no sense (looking for a place where to move this to).

Any ETA on per-socket bitmasks? 

> 
>  3) I wonder if it feasible to separate the two requirements? For
> example, divides the work into three components: rdt-base,
>   per task interface (current cgroup interface/IOCTL or something)
>   and per CPU interface. The two interfaces are exclusive and
>   selected at build time. One thing to reject this option would be
>   even with per CPU partitioning, we still need per task partitioning,
>   in that case we will go to option 1) again.
> 
> Thanks,
> Chao


Re: [RFD] CAT user space interface revisited

2015-11-25 Thread Marcelo Tosatti
On Tue, Nov 24, 2015 at 07:25:43PM -0200, Marcelo Tosatti wrote:
> On Tue, Nov 24, 2015 at 04:27:54PM +0800, Chao Peng wrote:
> > On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > > > tglx
> > > 
> > > Again: you don't need to look into the MSR table and relate it 
> > > to tasks if you store the data as:
> > > 
> > >   task group 1 = {
> > >   reservation-1 = {size = 80Kb, type = data, socketmask = 
> > > 0x},
> > >   reservation-2 = {size = 100Kb, type = code, socketmask 
> > > = 0x}
> > >   }
> > >   
> > >   task group 2 = {
> > >   reservation-1 = {size = 80Kb, type = data, socketmask = 
> > > 0x},
> > >   reservation-3 = {size = 200Kb, type = code, socketmask 
> > > = 0x}
> > >   }
> > > 
> > > Task group 1 and task group 2 share reservation-1.
> > 
> > Because there is only size but not CBM position info, I guess for
> > different reservations they will not overlap each other, right?
> 
> Reservation 1 is shared between task group 1 and task group 2 
> so the CBMs overlap (by 80Kb, rounded).
> 
> > Personally I like this way of exposing minimal information to userspace.
> > I can think it working well except for one concern of losing flexibility:
> > 
> > For instance, there is a box for which the full CBM is 0xf. After
> > cache reservation creating/freeing for a while we then have reservations:
> > 
> > reservation1: 0xf
> > reservation2: 0x00ff0
> > 
> > Now people want to request a reservation which size is 0xff, so how
> > will kernel do at this time? It could return just error or do some
> > moving/merging (e.g. for reservation2: 0x00ff0 => 0x0ff00) and then
> > satisfy the request. But I don't know if the moving/merging will cause
> > delay for tasks that is using it.
> 
> Right, i was thinking of adding a "force" parameter. 
> 
> So, default behaviour of attach: do not merge.
> "force" behaviour of attach: move reservations around and merge if
> necessary.

To make the decision userspace would need to know that a merge can
be performed if particular reservations can be moved (that is, the
moveable property is per-reservation, depending on whether it's ok
for the given app to take cacheline faults or not).
Anyway, that's for later.
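
For illustration only, the reservation description discussed here,
including the per-reservation moveable property, could be expressed
roughly like this; field and type names are hypothetical, not a proposed
API:

/*
 * Illustrative sketch only; field and type names are hypothetical, not a
 * proposed kernel API.
 */
#include <stdint.h>
#include <stdio.h>

enum rsv_type { RSV_DATA, RSV_CODE };

struct cache_reservation {
	uint32_t size_kb;	/* requested size, rounded to cbm granularity */
	enum rsv_type type;	/* data or code (with CDP) */
	uint64_t socketmask;	/* sockets on which the reservation applies */
	int moveable;		/* may it be relocated to merge free bits? */
};

int main(void)
{
	/* e.g. task group 1: a shared data reservation and a code reservation */
	struct cache_reservation rsv1 = { 80, RSV_DATA, 0xf, 1 };
	struct cache_reservation rsv2 = { 100, RSV_CODE, 0xf, 0 };

	printf("rsv1: %u KB, moveable=%d\n", rsv1.size_kb, rsv1.moveable);
	printf("rsv2: %u KB, moveable=%d\n", rsv2.size_kb, rsv2.moveable);
	return 0;
}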








Re: [RFD] CAT user space interface revisited

2015-11-24 Thread Chao Peng
On Wed, Nov 18, 2015 at 10:01:54PM -0200, Marcelo Tosatti wrote:
> > tglx
> 
> Again: you don't need to look into the MSR table and relate it 
> to tasks if you store the data as:
> 
>   task group 1 = {
>   reservation-1 = {size = 80Kb, type = data, socketmask = 
> 0x},
>   reservation-2 = {size = 100Kb, type = code, socketmask 
> = 0x}
>   }
>   
>   task group 2 = {
>   reservation-1 = {size = 80Kb, type = data, socketmask = 
> 0x},
>   reservation-3 = {size = 200Kb, type = code, socketmask 
> = 0x}
>   }
> 
> Task group 1 and task group 2 share reservation-1.

Because there is only size but not CBM position info, I guess for
different reservations they will not overlap each other, right?

Personally I like this way of exposing minimal information to userspace.
I can see it working well except for one concern about losing flexibility:

For instance, there is a box for which the full CBM is 0xf. After
cache reservation creating/freeing for a while we then have reservations:

reservation1: 0xf
reservation2: 0x00ff0

Now people want to request a reservation whose size is 0xff, so what
will the kernel do in this case? It could just return an error, or do
some moving/merging (e.g. for reservation2: 0x00ff0 => 0x0ff00) and then
satisfy the request. But I don't know if the moving/merging will cause
delays for the tasks that are using it.

Thanks,
Chao


Re: [RFD] CAT user space interface revisited

2015-11-23 Thread Chao Peng
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> 
> Let's look at partitioning itself. We have two options:
> 
>1) Per task partitioning
> 
>2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.

I would second this. In practice per CPU partitioning is useful for
realtime as well. And I can see three possible solutions:

 1) What you suggested below, to address both problems in one
framework. But I wonder if it would end up being too complex.

 2) Achieve per CPU partitioning with per task partitioning. For
example, if the current CAT patch can solve the kernel threads
problem, then together with CPU pinning we can set the same CBM
for all the tasks/kernel threads that run on an isolated CPU.

 3) I wonder if it is feasible to separate the two requirements? For
example, divide the work into three components: rdt-base, a
per task interface (the current cgroup interface/IOCTL or something)
and a per CPU interface. The two interfaces are exclusive and
selected at build time. One argument against this option is that
even with per CPU partitioning we may still need per task partitioning,
in which case we go back to option 1) again.

Thanks,
Chao


Re: [RFD] CAT user space interface revisited

2015-11-20 Thread Marcelo Tosatti
On Fri, Nov 20, 2015 at 08:53:34AM +0100, Thomas Gleixner wrote:
> On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> > On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > > Actually, there is a point that is useful: you might want the important
> > > > application to share the L3 portion with HW (that HW DMAs into), and
> > > > have only the application and the HW use that region.
> > > > 
> > > > So its a good point that controlling the exact position of the 
> > > > reservation 
> > > > is important.
> > > 
> > > I'm glad you figured that out yourself. :)
> > > 
> > > Thanks,
> > > 
> > >   tglx
> > 
> > The HW is a reclaimer of the L3 region shared with HW.
> > 
> > You might want to remove any threads from reclaiming from 
> > that region.
> 
> I might for some threads, but certainly not for those which need to
> access DMA buffers.

Yes, when I wrote "it's a good point that controlling the exact position
of the reservation is important" I had that in mind as well.

But it's wrong: not having a bit set in the CBM for the portion of L3
cache which is shared with HW only means "for cacheline misses of the
application, evict cachelines from this portion".

So yes, you might want to exclude the application which accesses DMA
buffers from reclaiming cachelines in the portion shared with HW,
to keep those cachelines longer in L3.

> Throwing away 10% of L3 just because you don't
> want to deal with it at the interface level is hilarious.

If there is interest in per-application configuration then it can
be integrated as well.

Thanks for your time.



Re: [RFD] CAT user space interface revisited

2015-11-20 Thread Marcelo Tosatti
On Thu, Nov 19, 2015 at 09:35:34AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > > Assume that you have isolated a CPU and run your important task on
> > > > it. You give that task a slice of cache. Now that task needs kernel
> > > > services which run in kernel threads on that CPU. We really don't want
> > > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > > worker threads, softirq threads ) and give them another slice of
> > > > cache. What we really want is:
> > > > 
> > > >  1 1 1 1 0 0 0 0<- Default cache
> > > >  0 0 0 0 1 1 1 0<- Cache for important task
> > > >  0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > > > 
> > > > It would even be sufficient for particular use cases to just associate
> > > > a piece of cache to a given CPU and do not bother with tasks at all.
> > 
> > Well any work on behalf of the important task, should have its cache
> > protected as well (example irq handling threads). 
> 
> Right, but that's nothing you can do automatically and certainly not
> from a random application.
> 
> > But for certain kernel tasks for which L3 cache is not beneficial
> > (eg: kernel samepage merging), it might useful to exclude such tasks
> > from the "important, do not flush" L3 cache portion.
> 
> Sure it might be useful, but this needs to be done on a case by case
> basis and there is no way to do this in any automated way.
>  
> > > > It's hard. Policies are hard by definition, but this one is harder
> > > > than most other policies due to the inherent limitations.
> > 
> > That is exactly why it should be allowed for software to automatically 
> > configure the policies.
> 
> There is nothing you can do automatically. 

Every cacheline brought into the L3 has a reaccess time (the time from
when it was first brought in to when it is reaccessed).

Assume you have a single threaded app, i.e. a sequence of cacheline
accesses.

Now if there are groups of accesses which have long reaccess times
(meaning that keeping them in L3 is not beneficial) and that are large
enough to justify the OS notification, the application can notify the OS
to switch to a constrained COSid (so that L3 misses reclaim from that
small portion of the L3 cache).

> If you want to allow
> applications to set the policies themself, then you need to assign a
> portion of the bitmask space and a portion of the cos id space to that
> application and then let it do with that space what it wants.

That's why you should specify the requirements independently of each
other (the requirement in this case being the size and type of the
reservation, which is tied to the application), and let something else
figure out how they all fit together.
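
As an illustrative sketch of that "something else figures out the fit":
a first-fit placement of a reservation into a socket's free cbm bits,
honoring the CAT requirement that the selected bits be consecutive. Not
a proposed kernel algorithm, just the idea:

/*
 * Illustrative sketch only: first-fit placement of a reservation needing
 * "bits" contiguous cbm bits into a socket's free mask.
 */
#include <stdio.h>

static long fit_reservation(unsigned long free_mask, int total_bits, int bits)
{
	unsigned long want = (1ul << bits) - 1;

	for (int pos = 0; pos + bits <= total_bits; pos++)
		if ((free_mask & (want << pos)) == (want << pos))
			return (long)(want << pos);	/* cbm to use */
	return -1;				/* no contiguous run free */
}

int main(void)
{
	/* 20-bit cbm, bits 0-3 and 8-11 already taken */
	unsigned long free_mask = 0xfffff & ~0x00f0ful;
	long cbm = fit_reservation(free_mask, 20, 4);

	if (cbm >= 0)
		printf("place reservation at cbm 0x%lx\n", cbm);
	return 0;
}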

> That's where cgroups come into play. But that does not solve the other
> issues of "global" configuration, i.e. CPU defaults etc.

I don't understand what you mean by issues of global configuration.

CPU defaults: A task is associated with a COSid. A COSid points to 
a set of CBMs (one CBM per socket). What defaults are you talking about?
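
A minimal sketch of that association, purely to illustrate the data
layout (not kernel code; the sizes and names are made up):

/*
 * Illustrative data layout only: a task carries a COSid, and each
 * (socket, COSid) pair selects a cache bit mask.
 */
#include <stdint.h>
#include <stdio.h>

#define NR_SOCKETS 2
#define NR_COSIDS  16

static uint32_t cbm[NR_SOCKETS][NR_COSIDS];	/* per-socket CBM per COSid */

struct task { int cosid; };

static uint32_t effective_cbm(const struct task *t, int socket)
{
	return cbm[socket][t->cosid];
}

int main(void)
{
	struct task t = { .cosid = 1 };

	cbm[0][1] = 0x0f;	/* COSid 1 on socket 0 */
	cbm[1][1] = 0xf0;	/* same COSid, different mask on socket 1 */

	printf("socket 0: 0x%x, socket 1: 0x%x\n",
	       effective_cbm(&t, 0), effective_cbm(&t, 1));
	return 0;
}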

But the interfaces do not exclude each other (the ioctl or syscall
interfaces and the manual direct MSR interface can coexist). There is
time pressure to integrate something workable for the present use cases
(none are in the class "applications set reservations themselves").

Peter has some objections against ioctls. So for something workable,
we'll have to handle the numbered issues pointed out in the other e-mail
(2, 3, 4) in userspace.




Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Thomas Gleixner
On Thu, 19 Nov 2015, Marcelo Tosatti wrote:
> On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> > On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > > Actually, there is a point that is useful: you might want the important
> > > application to share the L3 portion with HW (that HW DMAs into), and
> > > have only the application and the HW use that region.
> > > 
> > > So its a good point that controlling the exact position of the 
> > > reservation 
> > > is important.
> > 
> > I'm glad you figured that out yourself. :)
> > 
> > Thanks,
> > 
> > tglx
> 
> The HW is a reclaimer of the L3 region shared with HW.
> 
> You might want to prevent any threads from reclaiming from 
> that region.

I might for some threads, but certainly not for those which need to
access DMA buffers. Throwing away 10% of L3 just because you don't
want to deal with it at the interface level is hilarious.

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Marcelo Tosatti
On Thu, Nov 19, 2015 at 10:09:03AM +0100, Thomas Gleixner wrote:
> On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> > Actually, there is a point that is useful: you might want the important
> > application to share the L3 portion with HW (that HW DMAs into), and
> > have only the application and the HW use that region.
> > 
> > So it's a good point that controlling the exact position of the reservation 
> > is important.
> 
> I'm glad you figured that out yourself. :)
> 
> Thanks,
> 
>   tglx

The HW is a reclaimer of the L3 region shared with HW.

You might want to prevent any threads from reclaiming from 
that region.



Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Marcelo Tosatti
On Wed, Nov 18, 2015 at 11:05:35PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > Folks!
> > > 
> > > After rereading the mail flood on CAT and staring into the SDM for a
> > > while, I think we all should sit back and look at it from scratch
> > > again w/o our preconceptions - I certainly had to put my own away.
> > > 
> > > Let's look at the properties of CAT again:
> > > 
> > >- It's a per socket facility
> > > 
> > >- CAT slots can be associated to external hardware. This
> > >  association is per socket as well, so different sockets can have
> > >  different behaviour. I missed that detail when staring the first
> > >  time, thanks for the pointer!
> > > 
> > >- The association itself is per cpu. The COS selection happens on a
> > >  CPU while the set of masks which are selected via COS are shared
> > >  by all CPUs on a socket.
> > > 
> > > There are restrictions which CAT imposes in terms of configurability:
> > > 
> > >- The bits which select a cache partition need to be consecutive
> > > 
> > >- The number of possible cache association masks is limited
> > > 
> > > Let's look at the configurations (CDP omitted and size restricted)
> > > 
> > > Default:   1 1 1 1 1 1 1 1
> > >  1 1 1 1 1 1 1 1
> > >  1 1 1 1 1 1 1 1
> > >  1 1 1 1 1 1 1 1
> > > 
> > > Shared:  1 1 1 1 1 1 1 1
> > >  0 0 1 1 1 1 1 1
> > >  0 0 0 0 1 1 1 1
> > >  0 0 0 0 0 0 1 1
> > > 
> > > Isolated:  1 1 1 1 0 0 0 0
> > >  0 0 0 0 1 1 0 0
> > >  0 0 0 0 0 0 1 0
> > >  0 0 0 0 0 0 0 1
> > > 
> > > Or any combination thereof. Surely some combinations will not make any
> > > sense, but we really should not make any restrictions on the stupidity
> > > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > > so what?
> > > 
> > > Now that gets even more convoluted if CDP comes into play and we
> > > really need to look at CDP right now. We might end up with something
> > > which looks like this:
> > > 
> > >  1 1 1 1 0 0 0 0  Code
> > >  1 1 1 1 0 0 0 0  Data
> > >  0 0 0 0 0 0 1 0  Code
> > >  0 0 0 0 1 1 0 0  Data
> > >  0 0 0 0 0 0 0 1  Code
> > >  0 0 0 0 1 1 0 0  Data
> > > or 
> > >  0 0 0 0 0 0 0 1  Code
> > >  0 0 0 0 1 1 0 0  Data
> > >  0 0 0 0 0 0 0 1  Code
> > >  0 0 0 0 0 1 1 0  Data
> > > 
> > > Let's look at partitioning itself. We have two options:
> > > 
> > >1) Per task partitioning
> > > 
> > >2) Per CPU partitioning
> > > 
> > > So far we only talked about #1, but I think that #2 has a value as
> > > well. Let me give you a simple example.
> > > 
> > > Assume that you have isolated a CPU and run your important task on
> > > it. You give that task a slice of cache. Now that task needs kernel
> > > services which run in kernel threads on that CPU. We really don't want
> > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > worker threads, softirq threads ) and give them another slice of
> > > cache. What we really want is:
> > > 
> > >1 1 1 1 0 0 0 0<- Default cache
> > >0 0 0 0 1 1 1 0<- Cache for important task
> > >0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > > 
> > > It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> > > 
> > > We really need to make this as configurable as possible from userspace
> > > without imposing random restrictions to it. I played around with it on
> > > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > > enabled) makes it really useless if we force the ids to have the same
> > > meaning on all sockets and restrict it to per task partitioning.
> > > 
> > > Even if next generation systems will have more COS ids available,
> > > there are not going to be enough to have a system wide consistent
> > > view unless we have COS ids > nr_cpus.
> > > 
> > > Aside of that I don't think that a system wide consistent view is
> > > useful at all.
> > > 
> > >  - If a task migrates between sockets, it's going to suffer anyway.
> > >Real sensitive applications will simply pin tasks on a socket to
> > >avoid that in the first place. If we make the whole thing
> > >configurable enough then the sysadmin can set it up to support
> > >even the nonsensical case of identical cache partitions on all
> > >sockets and let tasks use the corresponding partitions when
> > >migrating.
> > > 
> > >  - The number of cache slices is going to be limited no matter what,
> > >so one still has to come up with a sensible partitioning scheme.
> > > 
> > >  - Even if we have enough cos ids the system wide view will not make
> > >the configuration problem any simpler as it remains per socket.
> > > 

Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Luiz Capitulino
On Thu, 19 Nov 2015 09:35:34 +0100 (CET)
Thomas Gleixner  wrote:

> > Well, any work on behalf of the important task should have its cache
> > protected as well (for example, irq handling threads).
> 
> Right, but that's nothing you can do automatically and certainly not
> from a random application.

Right, and that's not a problem. For the use cases CAT is intended for,
manual and per-workload system setup is very common. Things like
thread pinning, hugepage reservation, CPU isolation, nohz_full, etc.
require manual setup too.


Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Thomas Gleixner
On Wed, 18 Nov 2015, Marcelo Tosatti wrote
> Actually, there is a point that is useful: you might want the important
> application to share the L3 portion with HW (that HW DMAs into), and
> have only the application and the HW use that region.
> 
> So it's a good point that controlling the exact position of the reservation 
> is important.

I'm glad you figured that out yourself. :)

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Thomas Gleixner
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> 
> No you don't.

Because you have a use case which allows you to write some policy
translator? I seriously doubt that it is general enough.
 
> Again: you don't need to look into the MSR table and relate it 
> to tasks if you store the data as:
> 
>   task group 1 = {
>   reservation-1 = {size = 80Kb, type = data, socketmask = 
> 0x},
>   reservation-2 = {size = 100Kb, type = code, socketmask 
> = 0x}
>   }
>   
>   task group 2 = {
>   reservation-1 = {size = 80Kb, type = data, socketmask = 
> 0x},
>   reservation-3 = {size = 200Kb, type = code, socketmask 
> = 0x}
>   }
> 
> Task group 1 and task group 2 share reservation-1.
> 
> This is what userspace is going to expose to users, of course.


 
> If you expose the MSRs to userspace, you force userspace to convert
> from this format to the MSRs (minding whether there
> are contiguous regions available, and the region shared with HW).

Fair enough. I'm not too fond of exposing the MSRs, but I chose this
just to explain the full problem space and the various requirements we
might have across the full application space.

If we can come up with an abstract way which does not impose
restrictions on the overall configuration abilities, I'm all for it.

> - The bits which select a cache partition need to be consecutive
> 
> BUT, for our use case the cgroups interface works as well, so let's
> go with that (Tejun apparently had a use case where tasks were allowed to
> set reservations themselves, in response to external events).

Can you please set aside your narrow use case view for a moment and
just think about the full application space? We are not designing such
an interface for a single use case.

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Thomas Gleixner
On Wed, 18 Nov 2015, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> > On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > > Assume that you have isolated a CPU and run your important task on
> > > it. You give that task a slice of cache. Now that task needs kernel
> > > services which run in kernel threads on that CPU. We really don't want
> > > to (and cannot) hunt down random kernel threads (think cpu bound
> > > worker threads, softirq threads ) and give them another slice of
> > > cache. What we really want is:
> > > 
> > >1 1 1 1 0 0 0 0<- Default cache
> > >0 0 0 0 1 1 1 0<- Cache for important task
> > >0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > > 
> > > It would even be sufficient for particular use cases to just associate
> > > a piece of cache to a given CPU and do not bother with tasks at all.
> 
> Well, any work on behalf of the important task should have its cache
> protected as well (for example, irq handling threads).

Right, but that's nothing you can do automatically and certainly not
from a random application.

> But for certain kernel tasks for which L3 cache is not beneficial
> (e.g. kernel samepage merging), it might be useful to exclude such tasks
> from the "important, do not flush" L3 cache portion.

Sure it might be useful, but this needs to be done on a case by case
basis and there is no way to do this in any automated way.
 
> > > It's hard. Policies are hard by definition, but this one is harder
> > > than most other policies due to the inherent limitations.
> 
> That is exactly why software should be allowed to automatically
> configure the policies.

There is nothing you can do automatically. If you want to allow
applications to set the policies themselves, then you need to assign a
portion of the bitmask space and a portion of the cos id space to that
application and then let it do with that space what it wants.
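
(Such a delegated share would be little more than a range of COSids
plus a contiguous range of mask bits, e.g. -- names made up purely for
illustration:

struct cat_delegation {
    int cos_first, cos_last;    /* COSids the application may use */
    int bit_first, bit_last;    /* contiguous CBM bits it may touch */
};

and whatever the application configures has to stay inside that
window.)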

That's where cgroups come into play. But that does not solve the other
issues of "global" configuration, i.e. CPU defaults etc.

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-11-19 Thread Thomas Gleixner
Marcelo,

On Wed, 18 Nov 2015, Marcelo Tosatti wrote:

Can you please trim your replies? It's really annoying having to
search for a single line of reply.

> The cgroups interface works, but moves the problem of contiguous
> allocation to userspace, and is incompatible with cache allocations
> on demand.
>
> Have to solve the kernel threads VS cgroups issue...

Sorry, I have no idea what you want to tell me.

Thanks,

tglx


Re: [RFD] CAT user space interface revisited

2015-11-18 Thread Marcelo Tosatti
On Wed, Nov 18, 2015 at 10:01:53PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a
> > while, I think we all should sit back and look at it from scratch
> > again w/o our preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >- It's a per socket facility
> > 
> >- CAT slots can be associated to external hardware. This
> >  association is per socket as well, so different sockets can have
> >  different behaviour. I missed that detail when staring the first
> >  time, thanks for the pointer!
> > 
> >- The association itself is per cpu. The COS selection happens on a
> >  CPU while the set of masks which are selected via COS are shared
> >  by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >- The bits which select a cache partition need to be consecutive
> > 
> >- The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> > 
> > Shared:1 1 1 1 1 1 1 1
> >0 0 1 1 1 1 1 1
> >0 0 0 0 1 1 1 1
> >0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> >0 0 0 0 1 1 0 0
> >0 0 0 0 0 0 1 0
> >0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity
> > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we
> > really need to look at CDP right now. We might end up with something
> > which looks like this:
> > 
> >1 1 1 1 0 0 0 0  Code
> >1 1 1 1 0 0 0 0  Data
> >0 0 0 0 0 0 1 0  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> > or 
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 0 1 1 0  Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >1) Per task partitioning
> > 
> >2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on
> > it. You give that task a slice of cache. Now that task needs kernel
> > services which run in kernel threads on that CPU. We really don't want
> > to (and cannot) hunt down random kernel threads (think cpu bound
> > worker threads, softirq threads ) and give them another slice of
> > cache. What we really want is:
> > 
> >  1 1 1 1 0 0 0 0<- Default cache
> >  0 0 0 0 1 1 1 0<- Cache for important task
> >  0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.
> > 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent
> > view unless we have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >Real sensitive applications will simply pin tasks on a socket to
> >avoid that in the first place. If we make the whole thing
> >configurable enough then the sysadmin can set it up to support
> >even the nonsensical case of identical cache partitions on all
> >sockets and let tasks use the corresponding partitions when
> >migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >the configuration problem any simpler as it remains per socket.
> > 
> > It's hard. Policies are hard by definition, but this one is harder
> > than most other policies due to the inherent limitations.
> > 
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as

Re: [RFD] CAT user space interface revisited

2015-11-18 Thread Marcelo Tosatti
On Wed, Nov 18, 2015 at 08:34:07PM -0200, Marcelo Tosatti wrote:
> On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> > Folks!
> > 
> > After rereading the mail flood on CAT and staring into the SDM for a
> > while, I think we all should sit back and look at it from scratch
> > again w/o our preconceptions - I certainly had to put my own away.
> > 
> > Let's look at the properties of CAT again:
> > 
> >- It's a per socket facility
> > 
> >- CAT slots can be associated to external hardware. This
> >  association is per socket as well, so different sockets can have
> >  different behaviour. I missed that detail when staring the first
> >  time, thanks for the pointer!
> > 
> >- The association itself is per cpu. The COS selection happens on a
> >  CPU while the set of masks which are selected via COS are shared
> >  by all CPUs on a socket.
> > 
> > There are restrictions which CAT imposes in terms of configurability:
> > 
> >- The bits which select a cache partition need to be consecutive
> > 
> >- The number of possible cache association masks is limited
> > 
> > Let's look at the configurations (CDP omitted and size restricted)
> > 
> > Default:   1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> >1 1 1 1 1 1 1 1
> > 
> > Shared:1 1 1 1 1 1 1 1
> >0 0 1 1 1 1 1 1
> >0 0 0 0 1 1 1 1
> >0 0 0 0 0 0 1 1
> > 
> > Isolated:  1 1 1 1 0 0 0 0
> >0 0 0 0 1 1 0 0
> >0 0 0 0 0 0 1 0
> >0 0 0 0 0 0 0 1
> > 
> > Or any combination thereof. Surely some combinations will not make any
> > sense, but we really should not make any restrictions on the stupidity
> > of a sysadmin. The worst outcome might be L3 disabled for everything,
> > so what?
> > 
> > Now that gets even more convoluted if CDP comes into play and we
> > really need to look at CDP right now. We might end up with something
> > which looks like this:
> > 
> >1 1 1 1 0 0 0 0  Code
> >1 1 1 1 0 0 0 0  Data
> >0 0 0 0 0 0 1 0  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> > or 
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 1 1 0 0  Data
> >0 0 0 0 0 0 0 1  Code
> >0 0 0 0 0 1 1 0  Data
> > 
> > Let's look at partitioning itself. We have two options:
> > 
> >1) Per task partitioning
> > 
> >2) Per CPU partitioning
> > 
> > So far we only talked about #1, but I think that #2 has a value as
> > well. Let me give you a simple example.
> > 
> > Assume that you have isolated a CPU and run your important task on
> > it. You give that task a slice of cache. Now that task needs kernel
> > services which run in kernel threads on that CPU. We really don't want
> > to (and cannot) hunt down random kernel threads (think cpu bound
> > worker threads, softirq threads ) and give them another slice of
> > cache. What we really want is:
> > 
> >  1 1 1 1 0 0 0 0<- Default cache
> >  0 0 0 0 1 1 1 0<- Cache for important task
> >  0 0 0 0 0 0 0 1<- Cache for CPU of important task
> > 
> > It would even be sufficient for particular use cases to just associate
> > a piece of cache to a given CPU and do not bother with tasks at all.

Well, any work on behalf of the important task should have its cache
protected as well (for example, irq handling threads).

But for certain kernel tasks for which L3 cache is not beneficial
(e.g. kernel samepage merging), it might be useful to exclude such tasks
from the "important, do not flush" L3 cache portion.

> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> > 
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent
> > view unless we have COS ids > nr_cpus.
> > 
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> > 
> >  - If a task migrates between sockets, it's going to suffer anyway.
> >Real sensitive applications will simply pin tasks on a socket to
> >avoid that in the first place. If we make the whole thing
> >configurable enough then the sysadmin can set it up to support
> >even the nonsensical case of identical cache partitions on all
> >sockets and let tasks use the corresponding partitions when
> >migrating.
> > 
> >  - The number of cache slices is going to be limited no matter what,
> >so one still has to come up with a sensible partitioning scheme.
> > 
> >  - Even if we have enough cos ids the system wide view will not make
> >the configuration pr

Re: [RFD] CAT user space interface revisited

2015-11-18 Thread Marcelo Tosatti
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a
> while, I think we all should sit back and look at it from scratch
> again w/o our preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>- It's a per socket facility
> 
>- CAT slots can be associated to external hardware. This
>  association is per socket as well, so different sockets can have
>  different behaviour. I missed that detail when staring the first
>  time, thanks for the pointer!
> 
>- The association itself is per cpu. The COS selection happens on a
>  CPU while the set of masks which are selected via COS are shared
>  by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>- The bits which select a cache partition need to be consecutive
> 
>- The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
> 
> Shared:  1 1 1 1 1 1 1 1
>  0 0 1 1 1 1 1 1
>  0 0 0 0 1 1 1 1
>  0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
>  0 0 0 0 1 1 0 0
>  0 0 0 0 0 0 1 0
>  0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity
> of a sysadmin. The worst outcome might be L3 disabled for everything,
> so what?
> 
> Now that gets even more convoluted if CDP comes into play and we
> really need to look at CDP right now. We might end up with something
> which looks like this:
> 
>  1 1 1 1 0 0 0 0  Code
>  1 1 1 1 0 0 0 0  Data
>  0 0 0 0 0 0 1 0  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
> or 
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 0 1 1 0  Data
> 
> Let's look at partitioning itself. We have two options:
> 
>1) Per task partitioning
> 
>2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on
> it. You give that task a slice of cache. Now that task needs kernel
> services which run in kernel threads on that CPU. We really don't want
> to (and cannot) hunt down random kernel threads (think cpu bound
> worker threads, softirq threads ) and give them another slice of
> cache. What we really want is:
> 
>1 1 1 1 0 0 0 0<- Default cache
>0 0 0 0 1 1 1 0<- Cache for important task
>0 0 0 0 0 0 0 1<- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>Real sensitive applications will simply pin tasks on a socket to
>avoid that in the first place. If we make the whole thing
>configurable enough then the sysadmin can set it up to support
>even the nonsensical case of identical cache partitions on all
>sockets and let tasks use the corresponding partitions when
>migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder
> than most other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.

No you don't.

> I was not able to identify any existing infrastructure where this
> r

Re: [RFD] CAT user space interface revisited

2015-11-18 Thread Marcelo Tosatti
On Wed, Nov 18, 2015 at 07:25:03PM +0100, Thomas Gleixner wrote:
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a
> while, I think we all should sit back and look at it from scratch
> again w/o our preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>- It's a per socket facility
> 
>- CAT slots can be associated to external hardware. This
>  association is per socket as well, so different sockets can have
>  different behaviour. I missed that detail when staring the first
>  time, thanks for the pointer!
> 
>- The association itself is per cpu. The COS selection happens on a
>  CPU while the set of masks which are selected via COS are shared
>  by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>- The bits which select a cache partition need to be consecutive
> 
>- The number of possible cache association masks is limited
> 
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:   1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
>  1 1 1 1 1 1 1 1
> 
> Shared:  1 1 1 1 1 1 1 1
>  0 0 1 1 1 1 1 1
>  0 0 0 0 1 1 1 1
>  0 0 0 0 0 0 1 1
> 
> Isolated:  1 1 1 1 0 0 0 0
>  0 0 0 0 1 1 0 0
>  0 0 0 0 0 0 1 0
>  0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity
> of a sysadmin. The worst outcome might be L3 disabled for everything,
> so what?
> 
> Now that gets even more convoluted if CDP comes into play and we
> really need to look at CDP right now. We might end up with something
> which looks like this:
> 
>  1 1 1 1 0 0 0 0  Code
>  1 1 1 1 0 0 0 0  Data
>  0 0 0 0 0 0 1 0  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
> or 
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 0 1 1 0  Data
> 
> Let's look at partitioning itself. We have two options:
> 
>1) Per task partitioning
> 
>2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as
> well. Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on
> it. You give that task a slice of cache. Now that task needs kernel
> services which run in kernel threads on that CPU. We really don't want
> to (and cannot) hunt down random kernel threads (think cpu bound
> worker threads, softirq threads ) and give them another slice of
> cache. What we really want is:
> 
>1 1 1 1 0 0 0 0<- Default cache
>0 0 0 0 1 1 1 0<- Cache for important task
>0 0 0 0 0 0 0 1<- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate
> a piece of cache to a given CPU and do not bother with tasks at all.
> 
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>Real sensitive applications will simply pin tasks on a socket to
>avoid that in the first place. If we make the whole thing
>configurable enough then the sysadmin can set it up to support
>even the nonsensical case of identical cache partitions on all
>sockets and let tasks use the corresponding partitions when
>migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder
> than most other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
> 
> I was not able to identify any existing infrastructure where this
> really fits in

RE: [RFD] CAT user space interface revisited

2015-11-18 Thread Auld, Will
+Tony

> -Original Message-
> From: Luiz Capitulino [mailto:lcapitul...@redhat.com]
> Sent: Wednesday, November 18, 2015 11:38 AM
> To: Thomas Gleixner
> Cc: LKML; Peter Zijlstra; x...@kernel.org; Marcelo Tosatti; Shivappa, Vikas; 
> Tejun
> Heo; Yu, Fenghua; Auld, Will; Dugger, Donald D; r...@redhat.com
> Subject: Re: [RFD] CAT user space interface revisited
> 
> On Wed, 18 Nov 2015 19:25:03 +0100 (CET) Thomas Gleixner
>  wrote:
> 
> > We really need to make this as configurable as possible from userspace
> > without imposing random restrictions to it. I played around with it on
> > my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> > enabled) makes it really useless if we force the ids to have the same
> > meaning on all sockets and restrict it to per task partitioning.
> >
> > Even if next generation systems will have more COS ids available,
> > there are not going to be enough to have a system wide consistent view
> > unless we have COS ids > nr_cpus.
> >
> > Aside of that I don't think that a system wide consistent view is
> > useful at all.
> 
> This is a great writeup! I agree with everything you said.
> 
> > So now to the interface part. Unfortunately we need to expose this
> > very close to the hardware implementation as there are really no
> > abstractions which allow us to express the various bitmap
> > combinations. Any abstraction I tried to come up with renders that
> > thing completely useless.
> >
> > I was not able to identify any existing infrastructure where this
> > really fits in. I chose a directory/file based representation. We
> > certainly could do the same with a syscall, but that's just an
> > implementation detail.
> >
> > At top level:
> >
> >xxx/cat/max_cosids   <- Assume that all CPUs are the same
> >xxx/cat/max_maskbits <- Assume that all CPUs are the same
> >xxx/cat/cdp_enable   <- Depends on CDP availability
> >
> > Per socket data:
> >
> >xxx/cat/socket-0/
> >...
> >xxx/cat/socket-N/l3_size
> >xxx/cat/socket-N/hwsharedbits
> >
> > Per socket mask data:
> >
> >xxx/cat/socket-N/cos-id-0/
> >...
> >xxx/cat/socket-N/cos-id-N/inuse
> > /cat_mask
> > /cdp_mask   <- Data mask if CDP enabled
> >
> > Per cpu default cos id for the cpus on that socket:
> >
> >xxx/cat/socket-N/cpu-x/default_cosid
> >...
> >xxx/cat/socket-N/cpu-N/default_cosid
> >
> > The above allows a simple cpu based partitioning. All tasks which do
> > not have a cache partition assigned on a particular socket use the
> > default one of the cpu they are running on.
> >
> > Now for the task(s) partitioning:
> >
> >xxx/cat/partitions/
> >
> > Under that directory one can create partitions
> >
> >xxx/cat/partitions/p1/tasks
> > /socket-0/cosid
> > ...
> > /socket-n/cosid
> >
> >The default value for the per socket cosid is COSID_DEFAULT, which
> >causes the task(s) to use the per cpu default id.
> 
> I hope I've got all the details right, but this proposal looks awesome.
> There's more people who seem to agree with something like this.
> 
> Btw, I think it should be possible to implement this with cgroups. But I too 
> don't
> care that much on cgroups vs. syscalls.


Re: [RFD] CAT user space interface revisited

2015-11-18 Thread Luiz Capitulino
On Wed, 18 Nov 2015 19:25:03 +0100 (CET)
Thomas Gleixner  wrote:

> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on
> my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same
> meaning on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available,
> there are not going to be enough to have a system wide consistent
> view unless we have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is
> useful at all.

This is a great writeup! I agree with everything you said.

> So now to the interface part. Unfortunately we need to expose this
> very close to the hardware implementation as there are really no
> abstractions which allow us to express the various bitmap
> combinations. Any abstraction I tried to come up with renders that
> thing completely useless.
> 
> I was not able to identify any existing infrastructure where this
> really fits in. I chose a directory/file based representation. We
> certainly could do the same with a syscall, but that's just an
> implementation detail.
> 
> At top level:
> 
>xxx/cat/max_cosids <- Assume that all CPUs are the same
>xxx/cat/max_maskbits   <- Assume that all CPUs are the same
>xxx/cat/cdp_enable <- Depends on CDP availability
> 
> Per socket data:
> 
>xxx/cat/socket-0/
>...
>xxx/cat/socket-N/l3_size
>xxx/cat/socket-N/hwsharedbits
> 
> Per socket mask data:
> 
>xxx/cat/socket-N/cos-id-0/
>...
>xxx/cat/socket-N/cos-id-N/inuse
>   /cat_mask   
>   /cdp_mask   <- Data mask if CDP enabled
> 
> Per cpu default cos id for the cpus on that socket:
> 
>xxx/cat/socket-N/cpu-x/default_cosid
>...
>xxx/cat/socket-N/cpu-N/default_cosid
> 
> The above allows a simple cpu based partitioning. All tasks which do
> not have a cache partition assigned on a particular socket use the
> default one of the cpu they are running on.
> 
> Now for the task(s) partitioning:
> 
>xxx/cat/partitions/
> 
> Under that directory one can create partitions
> 
>xxx/cat/partitions/p1/tasks
>   /socket-0/cosid
>   ...
>   /socket-n/cosid
> 
>The default value for the per socket cosid is COSID_DEFAULT, which
>causes the task(s) to use the per cpu default id.

I hope I've got all the details right, but this proposal looks awesome.
There's more people who seem to agree with something like this.

Btw, I think it should be possible to implement this with cgroups. But
I too don't care that much on cgroups vs. syscalls.
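
Just to make the usage concrete, here is a sketch of how the proposed
layout could be consumed. Note this is an RFD: nothing of it is
implemented, and the /sys/fs/cat mount point, the mask value and the
pid below are made-up examples.

#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Write a value into one of the proposed interface files. */
static void cat_write(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        exit(EXIT_FAILURE);
    }
    fprintf(f, "%s\n", val);
    fclose(f);
}

int main(void)
{
    /* Give cos-id 2 on socket 0 a small slice of the cache. */
    cat_write("/sys/fs/cat/socket-0/cos-id-2/cat_mask", "0x0e");

    /* Make cos-id 2 the default for (isolated) CPU 3 on socket 0. */
    cat_write("/sys/fs/cat/socket-0/cpu-3/default_cosid", "2");

    /* Create a partition, bind it to cos-id 2 on socket 0 and move
     * the important task (pid 1234) into it. */
    mkdir("/sys/fs/cat/partitions/p1", 0755);
    cat_write("/sys/fs/cat/partitions/p1/socket-0/cosid", "2");
    cat_write("/sys/fs/cat/partitions/p1/tasks", "1234");

    return 0;
}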


[RFD] CAT user space interface revisited

2015-11-18 Thread Thomas Gleixner
Folks!

After rereading the mail flood on CAT and staring into the SDM for a
while, I think we all should sit back and look at it from scratch
again w/o our preconceptions - I certainly had to put my own away.

Let's look at the properties of CAT again:

   - It's a per socket facility

   - CAT slots can be associated to external hardware. This
 association is per socket as well, so different sockets can have
 different behaviour. I missed that detail when staring the first
 time, thanks for the pointer!

   - The association itself is per cpu. The COS selection happens on a
 CPU while the set of masks which are selected via COS are shared
 by all CPUs on a socket.

There are restrictions which CAT imposes in terms of configurability:

   - The bits which select a cache partition need to be consecutive

   - The number of possible cache association masks is limited

Let's look at the configurations (CDP omitted and size restricted)

Default:   1 1 1 1 1 1 1 1
   1 1 1 1 1 1 1 1
   1 1 1 1 1 1 1 1
   1 1 1 1 1 1 1 1

Shared:1 1 1 1 1 1 1 1
   0 0 1 1 1 1 1 1
   0 0 0 0 1 1 1 1
   0 0 0 0 0 0 1 1

Isolated:  1 1 1 1 0 0 0 0
   0 0 0 0 1 1 0 0
   0 0 0 0 0 0 1 0
   0 0 0 0 0 0 0 1

Or any combination thereof. Surely some combinations will not make any
sense, but we really should not make any restrictions on the stupidity
of a sysadmin. The worst outcome might be L3 disabled for everything,
so what?

Now that gets even more convoluted if CDP comes into play and we
really need to look at CDP right now. We might end up with something
which looks like this:

   1 1 1 1 0 0 0 0  Code
   1 1 1 1 0 0 0 0  Data
   0 0 0 0 0 0 1 0  Code
   0 0 0 0 1 1 0 0  Data
   0 0 0 0 0 0 0 1  Code
   0 0 0 0 1 1 0 0  Data
or 
   0 0 0 0 0 0 0 1  Code
   0 0 0 0 1 1 0 0  Data
   0 0 0 0 0 0 0 1  Code
   0 0 0 0 0 1 1 0  Data

Let's look at partitioning itself. We have two options:

   1) Per task partitioning

   2) Per CPU partitioning

So far we only talked about #1, but I think that #2 has a value as
well. Let me give you a simple example.

Assume that you have isolated a CPU and run your important task on
it. You give that task a slice of cache. Now that task needs kernel
services which run in kernel threads on that CPU. We really don't want
to (and cannot) hunt down random kernel threads (think cpu bound
worker threads, softirq threads ) and give them another slice of
cache. What we really want is:

 1 1 1 1 0 0 0 0<- Default cache
 0 0 0 0 1 1 1 0<- Cache for important task
 0 0 0 0 0 0 0 1<- Cache for CPU of important task

It would even be sufficient for particular use cases to just associate
a piece of cache to a given CPU and do not bother with tasks at all.
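
Expressed as capacity bitmasks (taking bit 0 as the rightmost column
above; the hardware only requires the set bits of each mask to be
contiguous), those three masks boil down to something like:

#define CBM_DEFAULT   0xf0ULL   /* 1 1 1 1 0 0 0 0 */
#define CBM_IMPORTANT 0x0eULL   /* 0 0 0 0 1 1 1 0 */
#define CBM_CPU_LOCAL 0x01ULL   /* 0 0 0 0 0 0 0 1 */

/* Each mask would be programmed into one per-COSid L3 mask MSR
 * (0xc90 + n) on every socket; tasks/CPUs then select among them by
 * writing the COSid into IA32_PQR_ASSOC (0xc8f). */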

We really need to make this as configurable as possible from userspace
without imposing random restrictions to it. I played around with it on
my new intel toy and the restriction to 16 COS ids (that's 8 with CDP
enabled) makes it really useless if we force the ids to have the same
meaning on all sockets and restrict it to per task partitioning.

Even if next generation systems will have more COS ids available,
there are not going to be enough to have a system wide consistent
view unless we have COS ids > nr_cpus.

Aside of that I don't think that a system wide consistent view is
useful at all.

 - If a task migrates between sockets, it's going to suffer anyway.
   Real sensitive applications will simply pin tasks on a socket to
   avoid that in the first place. If we make the whole thing
   configurable enough then the sysadmin can set it up to support
   even the nonsensical case of identical cache partitions on all
   sockets and let tasks use the corresponding partitions when
   migrating.

 - The number of cache slices is going to be limited no matter what,
   so one still has to come up with a sensible partitioning scheme.

 - Even if we have enough cos ids the system wide view will not make
   the configuration problem any simpler as it remains per socket.

It's hard. Policies are hard by definition, but this one is harder
than most other policies due to the inherent limitations.

So now to the interface part. Unfortunately we need to expose this
very close to the hardware implementation as there are really no
abstractions which allow us to express the various bitmap
combinations. Any abstraction I tried to come up with renders that
thing completely useless.

I was not able to identify any existing infrastructure where this
really fits in. I chose a directory/file based representation. We
certainly could do the same with a syscall, but that's just an
implementation detail.

At top level:

   xxx/cat/max_cosids   <- Assume that all CPUs are the same
   xxx/cat/max_maskbits <- A