Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-18 Thread Thomas Gleixner
On Mon, 17 Oct 2016, Luck, Tony wrote:
> On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> > It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> > the above issues. And it would allow you to utilize the 256 groups in an
> > understandable way.
> 
> If you head down that path someone with a 4-socket system will try to
> make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
> a beating. The eight socket system with 16^8 = 4G groups defies any
> rational hope. Best not to think about 16 sockets.
> 
> The L2 + L3 configuration space gets unbelievably messy too.
> 
> There's a reason why I ripped out the allocation code and went with
> a simple global allocator in this version.  If we decide we need something
> fancier we can adapt later. Some solutions might be transparent to
> applications, others might add a "closid" file into each directory to
> give 2nd generation applications hooks to view (and maybe control)
> which closid is used by each group.

I'm not saying that we want something fancier. I fully agree with your
decision to make a simple global allocator.

I was just puzzled by the 16*16 comment and wondered what it was
about. Looking at Fenghua's explanation and the examples, there is nothing
which really looks like something we ever want. In fact the fancy CLOSID
matrix does not make much sense at all.

So I would rather like to see a comment clearly explaining why the chosen
allocator (grouping) gives us the most straightforward way to utilize the
hardware. It surely restricts the theoretical choices, but it limits them to
the subset which makes technical sense.

Thanks,

tglx






Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Fenghua Yu
On Mon, Oct 17, 2016 at 04:37:30PM -0700, Luck, Tony wrote:
> On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> > On Mon, 17 Oct 2016, Fenghua Yu wrote:
> > > part0: L3:0=1;1=1   closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> > > (closid 15 on cache0 combined with 16 different closids on cache1)
> > > ...
> > > part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> > > part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> > > 
> > > To utilize as many combinations as possible, we may implement a
> > > more complex allocation than the current one.
> > > 
> > > Does this make sense?
> > 
> > Thanks for the explanation. I knew that I'm missing something.
> > 
> > But how is that supposed to work? The schemata files have no idea of
> > closids simply because the closids are assigned automatically. And that
> > makes the whole thing exponentially complex. You must allow creating ALL
> > rdt groups (initially as a copy of the root group) and then, when the
> > schemata file is written, you have to look whether the particular CBM value
> > for a particular domain is already used and assign the same closid for this
> > domain. That of course makes the whole L2 business completely diffuse
> > because you might end up with:
> > 
> > Dom0 = COSID1 and DOM1 = COSID9
> > 
> > So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
> > Dom0 you must find a new COSID for Dom0. If there is none, then you must
> > reject the write and leave the admin puzzled.
> > 
> > There is a reason why I suggested:
> > 
> >  https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos
> > 
> > It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> > the above issues. And it would allow you to utilize the 256 groups in an
> > understandable way.
> 
> If you head down that path someone with a 4-socket system will try to
> make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
> a beating. The eight socket system with 16^8 = 4G groups defies any
> rational hope. Best not to think about 16 sockets.

16^(number of L3 caches) is the theoretical maximum number of partitions
that a sysadmin can create. Beyond that number, allocation returns no space.
It's similar to other cases, e.g. many mkdir calls in one directory can
eventually fail because the disk runs out of space.
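
For illustration only, a small user-space sketch (not code from the series)
of how that theoretical combination count grows with the number of L3 cache
domains, matching Tony's 65536 and 4G figures above:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	int n, i;

	/* With 16 CLOSIDs per L3 cache domain, the number of distinct
	 * per-domain CBM combinations is 16^n for n cache domains. */
	for (n = 1; n <= 8; n++) {
		uint64_t combos = 1;

		for (i = 0; i < n; i++)
			combos *= 16;
		printf("%d L3 cache domain(s): %llu theoretical partitions\n",
		       n, (unsigned long long)combos);
	}
	return 0;
}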

> 
> The L2 + L3 configuration space gets unbelievably messy too.
> 
> There's a reason why I ripped out the allocation code and went with
> a simple global allocator in this version.  If we decide we need something
> fancier we can adapt later. Some solutions might be transparent to
> applications, others might add a "closid" file into each directory to
> give 2nd generation applications hooks to view (and maybe control)
> which closid is used by each group.

Fully agree with Tony. We understand the complexity of the situation and
have chosen a simple, working solution for the first version.

Thanks.

-Fenghua


Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Luck, Tony
On Tue, Oct 18, 2016 at 01:20:36AM +0200, Thomas Gleixner wrote:
> On Mon, 17 Oct 2016, Fenghua Yu wrote:
> > part0: L3:0=1;1=1   closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> > (closid 15 on cache0 combined with 16 different closids on cache1)
> > ...
> > part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> > part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> > 
> > To utilize as many combinations as possible, we may implement a
> > more complex allocation than the current one.
> > 
> > Does this make sense?
> 
> Thanks for the explanation. I knew that I'm missing something.
> 
> But how is that supposed to work? The schemata files have no idea of
> closids simply because the closids are assigned automatically. And that
> makes the whole thing exponentially complex. You must allow creating ALL
> rdt groups (initially as a copy of the root group) and then, when the
> schemata file is written, you have to look whether the particular CBM value
> for a particular domain is already used and assign the same closid for this
> domain. That of course makes the whole L2 business completely diffuse
> because you might end up with:
> 
> Dom0 = COSID1   and DOM1 = COSID9
> 
> So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
> Dom0 you must find a new COSID for Dom0. If there is none, then you must
> reject the write and leave the admin puzzled.
> 
> There is a reason why I suggested:
> 
>  https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos
> 
> It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
> the above issues. And it would allow you to utilize the 256 groups in an
> understandable way.

If you head down that path someone with a 4-socket system will try to
make 16x16x16x16 = 65536 groups and "understandable" takes a bit of
a beating. The eight socket system with 16^8 = 4G groups defies any
rational hope. Best not to think about 16 sockets.

The L2 + L3 configuration space gets unbelievably messy too.

There's a reason why I ripped out the allocation code and went with
a simple global allocator in this version.  If we decide we need something
fancier we can adapt later. Some solutions might be transparent to
applications, others might add a "closid" file into each directory to
give 2nd generation applications hooks to view (and maybe control)
which closid is used by each group.
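
As an illustration of that idea, a second generation application might read
such a per-group file roughly as below. This is only a user-space sketch;
the "closid" file under each /sys/fs/resctrl group directory is hypothetical
here, not something this series provides:

#include <stdio.h>

/* Sketch: read the closid of one rdtgroup from a hypothetical
 * /sys/fs/resctrl/<group>/closid file. */
int read_group_closid(const char *group)
{
	char path[256];
	FILE *f;
	int closid = -1;

	snprintf(path, sizeof(path), "/sys/fs/resctrl/%s/closid", group);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%d", &closid) != 1)
		closid = -1;
	fclose(f);
	return closid;
}

int main(int argc, char **argv)
{
	int closid = read_group_closid(argc > 1 ? argv[1] : "p0");

	if (closid < 0)
		fprintf(stderr, "no closid file (hypothetical interface)\n");
	else
		printf("closid = %d\n", closid);
	return 0;
}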

-Tony


RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Thomas Gleixner
On Mon, 17 Oct 2016, Luck, Tony wrote:

> > How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
> > setting of CLOSID 1 around and do not reset it to default?
> 
> No. When CLOSID 9 arrives at the L2 h/w, it doesn't just take the bits it
> likes and discard the high bits to map to L2_CBM[1]. It just turns
> into the maximum allowed value for an L2 CBM.

So all CLOSIDs >= 8 will use all valid CBM bits for L2? That's even more
insane as this breaks any L2 partitioning which is set up by CLOSIDs < 8.

So effectively CLOSIDs >= 8 are useless in the L3=16 and L2=8 case.

Thanks,

tglx



Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Thomas Gleixner
On Mon, 17 Oct 2016, Fenghua Yu wrote:
> part0: L3:0=1;1=1   closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
> (closid 15 on cache0 combined with 16 different closids on cache1)
> ...
> part254: L3:0=ffff;1=7fff   closid15/cbm=ffff on cache0 and closid14/cbm=7fff on cache1
> part255: L3:0=ffff;1=ffff   closid15/cbm=ffff on cache0 and closid15/cbm=ffff on cache1
> 
> To utilize as many combinations as possible, we may implement a
> more complex allocation than the current one.
> 
> Does this make sense?

Thanks for the explanation. I knew that I was missing something.

But how is that supposed to work? The schemata files have no idea of
closids simply because the closids are assigned automatically. And that
makes the whole thing exponentially complex. You must allow creating ALL
rdt groups (initially as a copy of the root group) and then, when the
schemata file is written, you have to look whether the particular CBM value
for a particular domain is already used and assign the same closid for this
domain. That of course makes the whole L2 business completely diffuse
because you might end up with:

Dom0 = COSID1 and DOM1 = COSID9

So you can set the L2 for Dom0, but not for DOM1 and then if you set L2 for
Dom0 you must find a new COSID for Dom0. If there is none, then you must
reject the write and leave the admin puzzled.
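
A rough user-space model of that per-domain matching scheme, purely to
illustrate the complexity described above (a sketch of the idea, not code
from the series):

#include <stdio.h>

#define NUM_CLOSIDS 16

/* Per-domain table: which CBM each closid currently carries, or 0 if free. */
struct domain {
	unsigned int cbm[NUM_CLOSIDS];
};

/*
 * On a schemata write for one domain: reuse a closid whose CBM already
 * matches, otherwise grab a free one.  Returning -1 when the domain is full
 * is exactly the "reject the write and leave the admin puzzled" case.
 */
int find_or_alloc_closid(struct domain *d, unsigned int cbm)
{
	int i;

	for (i = 0; i < NUM_CLOSIDS; i++)
		if (d->cbm[i] == cbm)
			return i;		/* share the existing closid */
	for (i = 0; i < NUM_CLOSIDS; i++)
		if (d->cbm[i] == 0) {
			d->cbm[i] = cbm;	/* allocate a new closid */
			return i;
		}
	return -1;				/* no closid left in this domain */
}

int main(void)
{
	struct domain dom0 = { { 0 } };

	printf("0xf  -> closid %d\n", find_or_alloc_closid(&dom0, 0xf));
	printf("0xff -> closid %d\n", find_or_alloc_closid(&dom0, 0xff));
	printf("0xf  -> closid %d\n", find_or_alloc_closid(&dom0, 0xf));
	return 0;
}

Note that with such a scheme a group typically ends up with a different
closid on each domain (e.g. COSID1 on Dom0 and COSID9 on DOM1), which is
exactly the diffuse L2 case described above.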

There is a reason why I suggested:

 https://lkml.kernel.org/r/alpine.DEB.2.11.1511181534450.3761@nanos

It's certainly not perfect (missing L2 etc.), but clearly avoids exactly
the above issues. And it would allow you to utilize the 256 groups in an
understandable way.

Thanks,

tglx







RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Luck, Tony
> How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
> setting of CLOSID 1 around and do not reset it to default?

No. When CLOSID 9 arrives at the L2 h/w, it doesn't just take the bits it
likes and discard the high bits to map to L2_CBM[1]. It just turns
into the maximum allowed value for an L2 CBM.
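
In other words, software can picture the effective L2 mask roughly as below.
This is only a user-space sketch of the described behaviour; the 8-entry mask
table and the 8-bit CBM length are assumptions for the example:

#include <stdio.h>

#define L2_NUM_CLOSIDS	8
#define L2_CBM_LEN	8	/* assumed number of L2 CBM bits */

static unsigned int l2_cbm[L2_NUM_CLOSIDS] = {
	0xff, 0x0f, 0xf0, 0x03, 0xff, 0xff, 0xff, 0xff,
};

/* Sketch: a CLOSID beyond L2's COS_MAX does not wrap onto a lower mask
 * register; it behaves as if it had the maximum (all valid bits) L2 CBM. */
unsigned int effective_l2_cbm(int closid)
{
	if (closid < L2_NUM_CLOSIDS)
		return l2_cbm[closid];
	return (1u << L2_CBM_LEN) - 1;
}

int main(void)
{
	printf("CLOSID 1 -> L2 CBM 0x%x\n", effective_l2_cbm(1));
	printf("CLOSID 9 -> L2 CBM 0x%x\n", effective_l2_cbm(9));
	return 0;
}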

-Tony


RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Thomas Gleixner
On Mon, 17 Oct 2016, Luck, Tony wrote:
> > So how are we going to deal with that in the schematas? Assume the L3=16
> > and L2=8 case(no CDP). So effectively any write of L2 to CLOSID=0 will
> > affect the setting of L2 in CLOSID=8.
> >
> > Will the code tell the user that L2 cannot be set for CLOSID >= 8? 
> >
> > Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
> > schemata file of CLOSID >= 8?
> 
> The default and first 7 directories you make will have lines in schemata for
> both L3 and L2.  The next 8 will just have L3 (and won't let you add L2).
> 
> > And of course this becomes very interesting when CLOSID 1 is deleted, then
> > what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
> > reused again.
> 
> Deleting CLOSID 1 won't affect CLOSID 9.

How so? CLOSID 9 is using CLOSID 1 L2 settings. Are we just keeping the L2
setting of CLOSID 1 around and do not reset it to default?

>  When you make a new directory that gets allocated CLOSID 1, it will have
>  both L3 and L2 lines.

Now CLOSID 1 is reused and reset to all 1s by default, and even if not,
the operator will set a new L2 value, thereby wrecking CLOSID 9.

Thanks,

tglx


RE: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Luck, Tony
> So how are we going to deal with that in the schematas? Assume the L3=16
> and L2=8 case(no CDP). So effectively any write of L2 to CLOSID=0 will
> affect the setting of L2 in CLOSID=8.
>
> Will the code tell the user that L2 cannot be set for CLOSID >= 8? 
>
> Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
> schemata file of CLOSID >= 8?

The default and first 7 directories you make will have lines in schemata for
both L3 and L2.  The next 8 will just have L3 (and won't let you add L2).

> And of course this becomes very interesting when CLOSID 1 is deleted, then
> what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
> reused again.

Deleting CLOSID 1 won't affect CLOSID 9.  When you make a new directory
that gets allocated CLOSID 1, it will have both L3 and L2 lines.
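
A minimal model of that split, where the first 8 groups expose both L3 and
L2 lines and the rest only L3 (an illustrative user-space sketch, not the
series' schemata code):

#include <stdio.h>

struct resource {
	const char *name;
	int num_closid;
};

static struct resource resources[] = {
	{ "L3", 16 },
	{ "L2", 8 },
};

/* Sketch: print the schemata lines a group with this closid would expose.
 * Groups with closid >= L2's num_closid only get the L3 line. */
void print_schemata(int closid)
{
	unsigned int i;

	printf("group with closid %d:\n", closid);
	for (i = 0; i < sizeof(resources) / sizeof(resources[0]); i++) {
		if (closid < resources[i].num_closid)
			printf("  %s:0=...;1=...\n", resources[i].name);
	}
}

int main(void)
{
	print_schemata(3);	/* both L3 and L2 lines */
	print_schemata(9);	/* only the L3 line */
	return 0;
}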

-Tony


Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Thomas Gleixner
On Mon, 17 Oct 2016, Luck, Tony wrote:
> On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> > > + /* Compute rdt_max_closid across all resources */
> > > + rdt_max_closid = 0;
> > > + for_each_rdt_resource(r)
> > > + rdt_max_closid = max(rdt_max_closid, r->num_closid);
> > 
> > Oh no! This needs to be min().
> > 
> > Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
> > reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
> > half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
> > L3 simply because you do not have a seperation of L2 and L3 in
> > MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
> > undefined behaviour. See SDM:
> > 
> >  "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
> >   of the lower half of the COS space will cause undefined performance impact
> >   to code and data fetches due to MSR space re-indexing into code/data masks
> >   when CDP is enabled."
> 
> Bother.  The SDM also has this gem:
> 
> 17.17.5.1 Cache Allocation Technology Dynamic Configuration
> Both the CAT masks and CQM registers are accessible and modifiable at
> any time during execution using RDMSR/WRMSR unless otherwise noted. When
> writing to these MSRs a #GP(0) will be generated if any of the following
> conditions occur:
> 
>  * Writing a COS greater than the supported maximum (specified as the
>maximum value of CPUID.(EAX=10H, ECX=ResID):EDX[15:0] for all valid
>ResID values) is written to the IA32_PQR_ASSOC.CLOS field.
> 
> With the intent here being that if you have more of one resource than
> another, you can use all of the resources in the larger (with the
> resource with fewer mask registers defaulting to the maximum value
> when PQR_ASSOC.COS is too large [and I can't find the text that talks
> about that default behaviour :-( ]
>
> I think this all means that L3/CDP is "special". If CDP is on, we can't
> use the top half of the CLOSID space that CPUID.(EAX=10H, ECX=1):EDX[15:0]
> told us is present. So we can't exceed the half-way point.

That's my understanding.

> But in other cases like L3 with max=16 and L2 with max=8 we should allow
> 16 groups, but 0-7 allow control of L3 and L2, while 8-15 only allow L3
> control (the schemata code enforces this when you get to that part).

Cute. Welcome to configuration hell!

So how are we going to deal with that in the schematas? Assume the L3=16
and L2=8 case (no CDP). So effectively any write of L2 to CLOSID=0 will
affect the setting of L2 in CLOSID=8.

Will the code tell the user that L2 cannot be set for CLOSID >= 8? 

Will it print the setting of CLOSID - 8 for CLOSID >= 8 when you read the
schemata file of CLOSID >= 8?

And of course this becomes very interesting when CLOSID 1 is deleted, then
what happens to CLOSID 9? Not to talk about the case when CLOSID 1 is
reused again.

Seems to be a well thought out hardware feature once again.

> Perhaps we can encode this in another field in the rdt_resource structure
> that says that some maximums globally override all others, while some
> can be legitimately exceeded.

That should work.
 
> I'll have some words with the h/w architect, and get the SDM fixed in
> the next edition.

Whatever the outcome will be :)

Thanks,

tglx


Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Fenghua Yu
On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> On Fri, 14 Oct 2016, Fenghua Yu wrote:
> > +/*
> > + * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
> > + * we can keep a bitmap of free CLOSIDs in a single integer.
> > + *
> > + * Please note: This only supports global CLOSID across multiple
> > + * resources and multiple sockets. User can create rdtgroups including root
> > + * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
> > + * number of caches is big or number of supported resources sharing CLOSID
> > + * is growing, it's getting harder to find usable rdtgroups which is 
> > limited
> > + * by the small number of CLOSIDs.
> > + *
> > + * In the future, if it's necessary, we can implement more complex CLOSID
> > + * allocation per socket/per resource domain and utilize CLOSIDs as many
> > + * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
> > + * rdtgroups and each rdtgroup has different combination of two L3 CBMs.
> 
> I'm confused as usual, but a two socket broadwell has exactly two L3 cache
> domains and exactly 16 CLOSIDs per cache domain.
> 
> If you take CDP into account then the number of CLOSIDs is reduced to 8 per
> cache domains.
> 
> So we never can have more than nr(CLOSIDs) * nr(L3 cache domains) unique
> settings. So for a two socket broadwell its 32 for !CDP and 16 for CDP.
> 
> With the proposed user interface the number of unique rdtgroups is simply
> the number of CLOSIDs because we handle the cache domains already per
> resource, i.e. the meaning of CLOSID can be set independently per cache
> domain.
> 
> Can you please explain why you think that we can have 16x16 unique
> rdtgroups if we just have 16 resp. 8 CLOSIDs available?

To simplify, we only consider the CAT case. CDP has a similar situation.

For a two-socket Broadwell, we have the schemata format "L3:0=x;1=y",
where the two cache ids are 0 and 1, x is cache 0's cbm and y is cache 1's
cbm. The kernel then allocates one closid with cbm=x for cache 0 and one
closid with cbm=y for cache 1.

So we can have the following 16x16 different partitions/rdtgroups.
Each partition/rdtgroup has its own name and its own unique closid
combination on the two caches. If a task is assigned to any partition, the
task has its unique combination of closids when running on cache 0 and when
running on cache 1. The cbm values below are example values.

name    schemata        closids on cache 0 and 1 allocated by kernel
----    --------        ---------------------------------------------
(closid 0 on cache0 combined with 16 different closids on cache1)
part0: L3:0=1;1=1   closid0/cbm=1 on cache0 and closid0/cbm=1 on cache1
part1: L3:0=1;1=3   closid0/cbm=1 on cache0 and closid1/cbm=3 on cache1
part2: L3:0=1;1=7   closid0/cbm=1 on cache0 and closid2/cbm=7 on cache1
part3: L3:0=1;1=f   closid0/cbm=1 on cache0 and closid3/cbm=f on cache1
part4: L3:0=1;1=1f  closid0/cbm=1 on cache0 and closid4/cbm=1f on cache1
part5: L3:0=1;1=3f  closid0/cbm=1 on cache0 and closid5/cbm=3f on cache1
part6: L3:0=1;1=7f  closid0/cbm=1 on cache0 and closid6/cbm=7f on cache1
part7: L3:0=1;1=ff  closid0/cbm=1 on cache0 and closid7/cbm=ff on cache1
part8: L3:0=1;1=1ff closid0/cbm=1 on cache0 and closid8/cbm=1ff on cache1
part9: L3:0=1;1=3ff closid0/cbm=1 on cache0 and closid9/cbm=3ff on cache1
part10: L3:0=1;1=7ff    closid0/cbm=1 on cache0 and closid10/cbm=7ff on cache1
part11: L3:0=1;1=fff    closid0/cbm=1 on cache0 and closid11/cbm=fff on cache1
part12: L3:0=1;1=1fff   closid0/cbm=1 on cache0 and closid12/cbm=1fff on cache1
part13: L3:0=1;1=3fff   closid0/cbm=1 on cache0 and closid13/cbm=3fff on cache1
part14: L3:0=1;1=7fff   closid0/cbm=1 on cache0 and closid14/cbm=7fff on cache1
part15: L3:0=1;1=ffff   closid0/cbm=1 on cache0 and closid15/cbm=ffff on cache1
(closid 1 on cache0 combined with 16 different closids on cache1)
part16: L3:0=3;1=1  closid1/cbm=3 on cache0 and closid0/cbm=1 on cache1
part17: L3:0=3;1=3  closid1/cbm=3 on cache0 and closid1/cbm=3 on cache1
...
part31: L3:0=3;1=ffff   closid1/cbm=3 on cache0 and closid15/cbm=ffff on cache1
(closid 2 on cache0 combined with 16 different closids on cache1)
part32: L3:0=7;1=1  closid2/cbm=7 on cache0 and closid0/cbm=1 on cache1
part33: L3:0=7;1=3  closid2/cbm=7 on cache0 and closid1/cbm=3 on cache1
...
part47: L3:0=7;1=ffff   closid2/cbm=7 on cache0 and closid15/cbm=ffff on cache1
(closid 3 on cache0 combined with 16 different closids on cache1)
...
(closid 4 on cache0 combined with 16 different closids on cache1)
...
(closid 5 on cache0 combined with 16 different closids on cache1)
...
(closid 6 on cache0 combined with 16 different closids on cache1)
...
(closid 7 on cache0 combined with 16 different closids on cache1)
...
(closid 8 on cache0 combined with 16 different closids on cache1)
...
(closid 9 on cache0 combined with 16 different closids on cache1)
...
(closid 10 

Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Luck, Tony
On Mon, Oct 17, 2016 at 11:14:55PM +0200, Thomas Gleixner wrote:
> > +   /* Compute rdt_max_closid across all resources */
> > +   rdt_max_closid = 0;
> > +   for_each_rdt_resource(r)
> > +   rdt_max_closid = max(rdt_max_closid, r->num_closid);
> 
> Oh no! This needs to be min().
> 
> Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
> reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
> half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
> L3 simply because you do not have a seperation of L2 and L3 in
> MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
> undefined behaviour. See SDM:
> 
>  "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
>   of the lower half of the COS space will cause undefined performance impact
>   to code and data fetches due to MSR space re-indexing into code/data masks
>   when CDP is enabled."

Bother.  The SDM also has this gem:

17.17.5.1 Cache Allocation Technology Dynamic Configuration
Both the CAT masks and CQM registers are accessible and modifiable at
any time during execution using RDMSR/WRMSR unless otherwise noted. When
writing to these MSRs a #GP(0) will be generated if any of the following
conditions occur:

 * Writing a COS greater than the supported maximum (specified as the
   maximum value of CPUID.(EAX=10H, ECX=ResID):EDX[15:0] for all valid
   ResID values) is written to the IA32_PQR_ASSOC.CLOS field.

With the intent here being that if you have more of one resource than
another, you can use all of the resources in the larger (with the
resource with fewer mask registers defaulting to the maximum value
when PQR_ASSOC.COS is too large [and I can't find the text that talks
about that default behaviour :-( ]).

I think this all means that L3/CDP is "special". If CDP is on, we can't
use the top half of the CLOSID space that CPUID.(EAX=10H, ECX=1):EDX[15:0]
told us is present. So we can't exceed the half-way point. But in other
cases like L3 with max=16 and L2 with max=8 we should allow 16 groups,
but 0-7 allow control of L3 and L2, while 8-15 only allow L3 control (the
schemata code enforces this when you get to that part).

Perhaps we can encode this in another field in the rdt_resource structure
that says that some maximums globally override all others, while some
can be legitimately exceeded.
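
One way to model that idea (a speculative user-space sketch; the caps_global
flag and the field names are invented for the example): resources whose limit
is a hard global cap, like L3 with CDP, clamp the total, while the others
merely stop exposing their schemata line beyond their own maximum:

#include <stdio.h>

struct rdt_res {
	const char *name;
	int num_closid;
	int caps_global;	/* hypothetical: this maximum caps all resources */
};

/* Sketch: take the largest per-resource maximum, then clamp it by every
 * resource whose limit is a hard global cap (e.g. L3 when CDP halves it). */
int compute_max_closid(struct rdt_res *r, int n)
{
	int i, max = 0;

	for (i = 0; i < n; i++)
		if (r[i].num_closid > max)
			max = r[i].num_closid;
	for (i = 0; i < n; i++)
		if (r[i].caps_global && r[i].num_closid < max)
			max = r[i].num_closid;
	return max;
}

int main(void)
{
	struct rdt_res cdp_case[] = { { "L3(CDP)", 8, 1 }, { "L2", 16, 0 } };
	struct rdt_res cat_case[] = { { "L3", 16, 1 }, { "L2", 8, 0 } };

	printf("CDP case: %d usable CLOSIDs\n", compute_max_closid(cdp_case, 2));
	printf("CAT case: %d group slots\n", compute_max_closid(cat_case, 2));
	return 0;
}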

I'll have some words with the h/w architect, and get the SDM fixed in
the next edition.

-Tony


Re: [PATCH v4 13/18] x86/intel_rdt: Add mkdir to resctrl file system

2016-10-17 Thread Thomas Gleixner
On Fri, 14 Oct 2016, Fenghua Yu wrote:
> +/*
> + * Trivial allocator for CLOSIDs. Since h/w only supports a small number,
> + * we can keep a bitmap of free CLOSIDs in a single integer.
> + *
> + * Please note: This only supports global CLOSID across multiple
> + * resources and multiple sockets. User can create rdtgroups including root
> + * rdtgroup up to the number of CLOSIDs, which is 16 on Broadwell. When
> + * number of caches is big or number of supported resources sharing CLOSID
> + * is growing, it's getting harder to find usable rdtgroups which is limited
> + * by the small number of CLOSIDs.
> + *
> + * In the future, if it's necessary, we can implement more complex CLOSID
> + * allocation per socket/per resource domain and utilize CLOSIDs as many
> + * as possible. E.g. on 2-socket Broadwell, user can create upto 16x16=256
> + * rdtgroups and each rdtgroup has different combination of two L3 CBMs.

I'm confused as usual, but a two socket broadwell has exactly two L3 cache
domains and exactly 16 CLOSIDs per cache domain.

If you take CDP into account then the number of CLOSIDs is reduced to 8 per
cache domains.

So we never can have more than nr(CLOSIDs) * nr(L3 cache domains) unique
settings. So for a two socket broadwell its 32 for !CDP and 16 for CDP.

With the proposed user interface the number of unique rdtgroups is simply
the number of CLOSIDs because we handle the cache domains already per
resource, i.e. the meaning of CLOSID can be set independently per cache
domain.

Can you please explain why you think that we can have 16x16 unique
rdtgroups if we just have 16 resp. 8 CLOSIDs available?

> +static int closid_free_map;
> +
> +static void closid_init(void)
> +{
> + struct rdt_resource *r = &rdt_resources_all[RDT_RESOURCE_L3];
> + int rdt_max_closid;
> +
> + /* Enabling L3 CDP halves the number of CLOSIDs */
> + if (r->cdp_enabled)
> + r->num_closid = r->max_closid / 2;
> + else
> + r->num_closid = r->max_closid;

If you do the L3Data/L3Core thingy then this can go away.

> + /* Compute rdt_max_closid across all resources */
> + rdt_max_closid = 0;
> + for_each_rdt_resource(r)
> + rdt_max_closid = max(rdt_max_closid, r->num_closid);

Oh no! This needs to be min().

Assume you have a system with L3 and L2 CAT. L2 reports COS_MAX=16, L3
reports COS_MAX=16 as well. Then you enabled CDP which cuts L3 COS_MAX in
half. So the real usable number of CLOSIDs is going to be 8 for both L2 and
L3 simply because you do not have a seperation of L2 and L3 in
MSR_PQR_ASSOC. And if you allow 16 then any CLOSID > 8 will result in
undefined behaviour. See SDM:

 "When CDP is enabled, specifying a COS value in IA32_PQR_ASSOC.COS outside
  of the lower half of the COS space will cause undefined performance impact
  to code and data fetches due to MSR space re-indexing into code/data masks
  when CDP is enabled."

> + if (rdt_max_closid > 32) {
> + pr_warn_once("Only using 32/%d CLOSIDs\n", rdt_max_closid);

You have a fancy for pr_*_once(). How often is this going to be executed?
Once per mount and we can really print it each time. mount is hardly a fast
path operation.

Also please spell out: "Using only 32 of %d CLOSIDs"
because "Using only 32/64 COSIDs" does not make any sense.

> + rdt_max_closid = 32;
> + }
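
For reference, the kind of single-integer free-map allocator the quoted
comment describes could look roughly like the user-space sketch below; it
also shows why anything above 32 CLOSIDs has to be clamped when the map is
kept in one 32-bit integer. Reserving closid 0 for the root group is an
assumption of the sketch:

#include <stdio.h>

/* Sketch of a trivial CLOSID allocator keeping the free map in a single
 * 32-bit integer. */
static unsigned int closid_free_map;

static void closid_init(int num_closid)
{
	closid_free_map = (num_closid < 32) ? (1u << num_closid) - 1 : ~0u;
	closid_free_map &= ~1u;	/* closid 0 assumed reserved for the root group */
}

static int closid_alloc(void)
{
	int closid;

	if (!closid_free_map)
		return -1;
	closid = __builtin_ffs(closid_free_map) - 1;	/* lowest free bit */
	closid_free_map &= ~(1u << closid);
	return closid;
}

static void closid_free(int closid)
{
	closid_free_map |= 1u << closid;
}

int main(void)
{
	closid_init(16);
	printf("first:  %d\n", closid_alloc());	/* 1 */
	printf("second: %d\n", closid_alloc());	/* 2 */
	closid_free(1);
	printf("reused: %d\n", closid_alloc());	/* 1 again */
	return 0;
}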

> +static void rdt_reset_pqr_assoc_closid(void *v)
> +{
> + struct intel_pqr_state *state = this_cpu_ptr(&pqr_state);
> +
> + state->closid = 0;
> + wrmsr(MSR_IA32_PQR_ASSOC, state->rmid, 0);

So this is called with preemption disabled, but is thread context the only
context in which pqr_state can be modified? If yes, please add a comment
explaining this clearly, otherwise the protection is not sufficient. I'm
too lazy to stare into the monitoring code ...

> +/*
> + * Forcibly remove all of subdirectories under root.
> + */
> +static void rmdir_all_sub(void)
> +{
> + struct rdtgroup *rdtgrp;
> + struct list_head *l, *next;
> +
> + get_cpu();
> + /* Reset PQR_ASSOC MSR on this cpu. */
> + rdt_reset_pqr_assoc_closid(NULL);
> + /* Reset PQR_ASSOC MSR on the rest of cpus. */
> + smp_call_function_many(cpu_online_mask, rdt_reset_pqr_assoc_closid,
> +NULL, 1);

So you have reset the CLOSIDs to 0, but what resets L3/2_QOS_MASK_0 to the
default value (all valid bits set)? Further what clears CDP? 
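
The default in question is simply the all-valid-bits mask; a user-space model
of such a reset, with an array standing in for the per-CLOSID CBM mask
registers and an assumed 20-bit CBM length (illustrative only):

#include <stdio.h>

#define NUM_CLOSIDS	16
#define CBM_LEN		20	/* assumed number of CBM bits for this cache */

static unsigned int cbm_msr[NUM_CLOSIDS];	/* stands in for the mask MSRs */

/* Sketch: on unmount, every CLOSID's mask would have to go back to the
 * default "all valid bits set" value. */
static void reset_all_cbms(void)
{
	unsigned int dflt = (1u << CBM_LEN) - 1;
	int i;

	for (i = 0; i < NUM_CLOSIDS; i++)
		cbm_msr[i] = dflt;
}

int main(void)
{
	cbm_msr[3] = 0xf;	/* pretend a group restricted CLOSID 3 */
	reset_all_cbms();
	printf("CLOSID 3 mask after reset: 0x%x\n", cbm_msr[3]);
	return 0;
}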

> +static int rdtgroup_mkdir(struct kernfs_node *parent_kn, const char *name,
> +   umode_t mode)
> +{

> + /* allocate the rdtgroup. */
> + rdtgrp = kzalloc(sizeof(*rdtgrp), GFP_KERNEL);
> + if (!rdtgrp) {
> + ret = -ENOSPC;
> + goto out_closid_free;
> + }
> + rdtgrp->closid = closid;
> + list_add(&rdtgrp->rdtgroup_list, &rdt_all_groups);
> +
> + /* kernfs creates the directory for rdtgrp */
> + kn = kernfs_create_dir(parent->kn, name, mode, rdtgrp);
> + if (IS_ERR(kn)) {
> + 
