RE: [PATCH v4 1/4] x86/cpufeatures: Enumerate #DB for bus lock detection

2021-01-27 Thread Yu, Fenghua
Hi, Thomas,

> On Wed, Jan 27, 2021 2:16 PM, Thomas Gleixner wrote:
> On Tue, Nov 24 2020 at 20:52, Fenghua Yu wrote:
> 
> > A bus lock is acquired through either split locked access to writeback
> > (WB) memory or any locked access to non-WB memory. This is typically
> > >1000 cycles slower than an atomic operation within a cache line. It
> > also disrupts performance on other cores.
> >
> > Some CPUs have ability to notify the kernel by an #DB trap after a
> > user instruction acquires a bus lock and is executed. This allows the
> > kernel to enforce user application throttling or mitigations.
> 
> That's nice, but how does that interact with a data breakpoint on the same
> location?

If both data breakpoint and bus lock happen on the same location, the bus lock
is handled first and then the data breakpoint is handled in the same exception:

1. If warn on bus lock, a rate-limited warning is printed for the bus lock
and then a SIGTRAP is sent to the user process.
2. If fatal on bus lock, a SIGBUS is sent to the user process for the bus
lock and a SIGTRAP is also sent to the user process. I think the SIGBUS will
be delivered to the process first and then the SIGTRAP will be delivered.
3. If ratelimit on bus lock, the task first sleeps for the specified time,
then SIGTRAP is sent to the user process.

Is the interaction OK?
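
[Side note, purely for illustration and not from this thread: a bus lock is
easy to reproduce from user space. The stand-alone program below does a
locked read-modify-write on an int that straddles a 64-byte cache-line
boundary, i.e. a split lock, which BLD reports via #DB.]

	/* split_lock_demo.c - hypothetical demo program */
	#include <stdatomic.h>
	#include <stdio.h>

	static char buf[128] __attribute__((aligned(64)));

	int main(void)
	{
		/* A 4-byte value at buf+62 crosses the line boundary at buf+64. */
		_Atomic int *p = (_Atomic int *)(buf + 62);

		atomic_fetch_add(p, 1);	/* LOCK-prefixed RMW spanning two cache lines */
		printf("value: %d\n", atomic_load(p));
		return 0;
	}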

> 
> Also the information you pointed to in the cover letter
> 
> >  [1] Intel Instruction Set Extension Chapter 8:
> > https://software.intel.com/sites/default/files/managed/c5/15/architect
> > ure-instruction-set-extensions-programming-reference.pdf
> 
> does not contain anything which is even remotely related to this patch series.
> That chapter describes another bit in TEST_CTRL_MSR ...

I think either I gave an old link or the content at the link was changed to
an older version of the ISE doc after this series was released.

Here is the new ISE doc and the bus lock exception is described in Chapter 9.
https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf

I'll update the link in the next version.

Thanks!

-Fenghua


RE: [PATCH v4 00/17] Miscellaneous fixes for resctrl selftests

2020-12-10 Thread Yu, Fenghua
Hi, Shuah,

> This patch set has several miscellaneous fixes to the resctrl selftest tool
> that are easily visible to the user. V1 had fixes to the CAT test and CMT
> test, but they were dropped in V2 because having them here made the patchset
> humongous. So changes to the CAT test and CMT test will be posted in another
> patchset.
> 
> Change Log:
> v4:
> - Address various comments from Shuah Khan:
>   1. Combine a few patches, e.g. a couple of typo-fixing patches into one
>      and a couple of unmounting patches into one, etc.
>   2. Add config file.
>   3. Remove "Fixes" tags.
>   4. Change strcmp() to strncmp().
>   5. Move the global variable fixing patch to patch 1 so that the
>      compilation issue is fixed first.

Any comment on this series?

Thank you very much for your review!

-Fenghua


RE: [PATCH v4 0/4] x86/bus_lock: Enable bus lock detection

2020-12-02 Thread Yu, Fenghua
Dear maintainers,

> A bus lock [1] is acquired through either split locked access to writeback
> (WB) memory or any locked access to non-WB memory. This is typically >1000
> cycles slower than an atomic operation within a cache line. It also disrupts
> performance on other cores.

...

> Change Log:
> v4:
> - Fix a ratelimit wording issue in the doc (Randy).
> - Patch 4 is acked by Randy (Randy).

Friendly reminder about this series...

Thank you very much in advance for your time!

-Fenghua


RE: [PATCH v3 4/4] Documentation/admin-guide: Change doc for split_lock_detect parameter

2020-11-20 Thread Yu, Fenghua
Hi, Randy,

> >>> +   for bus lock detection. 0 < N <= HZ/2 and
> >>> +   N is approximate. Only applied to non-root
> >>> +   users.
> >>
> >> Sorry, but I don't know what this means. I think it's the "and N is
> >> approximate" that is confusing me.
> >>
> >>	0 < N <= HZ/2 and N is approximate.
> >
> > You are right. I will remove "and N is approximate" in the next version.
> >
> > Could you please ack this patch? Can I add Acked-by from you in the
> > updated patch?
> >
> > Thank you very much for your review!
> 
> Sure, no problem.
> 
> Acked-by: Randy Dunlap 

Really appreciate your review!

-Fenghua


RE: [PATCH v3 4/4] Documentation/admin-guide: Change doc for split_lock_detect parameter

2020-11-20 Thread Yu, Fenghua
Hi, Randy,

> > +   ratelimit:N -
> > + Set rate limit to N bus locks per second
> > + for bus lock detection. 0 < N <= HZ/2 and
> > + N is approximate. Only applied to non-root
> > + users.
> 
> Sorry, but I don't know what this means. I think it's the "and N is
> approximate" that is confusing me.
> 
>	0 < N <= HZ/2 and N is approximate.

You are right. I will remove "and N is approximate" in the next version.

Could you please ack this patch? Can I add Acked-by from you in the
updated patch?

Thank you very much for your review!

-Fenghua


RE: [PATCH v2 2/4] x86/bus_lock: Handle warn and fatal in #DB for bus lock

2020-11-19 Thread Yu, Fenghua
Hi, Peter,

> > #DB for bus lock is enabled by bus lock detection bit 2 in DEBUGCTL
> > MSR while #AC for split lock is enabled by split lock detection bit 29
> > in TEST_CTRL MSR.
> >
> > Delivery of #DB for bus lock in userspace clears DR6[11]. To avoid
> > confusion in identifying #DB, #DB handler sets the bit to 1 before
> > returning to the interrupted task.
> >
> > Use the existing kernel command line option "split_lock_detect=" to
> > handle #DB for bus lock:
> >
> > split_lock_detect=
> >			#AC for split lock	#DB for bus lock
> >
> > off		Do nothing		Do nothing
> >
> > warn		Kernel OOPs		Warn once per task and
> >		Warn once per task and	continue to run.
> >		disable future checking	When both features are
> >					supported, warn in #DB.
> >
> > fatal	Kernel OOPs		Send SIGBUS to user.
> >		Send SIGBUS to user	When both features are
> >					supported, fatal in #AC.
> >
> > Default option is "warn".
> >
> > Hardware only generates #DB for bus lock detect when CPL>0 to avoid
> > nested #DB from multiple bus locks while the first #DB is being handled.
> > So no need to handle #DB for bus lock detected in the kernel.
> >
> > Signed-off-by: Fenghua Yu 
> > Reviewed-by: Tony Luck 
> 
> Sane enough I suppose,
> 
> Acked-by: Peter Zijlstra (Intel) 

Thank you very much for your review!

Can I put your Acked-by tag in all 4 patches?

> 
> The one thing I found still missing is a better description of the things
> tickling SLD vs BLD. IIRC BLD detects a wider range of issues.
> Therefore it _might_ make sense to allow SLD && BLD when fatal, instead of
> only SLD.
> 
> Still, that's nitpicking.

You are right. BLD can detect both split lock and UC memory access, while
SLD only detects split lock. Enabling both SLD and BLD when fatal can
capture both split lock in #AC and UC access in #DB.

I will send a follow-up series to enable both SLD and BLD when fatal.
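
[A rough sketch of what that follow-up could look like - illustrative only;
the enable_*() helpers are hypothetical, the X86_FEATURE_* flags are real:]

	/* Assumption: in fatal mode, enable both detectors so split locks
	 * still die early in #AC while UC-memory bus locks are caught in #DB.
	 */
	static void __init fatal_enable_both(void)
	{
		if (boot_cpu_has(X86_FEATURE_SPLIT_LOCK_DETECT))
			enable_split_lock_detect();	/* #AC, before execution */
		if (boot_cpu_has(X86_FEATURE_BUS_LOCK_DETECT))
			enable_bus_lock_detect();	/* #DB, after execution */
	}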

-Fenghua


RE: [PATCH 2/4] x86/bus_lock: Handle warn and fatal in #DB for bus lock

2020-11-10 Thread Yu, Fenghua
Hi, Peter,

> On Sun, Nov 08, 2020 at 04:29:16AM +, Fenghua Yu wrote:
> > split_lock_detect=
> >			#AC for split lock	#DB for bus lock
> >
> > off		Do nothing		Do nothing
> >
> > warn		Kernel OOPs		Warn once per task and
> >		Warn once per task and	continue to run.
> >		disable future checking	When both features are
> >					supported, warn in #DB.
> >
> > fatal	Kernel OOPs		Send SIGBUS to user.
> >		Send SIGBUS to user	When both features are
> >					supported, fatal in #AC.
> 
> > +void handle_bus_lock(struct pt_regs *regs)
> > +{
> > +	if (!bld)
> > +		return;
> > +
> > +	pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n",
> > +			    current->comm, current->pid, regs->ip);
> > +}
> 
> So the Changelog above, and the state_show() below, seem to suggest there
> should be SIGBUS code in #DB, but I'm having trouble spotting it.

You are right. The SIGBUS is missing here. Somehow my tests didn't capture
the issue.

I will add:
+	force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
to send SIGBUS in the fatal case for #DB bus lock.
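
[For context, a sketch of the handler with the fatal and ratelimit cases
folded in - an illustration assuming an sld_state-style mode variable and a
bld_ratelimit ratelimit_state, not the exact posted patch:]

	void handle_bus_lock(struct pt_regs *regs)
	{
		switch (sld_state) {
		case sld_off:
			break;
		case sld_ratelimit:
			/* Enforce no more than N bus locks/sec (ratelimit:N) */
			while (!__ratelimit(&bld_ratelimit))
				msleep(20);
			fallthrough;
		case sld_warn:
			pr_warn_ratelimited("#DB: %s/%d took a bus_lock trap at address: 0x%lx\n",
					    current->comm, current->pid, regs->ip);
			break;
		case sld_fatal:
			force_sig_fault(SIGBUS, BUS_ADRALN, NULL);
			break;
		}
	}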

Thank you very much for your review!

-Fenghua



RE: [PATCH RFC] x86/bus_lock: Enable bus lock detection

2020-07-29 Thread Yu, Fenghua
Hi, Sean,

> > A bus lock [1] is acquired either through split locked access to
> > writeback (WB) memory or by using locks to uncacheable (UC) memory
> > (e.g. direct device
> 
> Does SLD not detect the lock to UC memory?

The statement might not be accurate. Split Lock Detection doesn't detect bus
lock to non-WB memory (including UC memory).

> 
> > assignment). This is typically >1000 cycles slower than an atomic
> > operation within a cache line. It also disrupts performance on other cores.
> >
> > Although split lock can be detected by #AC trap, the trap is triggered
> > before the instruction acquires bus lock. This makes it difficult to
> > mitigate bus lock (e.g. throttle the user application).
> 
> Mitigate _in a non-fatal way_.  The #AC makes it very easy to mitigate split
> locks, it just has the side effect of SIGBUS or killing the KVM guest.

Adding "in a non-fatal way" is more better. Will fix it.

> 
> > Some CPUs have ability to notify the kernel by an #DB trap after the
> > instruction acquires a bus lock and is executed. This allows the
> > kernel to enforce user application throttling or mitigations and also
> > provides a better environment to debug kernel split lock issues since
> > the kernel can continue instead of crashing.
> >
> > #DB for bus lock detect fixes all issues in #AC for split lock detect:
> 
> Fixes "all" issues... and creates some new ones, e.g. there are use cases
> where preventing the split lock from happening in the first place is strongly
> desired.  It's why that train wreck exists.

Bus Lock Detection doesn't replace Split Lock Detection. If both features
are enabled, the default "warn" behavior comes from bus lock and the "fatal"
behavior is still from split lock. See the behavior table below.

> 
> > 1) It's architectural ... just need to look at one CPUID bit to know it
> >exists
> > 2) The IA32_DEBUGCTL MSR, which reports bus lock in #DB, is per-thread.
> >So each process or guest can have different behavior.
> > 3) It has support for VMM/guests (new VMEXIT codes, etc).
> >
> > Use the existing kernel command line option "split_lock_detect=" to
> > handle #DB for bus lock:
> 
> Are SLD and BLD mutually exclusive?  Can we even guarantee that given the
> track record of SLD?  If not, we'll likely want to allow the user to choose
> between SDL and BLD via split_lock_detect.

The two hardware features can be enabled on the same platform.
But they are mutually exclusive in the kernel because #AC from SLD happens
before the instruction is executed and #DB happens after the instruction is
executed.

Right now, if both of them are enabled, "warn" behavior goes to
bus lock and "fatal" behavior goes to split lock.

Do you want the user to override the behaviors by something like this?

split_lock_detect=warn[,sld]: if given "sld" while both features are enabled,
warn behavior is from split lock instead of bus lock detection.

split_lock_detect=fatal[,bld]: if given "bld" while both features are enabled,
fatal behavior is from bus lock detection.

> 
> > split_lock_detect=
> >			#AC for split lock	#DB for bus lock
> >
> > off		Do nothing		Do nothing
> >
> > warn		Kernel OOPs		Kernel warns rate limited
> >		Warn once per task and	and continues to run.
> >		disable future checking	Warn once per task and
> >					continue to run.
> >					When both features are
> >					supported, warn in #DB.
> >
> > fatal	Kernel OOPs		Kernel warns rate limited
> 
> Unless the lock to UC #DB is new behavior, why would we revert to allowing
> split locks in the kernel?

SLD cannot detect lock to non-WB memory (including UC). BLD can detect bus
locks from both split lock and lock to non-WB memory.

> 
> >		Send SIGBUS to user	Send SIGBUS to user.
> >					When both features are
> >					supported, fatal in #AC.
> >
> > ratelimit:N	Do nothing		Kernel warns rate limited
> 
> This should be more than "Do nothing" for #AC, e.g. fall back to warn or at
> least print a loud error.

Ok. Will fall back to warn.

> 
> >					and continue to run.
> >					Limit bus lock rate to
> >					N per second in the
> >					current non-root user.
> >
> > On systems that support #DB for bus lock detection the default is "warn".
> >
> > [1] Chapter 8
> > https://software.intel.com/sites/default/files/managed/c5/15/architect
> > ure-instruction-set-extensions-programming-reference.pdf
> >
> > Signed-off-by: Fenghua Yu 
> > Reviewed-by: Tony Luck 
> > ---
> >  .../admin-guide/kernel-parameters.txt |  48 +-
> >  

RE: [RFC PATCH] x86/cpufeatures: Enumerate new AVX512 bfloat16 instructions

2019-06-11 Thread Yu, Fenghua
> On Tuesday, June 11, 2019 3:28 PM, Fenghua Yu wrote:
> On Tue, Jun 11, 2019 at 09:47:02PM +0200, Borislav Petkov wrote:
> > On Tue, Jun 11, 2019 at 11:19:20AM -0700, Fenghua Yu wrote:
> > > So can I re-organize word 11 and 12 as follows?
> > >
> > > 1. Change word 11 to host scattered features.
> > > 2. Move the previos features in word 11 and word 12 to word 11:
> > > /*
> > >  * Extended auxiliary flags: Linux defined - For features scattered in
> > >  * various CPUID levels and sub-leaves like CPUID level 7 and sub-leaf 1,
> > >  * etc, word 11.
> > >  */
> > > #define X86_FEATURE_CQM_LLC		(11*32+ 0) /* LLC QoS if 1 */
> > > #define X86_FEATURE_CQM_OCCUP_LLC	(11*32+ 1) /* LLC occupancy monitoring */
> > > #define X86_FEATURE_CQM_MBM_TOTAL	(11*32+ 2) /* LLC Total MBM monitoring */
> > > #define X86_FEATURE_CQM_MBM_LOCAL	(11*32+ 3) /* LLC Local MBM monitoring */
> >
> > Yap.
> >
> > > 3. Change word 12 to host CPUID.(EAX=7,ECX=1):EAX:
> > > /* Intel-defined CPU features, CPUID level 0x7:1 (EAX), word 12 */
> > > #define X86_FEATURE_AVX512_BF16	(12*32+ 0) /* BFLOAT16 instructions */
> >
> > This needs to be (12*32+ 5) if word 12 is going to map leaf
> > CPUID.(EAX=7,ECX=1):EAX.
> >
> > At least judging from the arch extensions doc which lists EAX as:
> >
> > Bits 04-00: Reserved.
> > Bit 05: AVX512_BF16. Vector Neural Network Instructions supporting
> > BFLOAT16 inputs and conversion instructions from IEEE single precision.
> > Bits 31-06: Reserved.
> 
> Yes, you are absolutely right. I'll define it as (12*32+ 5).
> 
> >
> > > 4. Do other necessary changes to match the new word 11 and word 12.
> >
> > But split in two patches: first does steps 1+2, second patch adds the
> > new leaf to word 12.
> 
> There are two variables defined in cpuinfo_x86: x86_cache_max_rmid and
> x86_cache_occ_scale. c->x86_cache_max_rmid is read from CPUID.0xf.1:ECX
> and c->x86_cache_occ_scale is read from CPUID.0xf.1:EBX.
> 
> After getting X86_FEATURE_CQM_* from scattered, the two variables need
> to be read from CPUID again. So the code reading the two variables needs
> to be moved from before init_scattered_cpuid_features(c) to after the
> function. This makes the get_cpu_cap() code awkward.
> 
> And the two variables are ONLY used in resctrl monitoring configuration.
> There is no need to store them in cpuinfo_x86 on each CPU.
> 
> I'm thinking to simplify and clean this part of code:
> 
> 1. In patch #1:
>    - remove the definitions of x86_cache_max_rmid and x86_cache_occ_scale
>      from cpuinfo_x86
>    - remove assignment of c->x86_cache_max_rmid and c->x86_cache_occ_scale
>      from get_cpu_cap(c)
>    - get r->mon_scale and r->num_rmid in rdt_get_mon_l3_config(r) directly
>      from CPUID.0xf.1:EBX and CPUID.0xf.1:ECX.
> 2. In patch #2: do steps 1+2 to recycle word 11. After patch #1, I can
>    totally remove the code to get c->x86_cache_max_rmid and
>    c->x86_cache_occ_scale in get_cpu_cap(c). And patch #2 is cleaner.
> 3. In patch #3: add new word 12 to host CPUID.7.1:EAX
> 3. In patch #3: add new word 12 to host CPUID.7.1:EAX
> 
> Do you think the patch #1 is necessary and this is a right patch set?

My bad. I studied a bit more and found that patch #1 is not needed. Please
ignore my last email.
 
Thanks.
 
-Fenghua


RE: [PATCH v3 3/5] x86/umwait: Add sysfs interface to control umwait C0.2 state

2019-05-30 Thread Yu, Fenghua
> On Thursday, May 30, 2019 2:11 PM Andy Lutomirski [mailto:l...@kernel.org] 
> wrote:
> On Fri, May 24, 2019 at 5:05 PM Fenghua Yu  wrote:
> >
> > C0.2 state in umwait and tpause instructions can be enabled or
> > disabled on a processor through IA32_UMWAIT_CONTROL MSR register.
> >
> > By default, C0.2 is enabled and the user wait instructions result in
> > lower power consumption with slower wakeup time.
> >
> > But in real time systems which require faster wakeup time although
> > power savings could be smaller, the administrator needs to disable
> > C0.2 and all C0.2 requests from user applications revert to C0.1.
> >
> > A sysfs interface "/sys/devices/system/cpu/umwait_control/enable_c0_2"
> > is created to allow the administrator to control C0.2 state during run time.
> >
> > Signed-off-by: Fenghua Yu 
> > Reviewed-by: Ashok Raj 
> > Reviewed-by: Tony Luck 
> > ---
> >  arch/x86/power/umwait.c | 75 ++---
> >  1 file changed, 71 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> > index 80cc53a9c2d0..cf5de7e1cc24 100644
> > --- a/arch/x86/power/umwait.c
> > +++ b/arch/x86/power/umwait.c
> > @@ -7,6 +7,7 @@
> >  static bool umwait_c0_2_enabled = true;
> >  /* Umwait max time is in TSC-quanta. Bits[1:0] are zero. */
> >  static u32 umwait_max_time = 10;
> > +static DEFINE_MUTEX(umwait_lock);
> >
> >  /* Return value that will be used to set IA32_UMWAIT_CONTROL MSR */
> >  static u32 umwait_compute_msr_value(void)
> > @@ -22,7 +23,7 @@ static u32 umwait_compute_msr_value(void)
> >  	   (umwait_max_time & MSR_IA32_UMWAIT_CONTROL_MAX_TIME);
> >  }
> >
> > -static void umwait_control_msr_update(void)
> > +static void umwait_control_msr_update(void *unused)
> >  {
> >  	u32 msr_val;
> >
> > @@ -33,7 +34,9 @@ static void umwait_control_msr_update(void)
> >  /* Set up IA32_UMWAIT_CONTROL MSR on CPU using the current global setting. */
> >  static int umwait_cpu_online(unsigned int cpu)
> >  {
> > -	umwait_control_msr_update();
> > +	mutex_lock(&umwait_lock);
> > +	umwait_control_msr_update(NULL);
> > +	mutex_unlock(&umwait_lock);
> 
> What's the mutex for?  Can't you just use READ_ONCE?

umwait_control_msr_update() will write both umwait_c0_2_enabled and
umwait_max_time (which can also be changed through sysfs in the next patch)
to the IA32_UMWAIT_CONTROL MSR.

Just using READ_ONCE() for the two variables cannot guarantee all CPUs have
the same setting of C0.2 and max time. READ_ONCE() and WRITE_ONCE() can only
guarantee atomicity for reading and writing the same variable.

For e.g. without mutex protection:

initial values: umwait_c0_2_enabled=1 and umwait_max_time=10

1. umwait_cpu_online(X): read umwait_c0_2_enabled as 1
2. enable_c0_2_store(): umwait_c0_2_enabled = 0 and update all online CPUs
   as C0.2 disabled.
3. umwait_cpu_online(X): read umwait_max_time as 10
4. umwait_cpu_online(Y): read umwait_c0_2_enabled as 0
5. umwait_max_time_store(): umwait_max_time = 500 and update all online CPUs
   as max time = 500 cycles.
6. umwait_cpu_online(Y): read umwait_max_time as 500
7. umwait_cpu_online(X): wrmsr() enables C0.2 and sets max time 10 on CPU X
8. umwait_cpu_online(Y): disables C0.2 and sets max time 500 on CPU Y

With the mutex to protect the two variables and wrmsr(), each CPU will have the 
same setting of C0.2 and max time.
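
[To illustrate the torn-pair problem in a stand-alone way - a user-space
analogy added here for clarity, with hypothetical names mirroring the two
kernel variables:]

	#include <pthread.h>
	#include <stdio.h>

	/* Two settings that must always be applied as a consistent pair. */
	static int c0_2_enabled = 1;
	static unsigned int max_time = 10;
	static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;

	/* Writer (think sysfs store): changes both values under the lock. */
	static void set_config(int enable, unsigned int t)
	{
		pthread_mutex_lock(&cfg_lock);
		c0_2_enabled = enable;
		max_time = t;
		pthread_mutex_unlock(&cfg_lock);
	}

	/* Reader (think CPU online): snapshots the pair under the same lock.
	 * Lockless READ_ONCE()-style loads of each variable individually
	 * could observe (new enable, old max_time) between a writer's two
	 * stores - exactly the interleaving in steps 1-8 above. */
	static void apply_config(void)
	{
		pthread_mutex_lock(&cfg_lock);
		printf("apply: enable=%d max_time=%u\n", c0_2_enabled, max_time);
		pthread_mutex_unlock(&cfg_lock);
	}

	int main(void)
	{
		set_config(0, 500);
		apply_config();
		return 0;
	}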

> 
> > +static void umwait_control_msr_update_all_cpus(void)
> > +{
> > +   u32 msr_val;
> > +
> > +   msr_val = umwait_compute_msr_value();
> > +   /* All CPUs have same umwait control setting */
> > +   on_each_cpu(umwait_control_msr_update, NULL, 1);
> 
> Why are you calling umwait_compute_msr_value()?

umwait_compute_msr_value() computes the IA32_UMWAIT_CONTROL value from the
two variables umwait_c0_2_enabled and umwait_max_time. Either of the two
variables may have changed by the time umwait_control_msr_update_all_cpus()
is called, so the MSR value needs to be re-calculated and then written to
the MSR on all CPUs.

Thanks.

-Fenghua


RE: [PATCH v7 12/13] selftests/resctrl: Disable MBA and MBM tests for AMD

2019-05-10 Thread Yu, Fenghua
> On Friday, May 10, 2019 10:40 AM
> Andre Przywara [mailto:andre.przyw...@arm.com] wrote:
> Subject: Re: [PATCH v7 12/13] selftests/resctrl: Disable MBA and MBM tests
> for AMD
> 
> On Sat,  9 Feb 2019 18:50:41 -0800
> Fenghua Yu  wrote:
> 
> Hi,
> 
> > From: Babu Moger 
> >
> > For now, disable MBA and MBM tests for AMD. Deciding test pass/fail is
> > not clear right now. We can enable when we have some clarity.
> 
> I don't think this is the right way. The availability of features should be
> queryable, for instance by looking into /sys/fs/resctrl/info. Checking for a
> certain vendor to skip tests just sounds wrong to me, and is definitely not
> scalable or future proof.
> 
> We should really check the availability of a feature, then skip the whole
> subsystem test in resctrl_tests.c.

Babu may correct me if I'm wrong: AMD does support the MBA and MBM features,
so if you query the info directory, the features are there. But AMD doesn't
want to support testing of the features in the current patch set. They may
support the testing in the future.

Thanks.

-Fenghua


RE: [PATCH v7 00/13] selftests/resctrl: Add resctrl selftest

2019-05-10 Thread Yu, Fenghua
>On Friday, May 10, 2019 10:36 AM
>Andre Przywara [mailto:andre.przyw...@arm.com] wrote:
> On Sat,  9 Feb 2019 18:50:29 -0800
> Fenghua Yu  wrote:
> 
> Hi Fenghua, Babu,
> 
> > With more and more resctrl features are being added by Intel, AMD and
> > ARM, a test tool is becoming more and more useful to validate that
> > both hardware and software functionalities work as expected.
> 
> That's very much appreciated! We decided to use that tool here to detect
> regressions in James' upcoming resctrl rework series. While doing so we
> spotted some shortcomings:

This is great!

> - There is some unconditional x86 inline assembly which obviously breaks
> the build on ARM.

Will fix this as much as possible.

BTW, does ARM support perf imc_count events which are used in CAT tests?

> - The result output is somewhat confusing, I find it hard to see whether a
> test succeeded or not, also it's not easily parseable by scripts.
> - There are some basic tests missing (does the filesystem exist? Is the
> mount point visible?)

Is it OK to fold the basic tests into bigger tests? E.g. the filesystem and
mount point tests are actually part of the CAT test and we just dump the
sub-test results during the CAT test. Your test results seem to show the
sub-test results during the CAT test.

> - Some tests create files in the current directory. This is somewhat confusing
> if you happen to be in some sysfs directory, for instance. I don't think we
> really need temporary files (pipes should cover most of the cases), but in
> case we do, we should create them in some /tmp directory, probably by
> using a glibc function for this.

You are right. Will generate the temp file in /tmp.
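
[A minimal sketch of that - illustrative only; the function name and the
template string are hypothetical, mkstemp(3) is the glibc call meant above:]

	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	/* Create a unique scratch file under /tmp instead of the CWD. */
	static int open_scratch_file(char *path_out, size_t len)
	{
		char tmpl[] = "/tmp/resctrl_test_XXXXXX";
		int fd = mkstemp(tmpl);	/* fills in XXXXXX, opens O_RDWR|O_EXCL */

		if (fd < 0) {
			perror("mkstemp");
			return -1;
		}
		snprintf(path_out, len, "%s", tmpl);	/* caller unlinks it later */
		return fd;
	}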

> - There were some problems that newer GCC and clang complained about.

Which GCC do you use to see the issues? We use 8.2.1 and no compilation issue 
is reported.

> 
> For now I didn't look too closely at the actual test algorithms, but decided
> to fix those things first. I replied to some of the smaller issues I spotted 
> on
> the way in the individual patch mails.
> 
> For the bigger things (new tests, result formatting) I just went ahead and
> fixed them in my own branch [1] (work in progress!). I went with TAP
> formatting, because this is both readable by humans and can be processed
> by the "prove" tool (or any other simple UNIX tool, for that matter).
> 
> I used this amended version to do some regression testing on James'
> resctrl rework series, which involved boot testing more than 100 patches, so
> automation is key here. The result looks like this:
> ===
> TAP version 13
> ok kernel supports resctrl filesystem
> ok resctrl mountpoint "/sys/fs/resctrl" exists
> # resctrl filesystem not mounted
> # Starting MBM BW change ...
> ok MBM: bw change # SKIP memory bandwidth not supported
> # Starting MBA Schemata Change ...
> ok MBA: change-schemata # SKIP memory bandwidth not supported
> # Starting CQM test ...
> ok mounting resctrl to "/sys/fs/resctrl"
> ok CQM: test # SKIP L3 not supported
> # Starting CAT test ...
> ok mounting resctrl to "/sys/fs/resctrl"
> cache size :31457280
> ok writing benchmark parameters to resctrl FS
> ok Write schema "L3:0=7fff" to resctrl FS
> ok cache difference more than 4 %
> # Percent diff=16
> # Number of bits: 15
> # Avg_llc_perf_miss: 306994
> # Allocated cache lines: 368640
> ok CAT: test
> 1..11
> ===
> 
> Appreciate any feedback on this!
> 
> Cheers,
> Andre.
> 
> [1] http://linux-arm.org/git?p=linux-ap.git;a=shortlog;h=refs/heads/resctrl-selftests-wip
> git://linux-arm.org/linux-ap.git  branch: resctrl-selftests-wip




RE: [PATCH v7 02/13] selftests/resctrl: Add basic resctrl file system operations and data

2019-05-10 Thread Yu, Fenghua
> From: Andre Przywara [mailto:andre.przyw...@arm.com]
> On Sat,  9 Feb 2019 18:50:31 -0800
> Fenghua Yu  wrote:
> 
> Hi,
> 
> some comments inline.
> 
> > From: Sai Praneeth Prakhya 
> >
> > The basic resctrl file system operations and data are added for future
> > usage by resctrl selftest tool.
> >
> > Signed-off-by: Sai Praneeth Prakhya 
> > Signed-off-by: Arshiya Hayatkhan Pathan
> > 
> > Signed-off-by: Fenghua Yu 
> > Signed-off-by: Babu Moger 
> > ---
> >  tools/testing/selftests/resctrl/Makefile|  10 +
> >  tools/testing/selftests/resctrl/resctrl.h   |  48 +++
> >  tools/testing/selftests/resctrl/resctrlfs.c | 464
> > 
> >  3 files changed, 522 insertions(+)
> >  create mode 100644 tools/testing/selftests/resctrl/Makefile
> >  create mode 100644 tools/testing/selftests/resctrl/resctrl.h
> >  create mode 100644 tools/testing/selftests/resctrl/resctrlfs.c
> >
> > diff --git a/tools/testing/selftests/resctrl/Makefile
> > b/tools/testing/selftests/resctrl/Makefile
> > new file mode 100644
> > index ..bd5c5418961e
> > --- /dev/null
> > +++ b/tools/testing/selftests/resctrl/Makefile
> > @@ -0,0 +1,10 @@
> > +CC = gcc
> 
> Changing this to
> CC = $(CROSS_COMPILE)gcc
> make this cross compileable.
> 
> > +CFLAGS = -g -Wall
> 
> Can we add -O here? For once -O0 generates horrible code, but also -O
> tends to catch more bugs.
Sure.

> 
> > +
> > +*.o: *.c
> > +   $(CC) $(CFLAGS) -c *.c
> 
> This is a built-in rule in make, so you can remove it here:
> https://www.gnu.org/software/make/manual/html_node/Catalogue-of-
> Rules.html#Catalogue-of-Rules
Will do.

> 
> > +
> > +.PHONY: clean
> > +
> > +clean:
> > +   $(RM) *.o *~
> > diff --git a/tools/testing/selftests/resctrl/resctrl.h
> > b/tools/testing/selftests/resctrl/resctrl.h
> > new file mode 100644
> > index ..2e112934d48a
> > --- /dev/null
> > +++ b/tools/testing/selftests/resctrl/resctrl.h
> > @@ -0,0 +1,48 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#define _GNU_SOURCE
> > +#ifndef RESCTRL_H
> > +#define RESCTRL_H
> > +#include <...>  [the ~15 header names were dropped by the list archive]
> > +
> > +
> > +#define RESCTRL_PATH   "/sys/fs/resctrl"
> > +#define PHYS_ID_PATH   "/sys/devices/system/cpu/cpu"
> > +
> > +#define PARENT_EXIT(err_msg)   \
> > +   do {\
> > +   perror(err_msg);\
> > +   kill(ppid, SIGKILL);\
> > +   exit(EXIT_FAILURE); \
> > +   } while (0)
> > +
> > +pid_t bm_pid, ppid;
> > +
> > +int remount_resctrlfs(bool mum_resctrlfs);
> > +int get_resource_id(int cpu_no, int *resource_id);
> > +int validate_bw_report_request(char *bw_report);
> > +int validate_resctrl_feature_request(char *resctrl_val);
> > +int taskset_benchmark(pid_t bm_pid, int cpu_no);
> > +void run_benchmark(int signum, siginfo_t *info, void *ucontext);
> > +int write_schemata(char *ctrlgrp, char *schemata, int cpu_no,
> > +		   char *resctrl_val);
> > +int write_bm_pid_to_resctrl(pid_t bm_pid, char *ctrlgrp, char *mongrp,
> > +			    char *resctrl_val);
> > +int perf_event_open(struct perf_event_attr *hw_event, pid_t pid, int cpu,
> > +		    int group_fd, unsigned long flags);
> > +int run_fill_buf(unsigned long span, int malloc_and_init_memory, int memflush,
> > +		 int op, char *resctrl_va);
> > +
> > +#endif /* RESCTRL_H */
> > diff --git a/tools/testing/selftests/resctrl/resctrlfs.c
> > b/tools/testing/selftests/resctrl/resctrlfs.c
> > new file mode 100644
> > index ..5afcaa89f418
> > --- /dev/null
> > +++ b/tools/testing/selftests/resctrl/resctrlfs.c
> > @@ -0,0 +1,464 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +/*
> > + * Basic resctrl file system operations
> > + *
> > + * Copyright (C) 2018 Intel Corporation
> > + *
> > + * Authors:
> > + *Arshiya Hayatkhan Pathan 
> > + *Sai Praneeth Prakhya ,
> > + *Fenghua Yu 
> > + */
> > +#include "resctrl.h"
> > +
> > +#define RESCTRL_MBM	"L3 monitoring detected"
> > +#define RESCTRL_MBA	"MB allocation detected"
> > +enum {
> > +   RESCTRL_FEATURE_MBM,
> > +   RESCTRL_FEATURE_MBA,
> > +   MAX_RESCTRL_FEATURES
> > +};
> > +
> > +/*
> > + * remount_resctrlfs - Remount resctrl FS at /sys/fs/resctrl
> > + * @mum_resctrlfs: Should the resctrl FS be remounted?
> > + *
> > + * If not mounted, mount it.
> > + * If mounted and mum_resctrlfs then remount resctrl FS.
> > + * If mounted and !mum_resctrlfs then noop
> > + *
> > + * Return: 0 on success, non-zero on failure
> > + */
> > +int remount_resctrlfs(bool mum_resctrlfs)
> > +{
> > +   DIR *dp;
> > +   struct dirent *ep;
> > +   unsigned int count = 0;
> > +
> > +   /*
> > +* If kernel is built with CONFIG_RESCTRL, then /sys/fs/resctrl should
> > +* be present by default
> > +*/
> > +   dp 

RE: [PATCH v7 02/13] selftests/resctrl: Add basic resctrl file system operations and data

2019-05-10 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@alien8.de]
> On Sat, Feb 09, 2019 at 06:50:31PM -0800, Fenghua Yu wrote:
> > From: Sai Praneeth Prakhya 
> >
> > The basic resctrl file system operations and data are added for future
> > usage by resctrl selftest tool.
> >
> > Signed-off-by: Sai Praneeth Prakhya 
> > Signed-off-by: Arshiya Hayatkhan Pathan
> > 
> > Signed-off-by: Fenghua Yu 
> > Signed-off-by: Babu Moger 
> 
> And while feedback from Andre is being addressed, those SOB chains
> need to be fixed.

Will fix the SOB chains.

Thanks.

-Fenghua


RE: [PATCH v4 05/17] x86/cpufeatures: Enumerate IA32_CORE_CAPABILITIES MSR

2019-03-04 Thread Yu, Fenghua
> From: Hansen, Dave
> On 3/1/19 6:44 PM, Fenghua Yu wrote:
> > diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> > index 6d6122524711..350eeccd0ce9 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -349,6 +349,7 @@
> >  #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
> >  #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
> >  #define X86_FEATURE_ARCH_CAPABILITIES	(18*32+29) /* IA32_ARCH_CAPABILITIES MSR (Intel) */
> > +#define X86_FEATURE_CORE_CAPABILITY	(18*32+30) /* IA32_CORE_CAPABILITY MSR */
> >  #define X86_FEATURE_SPEC_CTRL_SSBD	(18*32+31) /* "" Speculative Store Bypass Disable */
> 
> What does this feature end up looking like in /proc/cpuinfo?

The flag string is "core_capability" in /proc/cpuinfo.

Thanks.

-Fenghua


RE: [PATCH v2 1/3] x86/cpufeatures: Enumerate user wait instructions

2019-02-21 Thread Yu, Fenghua
> From: Fenghua Yu [mailto:fenghua...@intel.com]
> On Wed, Feb 20, 2019 at 10:37:27PM -0800, Andy Lutomirski wrote:
> Or, to simplify the situation, how about we still use zero as the global max
> wait time (i.e. no limit on the global wait time), as implemented in this
> patch set? Users can still change the value to any value based on their
> usage. Isn't this the right way to take care of any usage?

Plus, if the scheduler timer works, umwait will be forced to time out by the
timer within HZ anyway.

Only in the case without a scheduler timer might the sysadmin need to set a
small max umwait time to force a timeout. But in this case (real time), I
doubt a user app really wants to call umwait to save power with the long
latency of waking up from umwait. So likely in this case the user app won't
call umwait and there is no use case for setting a small value for max
umwait time.

Thanks.

-Fenghua


RE: [PATCH v3 10/10] x86/split_lock: Handle #AC exception for split lock

2019-02-13 Thread Yu, Fenghua
> From: Ingo Molnar [mailto:mingo.kernel@gmail.com] On Behalf Of Ingo Molnar
> > On Mon, Feb 11, 2019 at 11:53:39AM +0100, Ingo Molnar wrote:
> > > > +		do_trap(trapnr, signr, str, regs, error_code,
> > > > +			BUS_ADRALN, NULL);
> > > > +	}
> > > > +}
> > >
> > > Is there any experience with how frequently this signal is killing
> > > user-space processes on a modern distro? Any expectation of how
> > > frequent such SIGBUS task terminations are going to be in the field?
> >
> > We did pretty intensive testing internally (zero day tests, many
> > engineers and testers use the patches in their daily work, etc). So
> > far I'm not aware of any user space process hitting a split lock issue.
> > I guess GCC or other compilers take care of the issue already. Inline
> > assembly code may hit the issue if the code is not right, but there is
> > less inline assembly code in user space.
> >
> > So far we only find two kernel split lock issues and fix them in the
> > first two patches in this patch set. We also find one BIOS split lock
> > issue internally which has been fixed in production BIOS.
> >
> > Thanks.
> 
> Ok, still, for binary compatibility's sake, could you please add a sysctl knob
> (root-only, etc.) and a kernel boot parameter that disables this check?

In this patch set, we already extend the kernel option "clearcpuid=" to
support CPU cap flags. So "clearcpuid=split_lock_detection" can disable the
split lock check during boot time.

In the next version, I will add "/sys/kernel/split_lock_detection" to enable
or disable the split lock check during run time.

Hopefully these work.

Thanks.

-Fenghua


RE: [PATCH v3 08/10] x86/setcpuid: Add kernel option setcpuid

2019-02-12 Thread Yu, Fenghua
> From: Peter Zijlstra [mailto:pet...@infradead.org]
> On Tue, Feb 12, 2019 at 02:51:00PM +0100, Thomas Gleixner wrote:
> > On Tue, 12 Feb 2019, Peter Zijlstra wrote:
> >
> > > On Mon, Feb 11, 2019 at 11:16:43AM -0800, Fenghua Yu wrote:
> > > > 4. The feature can be disabled by kernel option
> > > > "clearcpuid=split_lock_detection" during early boot time.
> > >
> > > IFF clearcpuid lives, it should also employ CPUID faulting and clear
> > > it for userspace too.
> >
> > We have it already,
> 
> D'0h right, I thought that was introduced here, but that's just extending it
> to multiple arguments.

In this patch set, patch #5 extends clearcpuid to support multiple options
and patch #6 extends clearcpuid to support CPU cap flag strings. So we
already have string support and can use "clearcpuid=split_lock_detection" to
disable the split lock feature at boot time.

Thanks.

-Fenghua


RE: [PATCH v2 3/3] x86/umwait: Control umwait maximum time

2019-02-08 Thread Yu, Fenghua
> From: Andrew Cooper [mailto:andrew.coop...@citrix.com]
> On 17/01/2019 00:00, Andy Lutomirski wrote:
> > On Wed, Jan 16, 2019 at 1:24 PM Fenghua Yu 
> wrote:
> >> IA32_UMWAIT_CONTROL[31:2] determines the maximum time in TSC-
> quanta
> >> that processor can stay in C0.1 or C0.2.
> >>
> >> The maximum time value in IA32_UMWAIT_CONTROL[31-2] is set as zero
> >> which means there is no global time limit for UMWAIT and TPAUSE
> instructions.
> >> Each process sets its own umwait maximum time as the instructions
> operand.
> >>
> >> User can specify global umwait maximum time through interface:
> >> /sys/devices/system/cpu/umwait_control/umwait_max_time
> >> The value in the interface is in decimal in TSC-quanta. Bits[1:0] are
> >> cleared when the value is stored.
> >>
> >> Signed-off-by: Fenghua Yu 
> >> ---
> >>  arch/x86/include/asm/msr-index.h |  2 ++
> >>  arch/x86/power/umwait.c          | 42 +++-
> >>  2 files changed, 43 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> >> index b56bfecae0de..42b9104fc15b 100644
> >> --- a/arch/x86/include/asm/msr-index.h
> >> +++ b/arch/x86/include/asm/msr-index.h
> >> @@ -62,6 +62,8 @@
> >>  #define MSR_IA32_UMWAIT_CONTROL	0xe1
> >>  #define UMWAIT_CONTROL_C02_BIT	0x0
> >>  #define UMWAIT_CONTROL_C02_MASK	0x0001
> >> +#define UMWAIT_CONTROL_MAX_TIME_BIT	0x2
> >> +#define UMWAIT_CONTROL_MAX_TIME_MASK	0xfffc
> >>
> >>  #define MSR_PKG_CST_CONFIG_CONTROL	0x00e2
> >>  #define NHM_C3_AUTO_DEMOTE		(1UL << 25)
> >> diff --git a/arch/x86/power/umwait.c b/arch/x86/power/umwait.c
> >> index 95b3867aac1e..4a1a507d3bb7 100644
> >> --- a/arch/x86/power/umwait.c
> >> +++ b/arch/x86/power/umwait.c
> >> @@ -10,6 +10,7 @@
> >>  #include
> >>
> >>  static int umwait_enable_c0_2 = 1; /* 0: disable C0.2. 1: enable C0.2. */
> >> +static u32 umwait_max_time; /* In TSC-quanta. Only bits [31:2] are used. */
> >>  static DEFINE_MUTEX(umwait_lock);
> >>
> >>  /* Return value that will be used to set umwait control MSR */
> >> @@ -20,7 +21,8 @@ static inline u32 umwait_control_val(void)
> >>  	 * When bit 0 is 1, C0.2 is disabled. Otherwise, C0.2 is enabled.
> >>  	 * So value in bit 0 is opposite of umwait_enable_c0_2.
> >>  	 */
> >> -	return ~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK;
> >> +	return (~umwait_enable_c0_2 & UMWAIT_CONTROL_C02_MASK) |
> >> +	       umwait_max_time;
> >>  }
> >>
> >>  static ssize_t umwait_enable_c0_2_show(struct device *dev,
> >> @@ -61,8 +63,46 @@ static ssize_t umwait_enable_c0_2_store(struct device *dev,
> >>
> >>  static DEVICE_ATTR_RW(umwait_enable_c0_2);
> >>
> >> +static ssize_t umwait_max_time_show(struct device *kobj,
> >> +				    struct device_attribute *attr,
> >> +				    char *buf)
> >> +{
> >> +	return sprintf(buf, "%u\n", umwait_max_time);
> >> +}
> >> +
> >> +static ssize_t umwait_max_time_store(struct device *kobj,
> >> +				     struct device_attribute *attr,
> >> +				     const char *buf, size_t count)
> >> +{
> >> +	u32 msr_val, max_time;
> >> +	int cpu, ret;
> >> +
> >> +	ret = kstrtou32(buf, 10, &max_time);
> >> +	if (ret)
> >> +		return ret;
> >> +
> >> +	mutex_lock(&umwait_lock);
> >> +
> >> +	/* Only get max time value from bits [31:2] */
> >> +	max_time &= UMWAIT_CONTROL_MAX_TIME_MASK;
> >> +	/* Update the max time value in memory */
> >> +	umwait_max_time = max_time;
> >> +	msr_val = umwait_control_val();
> >> +	get_online_cpus();
> >> +	/* All CPUs have same umwait max time */
> >> +	for_each_online_cpu(cpu)
> >> +		wrmsr_on_cpu(cpu, MSR_IA32_UMWAIT_CONTROL, msr_val, 0);
> >> +	put_online_cpus();
> >> +
> >> +	mutex_unlock(&umwait_lock);
> >> +
> >> +	return count;
> >> +}
> >> +
> >> +static DEVICE_ATTR_RW(umwait_max_time);
> >> +
> >>  static struct attribute *umwait_attrs[] = {
> >>  	&dev_attr_umwait_enable_c0_2.attr,
> >> +	&dev_attr_umwait_max_time.attr,
> >>  	NULL
> >>  };
> > You need something to make sure that newly onlined CPUs get the right
> > value in the MSR.  You also need to make sure you restore it on resume
> > from suspend.  Something like cpu_init() might be the right place.
> >
> > Also, as previously discussed, I think we should set the default to
> > something quite small, maybe 100 microseconds.  IMO the goal is to
> > pick a value that is a high enough multiple of the C0.2 entry+exit
> > latency that we get most of the power and SMT resource savings while
> > being small enough that no one things that UMWAIT is more than a
> > glorified, slightly improved, and far more misleading version of REP
> > NOP.
> >
> > Andrew, would having Linux default to a small value do much to
> > mitigate your concerns that 

RE: [PATCH v4 10/10] selftests/resctrl: Add the test in MAINTAINERS

2019-01-02 Thread Yu, Fenghua
> From: Moger, Babu [mailto:babu.mo...@amd.com]
> >  F: Documentation/x86/intel_rdt*
> 
> This patch does not apply cleanly.  We have changed the above names from
> intel_rdt to resctrl now. You need to re-base again.

That's right. Will rebase this patch.

Thanks.

-Fenghua


RE: [PATCH v4 08/10] selftests/resctrl Add Cache QoS Monitoring (CQM) selftest

2019-01-02 Thread Yu, Fenghua
> From: Moger, Babu [mailto:babu.mo...@amd.com]
>   for (int i = 0; i < argc; i++) {
>^
> resctrl_tests.c:39:56: note: previous declaration of 'i' was here
>   int res, c, core_id = 1, span = 250, argc_new = argc, i, no_of_bits = 5;
> ^
> resctrl_tests.c:46:2: error: 'for' loop initial declarations are only
> allowed in C99 mode
>   for (int i = 0; i < argc; i++) {
>   ^
> resctrl_tests.c:46:2: note: use option -std=c99 or -std=gnu99 to compile your
> code

Sure. Will fix this.

Thanks.

-Fenghua


RE: [REGRESSION][v4.14.y][v4.15] x86/intel_rdt/cqm: Improve limbo list processing

2018-01-16 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> On Tue, 16 Jan 2018, Joseph Salisbury wrote:
> > On 01/16/2018 08:32 AM, Shankar, Ravi V wrote:
> > > Vikas on vacation until end of the month. Fenghua will look into
> > > this issue.
> > >
> > > On Jan 16, 2018, at 5:09 AM, Thomas Gleixner  > > > wrote:
> > >
> > >>
> > >> Vikas, Fenghua can you please look at that ASAP?
> > >>
> > >> On Sun, 14 Jan 2018, Thomas Gleixner wrote:
> > >>
> > >>> On Fri, 12 Jan 2018, Joseph Salisbury wrote:
> > >>>
> >  Hi Vikas,
> > 
> >  A kernel bug report was opened against Ubuntu [0].  After a
> >  kernel bisect, it was found that reverting the following commit
> >  resolved this bug:
> > 
> >  commit 24247aeeabe99eab13b798c2dec066dd6f07
> >  Author: Vikas Shivappa  >  >
> >  Date:   Tue Aug 15 18:00:43 2017 -0700
> > 
> >      x86/intel_rdt/cqm: Improve limbo list processing
> > 
> > 
> >  The regression was introduced as of v4.14-r1 and still exists
> >  with current mainline.  The trace with v4.15-rc7 is in comment #44[1].
> > 
> >  I was hoping to get your feedback, since you are the patch
> >  author.  Do you think gathering any additional data will help
> >  diagnose this issue, or would it be best to submit a revert request?
> > >>>
> > >>> That stinks like a use after free. Can you run with KASAN enabled?
> > >>>
> > >>> Thanks,
> > >>>
> > >>>    tglx
> >
> >
> > Here is some data with KASAN enabled:
> > https://bugs.launchpad.net/ubuntu/+source/linux-
> hwe/+bug/1733662/comme
> > nts/51
> >
> > Are there any specific logs you would like to see, or specific actions
> > executed?
> 
> No, the KASAN output is pretty clear where the issue is.
> 
> Thanks,
> 
>   tglx

Is this a Haswell specific issue?

I run the following test forever without issue on Broadwell and 4.15.0-rc6 with 
rdt mounted:
for ((;;)); do
	for ((i=1;i<88;i++)); do
		echo 0 > /sys/devices/system/cpu/cpu$i/online
	done
	echo "online cpus:"
	grep processor /proc/cpuinfo | wc
	for ((i=1;i<88;i++)); do
		echo 1 > /sys/devices/system/cpu/cpu$i/online
	done
	echo "online cpus:"
	grep processor /proc/cpuinfo | wc
done

I'm looking for a Haswell to reproduce the issue.

Thanks.

-Fenghua


RE: [PATCH 4/6] x86/intel_rdt: Add two new resources for L2 Code and Data Prioritization (CDP)

2017-12-20 Thread Yu, Fenghua
> > When L2 CDP is enabled, the schemata will have the two resources in
> > this format:
> > L2DATA:l2id0=<cbm>;l2id1=<cbm>;
> > L2CODE:l2id0=<cbm>;l2id1=<cbm>;
> 
> Hi,
> 
> What do the <cbm> represent?

The <cbm> represents CBM (Cache Bit Mask) values in the schemata, similar to
all others (L2 CAT/L3 CAT/L3 CDP).

Thanks.

-Fenghua


RE: [PATCH v2] x86/cpufeatures: Enable new SSE/AVX/AVX512 cpu features

2017-10-31 Thread Yu, Fenghua
> On Tuesday, October 31, 2017 11:33 AM, Borislav Petkov wrote:
> On Tue, Oct 31, 2017 at 06:25:55PM +0000, Yu, Fenghua wrote:
> > We may need to send a patch to fix a few legacy names that don't match
> > exactly specs, e.g. AVX512VBMI as you mentioned.
> 
> Or we can make them all uniform and ignore the spec. It's not like they would
> be harder to grep afterwards.

Should we change the legacy names as well? User apps may already use the
names, and changing them may break the apps.

If we do make them all uniform, do you prefer adding "_" after AVX512?

Thanks.

-Fenghua


RE: [PATCH v2] x86/cpufeatures: Enable new SSE/AVX/AVX512 cpu features

2017-10-31 Thread Yu, Fenghua
> On Tuesday, October 31, 2017 11:03 AM, Yu, Fenghua wrote
> > On Tuesday, October 31, 2017 3:06 AM, Borislav Petkov wrote:
> > On Mon, Oct 30, 2017 at 06:20:29PM -0700, Gayatri Kammela wrote:
> > >  #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions */
> >
> > So we have previous AVX512 feature bits which do not separate AVX512
> > with a "_" but the new ones do. I think we should unify this and the
> > SDM should be fixed too.
> 
> This patch exactly follows the names in the spec.
> 
> As you said, the legacy code doesn't follow spec naming strictly and the spec
> doesn't have uniform naming convention either. We are contacting spec
> author to see if we can follow the same naming convention in the future
> specs.

The spec author doesn't want to change the legacy names to insert "_" in order 
to have uniform names. The "_" in a name is just for readability. So in the 
future specs, there will be mixed names, some with "_" and some without "_".

We may need to send a patch to fix a few legacy names that don't match exactly 
specs, e.g. AVX512VBMI as you mentioned.

Thanks.

-Fenghua


RE: [PATCH v2] x86/cpufeatures: Enable new SSE/AVX/AVX512 cpu features

2017-10-31 Thread Yu, Fenghua
> On Tuesday, October 31, 2017 3:06 AM, Borislav Petkov wrote:
> On Mon, Oct 30, 2017 at 06:20:29PM -0700, Gayatri Kammela wrote:
> >  #define X86_FEATURE_AVX512VBMI  (16*32+ 1) /* AVX512 Vector Bit Manipulation instructions */
> 
> So we have previous AVX512 feature bits which do not separate AVX512
> with a "_" but the new ones do. I think we should unify this and the SDM
> should be fixed too.

This patch exactly follows the names in the spec.

As you said, the legacy code doesn't follow spec naming strictly and the spec 
doesn't have uniform naming convention either. We are contacting spec author to 
see if we can follow the same naming convention in the future specs.

Thanks.

-Fenghua


RE: [PATCH 3/3] x86/intel_rdt: Fix potential deadlock during resctrl mount

2017-10-20 Thread Yu, Fenghua
> From: Chatre, Reinette on Friday, October 20, 2017 2:17 AM
> Subject: [PATCH 3/3] x86/intel_rdt: Fix potential deadlock during resctrl
> mount

Acked-by: Fenghua Yu <fenghua...@intel.com>



RE: [PATCH 2/3] x86/intel_rdt: Fix potential deadlock during resctrl unmount

2017-10-20 Thread Yu, Fenghua
> From: Chatre, Reinette on Friday, October 20, 2017 2:17 AM
> Subject: [PATCH 2/3] x86/intel_rdt: Fix potential deadlock during resctrl
> unmount
> 
Acked-by: Fenghua Yu <fenghua...@intel.com>


RE: [PATCH 00/14] Fix wrong %pF and %pS printk format specifier usages

2017-09-08 Thread Yu, Fenghua
> From: Sergey Senozhatsky [mailto:sergey.senozhatsky.w...@gmail.com]
> On (09/07/17 16:05), Luck, Tony wrote:
> +static inline bool __mod_text_address(struct module *mod, unsigned long addr)
> +{
> +	/* Make sure it's within the text section. */
> +	if (!within(addr, mod->init_layout.base, mod->init_layout.text_size) &&
> +	    !within(addr, mod->core_layout.base, mod->core_layout.text_size))
> +		return false;
> +	return true;
> +}

__mod_text_address() may be defined only on IA64, PPC64 and PARISC since
it's only called in those cases.

> +
>  #ifdef CONFIG_KALLSYMS
>  /*
>   * This ignores the intensely annoying "mapping symbols" found
> @@ -3942,6 +3952,14 @@ const char *module_address_lookup(unsigned long addr,
>  	preempt_disable();
>  	mod = __module_address(addr);
>  	if (mod) {
> +#if defined(CONFIG_IA64) || defined(CONFIG_PPC64) || defined(CONFIG_PARISC)
> +	unsigned long deref_addr;
> +
> +	if (!__mod_text_address(mod, addr))
> +		deref_addr = dereference_function_descriptor(addr);
> +	if (__mod_text_address(mod, deref_addr))
> +		addr = deref_addr;
> +#endif

Thanks.

-Fenghua


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-02-01 Thread Yu, Fenghua
> From: Andi Kleen [mailto:a...@firstfloor.org]
> "Luck, Tony"  writes:
> > 9)  Measure per logical CPU (pick active RMID in same precedence for
> task/cpu as CAT picks CLOSID)
> > 10) Put multiple CPUs into a group
> 
> I'm not sure this is a real requirement. It's just an optimization, right?
> If you can assign policies to threads, you can implicitly set it per CPU
> through affinity (or the other way around).
> The only benefit would be possibly less context switch overhead, but if all
> the thread (including idle) assigned to a CPU have the same policy it would
> have the same results.
> 
> I suspect dropping this would likely simplify the interface significantly.

Assigning a pid P to a CPU and monitoring P does not count all events
happening on that CPU. Processes/threads other than the assigned P
(e.g. kernel threads) can run on the CPU. So monitoring P assigned to the
CPU is not equal to monitoring the CPU in many cases.

Thanks.

-Fenghua


RE: [PATCH 00/12] Cqm2: Intel Cache quality monitoring fixes

2017-01-18 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> On Tue, 17 Jan 2017, Shivappa Vikas wrote:
> > On Tue, 17 Jan 2017, Thomas Gleixner wrote:
> > > On Fri, 6 Jan 2017, Vikas Shivappa wrote:
> > > > - Issue(1): Inaccurate data for per package data, systemwide. Just
> > > > prints zeros or arbitrary numbers.
> > > >
> > > > Fix: Patches fix this by just throwing an error if the mode is not
> > > > supported.
> > > > The modes supported is task monitoring and cgroup monitoring.
> > > > Also the per package
> > > > data for say socket x is returned with the -C  -G
> > > > cgrpy option.
> > > > The systemwide data can be looked up by monitoring root cgroup.
> > >
> > > Fine. That just lacks any comment in the implementation. Otherwise I
> > > would not have asked the question about cpu monitoring. Though I
> > > fundamentally hate the idea of requiring cgroups for this to work.
> > >
> > > If I just want to look at CPU X why on earth do I have to set up all
> > > that cgroup muck? Just because your main focus is cgroups?
> >
> > The upstream per cpu data is broken because its not overriding the
> > other task event RMIDs on that cpu with the cpu event RMID.
> >
> > Can be fixed by adding a percpu struct to hold the RMID thats
> > affinitized to the cpu, however then we miss all the task
> > llc_occupancy in that - still evaluating it.
> 
> The point here is that CQM is closely connected to the cache allocation
> technology. After a lengthy discussion we ended up having
> 
>   - per cpu CLOSID
>   - per task CLOSID
> 
> where all tasks which do not have a CLOSID assigned use the CLOSID which is
> assigned to the CPU they are running on.
> 
> So if I configure a system by simply partitioning the cache per cpu, which is
> the proper way to do it for HPC and RT usecases where workloads are
> partitioned on CPUs as well, then I really want to have an equaly simple way
> to monitor the occupancy for that reservation.
> 
> And looking at that from the CAT point of view, which is the proper way to do
> it, makes it obvious that CQM should be modeled to match CAT.
> 
> So lets assume the following:
> 
>CPU 0-3 default CLOSID 0
>CPU 4 CLOSID 1
>CPU 5 CLOSID 2
>CPU 6 CLOSID 3
>CPU 7 CLOSID 3
> 
>T1CLOSID 4
>T2CLOSID 5
>T3CLOSID 6
>T4CLOSID 6
> 
>All other tasks use the per cpu defaults, i.e. the CLOSID of the CPU
>they run on.
> 
> then the obvious basic monitoring requirement is to have a RMID for each
> CLOSID.

So the mapping between RMID and CLOSID is a 1:1 mapping, right?

Then the current resctrl interface in the kernel changes as follows:

1. In rdtgroup_mkdir() (i.e. creating a partition in resctrl), allocate one 
RMID for the partition. Then the mapping between RMID and CLOSID is set up in 
mkdir.
2. In rdtgroup_rmdir() (i.e. removing a partition in resctrl), free the RMID. 
Then the mapping between RMID and CLOSID is dismissed.
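
A minimal sketch of the idea (alloc_rmid(), free_rmid() and the rmid field are
hypothetical names for illustration, not from the patches):

	/* In rdtgroup_mkdir(): pair a fresh RMID with the new group's CLOSID. */
	rdtgrp->rmid = alloc_rmid();
	if (rdtgrp->rmid < 0)
		return rdtgrp->rmid;

	/* In rdtgroup_rmdir(): free the RMID; the RMID:CLOSID mapping is dismissed. */
	free_rmid(rdtgrp->rmid);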

In user space:
1. Create a partition in resctrl, allocate an L3 CBM in its schemata, and assign a 
PID to "tasks".
2. Start a user monitoring tool (e.g. perf) to monitor the PID. The monitoring 
tool needs to be updated to know the resctrl interface. We may update perf to 
work with the resctrl interface.

Since the PID is assigned to the partition which has a CLOSID and RMID mapping, 
the PID is monitored while it's running in the allocated portion of L3.

Is the above proposal the right way to go?

Thanks.

-Fenghua



RE: [PATCH] x86/intel_rdt: fix rdt_mount error handling

2016-11-08 Thread Yu, Fenghua
> From: Arnd Bergmann [mailto:a...@arndb.de]
> The newly introduced rdt_mount function returns an uninitialized pointer if
> rdtgroup_create_info_dir() fails:
> 
> arch/x86/kernel/cpu/intel_rdt_rdtgroup.c: In function ‘rdt_mount’:
> arch/x86/kernel/cpu/intel_rdt_rdtgroup.c:710:9: error: ‘dentry’ may be used
> uninitialized in this function [-Werror=maybe-uninitialized]
> 
> Clearly the intention was to propagate the error code here as we do in the
> other failure cases.
> 
> Fixes: 5ff193fbde20 ("x86/intel_rdt: Add basic resctrl filesystem support")
> Signed-off-by: Arnd Bergmann 
> ---
>  arch/x86/kernel/cpu/intel_rdt_rdtgroup.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> index a90ad22b9823..e66c7a58505e 100644
> --- a/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> +++ b/arch/x86/kernel/cpu/intel_rdt_rdtgroup.c
> @@ -691,8 +691,10 @@ static struct dentry *rdt_mount(struct
> file_system_type *fs_type,
>   closid_init();
> 
>   ret = rdtgroup_create_info_dir(rdtgroup_default.kn);
> - if (ret)
> + if (ret) {
> + dentry = ERR_PTR(ret);
>   goto out_cdp;
> + }
> 
>   dentry = kernfs_mount(fs_type, flags, rdt_root,
> RDTGROUP_SUPER_MAGIC, NULL);

The warning has already been fixed and pushed to the tip tree:
http://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/commit/?h=x86/cache&id=7bff0af51012500718971f9cc3485f67252353eb

Thanks.

-Fenghua


RE: [PATCH v6 00/10] Intel Cache Allocation Technology

2016-10-30 Thread Yu, Fenghua
> Gentlemen!
> 
> After more than two years of tinkering and real engineering, we finally have
> skinned the CAT!

Yeah! We made it!

> 
> That was the most amazing review journey I ever made as a maintainer. Just
> a few statistics:
> 
> Design variants: 6
> 
>   6 different approaches for a user space interface. 3 of them have been
>   actually implemented.
> 
>   Unfortunately the real interface discussion happened way after the first
>   rounds of patches had been sent and reviewed. See below.
> 
> Patchsets:  21
> 
>   21 patch sets were posted. These can be split into two generations.
> 
>   Gen1  16   Oct 2014 - Dec 2015
> 
>   Gen2   5   Jul 2016 - Oct 2016
> 
> LKML-Mails:   1216

Wow! You have such detailed statistics!

> 
>   That's the number of mails related to this project sent to LKML,
>   according to my archive. About 1/3 of those mails are the postings of
>   the patchsets alone.
> 
>   I cannot tell how many offlist mails have been sent around in total on
>   this matter, but at least in my personal mail are close to hundred.
> 
> Beers:   Uncountable
> 
>   This applies to both the number of beers consumed and the number of
> beers
>   owed.
> 
> I'm pretty happy with the final outcome of these patches and I want to say
> thanks to everyone!
> 
> I know that I've been a pain in the neck for some of you due to my pedantry
> about the details, but getting this wrong would have been a major disaster. If
> I offended someone personally in course of the sometimes heated
> discussions, then I offer my excuses.
> 
> Some lessons can be learned from this endeavour:
> 
>1) Chip vendors should give access to the full documentation early
> 
>2) Reviewers should never trust patch submitters, that they have read
>   the documentation correctly and came to the right conclusions how to
>   handle such a facility.
> 
>3) User space interface discussions should be done upfront with a full
>   explanation of the inner workings of such a facility and full
>   documentation available.
> 
> Anything else is just the usual churn of patch submissions, which are handled
> by the submitters with different effectiveness levels.
> 
> That said, all which needs to be done now is proper testing and a massive
> exposure of the user space interface to fuzzers. I've implemented my share
> of string parsers in the past and as careful as I was, there was always a 
> hole in
> them.

We have a few internal test cases and we are enhancing them and doing more
tests from our side.

> 
> If any of the involved folks are at KS/LPC then I suggest we get together at a
> bar during the week and drown the skinned CAT with the appropriate
> beverages. The first round of drinks is my shout.

I won't be in KS/LPC, unfortunately. But I really wish we could meet face to face
in the near future.

Is there any plan for you to travel to an Intel site soon? It would be great to
have a celebration beer with you and discuss what we are going to do next :)

Thanks.

-Fenghua


RE: [PATCH v4 08/18] x86/intel_rdt: Pick up L3/L2 RDT parameters from CPUID

2016-10-17 Thread Yu, Fenghua
> > > I wonder whether this is the proper abstraction level. We might as
> > > well do the following:
> > >
> > > rdtresources[] = {
> > >  {
> > >   .name   = "L3",
> > >  },
> > >  {
> > >   .name   = "L3Data",
> > >  },
> > >  {
> > >   .name   = "L3Code",
> > >  },
> > >
> > > and enable either L3 or L3Data+L3Code. Not sure if that makes things
> > > simpler, but it's definitely worth a thought or two.
> >
> > This way will be better than having cdp_enabled/capable for L3 and not
> > for L2.  And this doesn't change current userinterface design either,
> > I think.
> 
> User interface would change if you did this. The schemata file would look like
> this with CDP enabled:
> 
> # cat schemata
> L3Data:0=f;1=f;2=f;3=f
> L3Code:0=f;1=f;2=f;3=f
> 
> but that is easier to read than the current:
> 
> # cat schemata
> L3:0=f,f;1=f,f;2=f,f;3=f,f
> 
> which gives you no clue on which mask is code and which is data.

Right.

Also, changing to the uniform format <resource>:<cache_id1>=cbm1;<cache_id2>=cbm2;...
makes it a lot easier to parse the schemata line in CDP mode.

So I'll change the code and doc to have two new resources: L3Data and L3Code 
for CDP mode.
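
The resource table could then look roughly like this (a sketch following the
rdtresources[] suggestion quoted above; the enabled field is illustrative):

struct rdtresource rdtresources[] = {
	{
		.name		= "L3",		/* used when CDP is disabled */
		.enabled	= true,
	},
	{
		.name		= "L3Data",	/* this pair replaces "L3" ... */
		.enabled	= false,
	},
	{
		.name		= "L3Code",	/* ... when CDP is enabled */
		.enabled	= false,
	},
};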

> 
> We'd also end up with "info/L3Data/" and "info/L3code/"
> which would be a little redundant (since the files in each would contain the
> same numbers), but perhaps that is worth it to get the better schemata file.
> 
> -Tony


RE: [PATCH v2 10/33] x86/intel_rdt: Implement scheduling support for Intel RDT

2016-09-15 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Thursday, September 08, 2016 2:54 AM
> 
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
> > +extern struct static_key rdt_enable_key;
> > +void __intel_rdt_sched_in(void *dummy);
> > +
> >  struct clos_cbm_table {
> > unsigned long cbm;
> > unsigned int clos_refcnt;
> >  };
> >
> > +/*
> > + * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> > + *
> > + * Following considerations are made so that this has minimal impact
> > + * on scheduler hot path:
> > + * - This will stay as no-op unless we are running on an Intel SKU
> > + * which supports L3 cache allocation.
> > + * - When support is present and enabled, does not do any
> > + * IA32_PQR_MSR writes until the user starts really using the feature
> > + * ie creates a rdtgroup directory and assigns a cache_mask thats
> > + * different from the root rdtgroup's cache_mask.
> > + * - Caches the per cpu CLOSid values and does the MSR write only
> > + * when a task with a different CLOSid is scheduled in. That
> > + * means the task belongs to a different rdtgroup.
> > + * - Closids are allocated so that different rdtgroup directories
> > + * with same cache_mask gets the same CLOSid. This minimizes CLOSids
> > + * used and reduces MSR write frequency.
> > + */
> > +static inline void intel_rdt_sched_in(void) {
> > +   /*
> > +* Call the schedule in code only when RDT is enabled.
> > +*/
> > +   if (static_key_false(&rdt_enable_key))
> 
> static_branch_[un]likely() is the proper function to use.

How should I do this? Should I change the line to

+ if (static_branch_unlikely(&rdt_enable_key))
?
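
For reference, a sketch of how the whole pattern would look with the newer
static-branch API (assuming the key declaration is converted to
DEFINE_STATIC_KEY_FALSE() as well):

DEFINE_STATIC_KEY_FALSE(rdt_enable_key);

static inline void intel_rdt_sched_in(void)
{
	/* Stays a NOP until RDT is in use and the key is switched on. */
	if (static_branch_unlikely(&rdt_enable_key))
		__intel_rdt_sched_in(NULL);
}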

Thanks.

-Fenghua


RE: [PATCH v2 06/33] Documentation, x86: Documentation for Intel resource allocation user interface

2016-09-09 Thread Yu, Fenghua
> > Hmm, I don't know how applications are going to use the interface.
> > Nobody knows it right now. But we do have some candidate workloads
> > which want to configure the cache partition at runtime, so it's not
> > just a boot time stuff. I'm wondering why we have such limitation. The
> > framework is there, it's quite easy to implement process move in
> > kernel but fairly hard to get it right in userspace.
> 
> You are correct - if there is a need for this, it would be better done in the
> kernel.
> 
> I'm just not sure how to explain both a "procs" and "tasks" interface file in 
> a
> way that won't confuse people.
> 
> We have:
> 
> # echo {task-id} > tasks
>    ... adds a single task to this resource group
> # cat tasks
>   ... shows all the tasks in this resource group
> 
> and you want:
> 
> # echo {process-id} > procs
>... adds all threads in {process-id} to this resource group
> # cat procs
>   ... shows all processes (like "cat tasks" above, but only shows main thread 
> in
> a multi-threads process)

The advantage of "tasks" is that the user can allocate each thread into its own 
partition. The advantage of "procs" is the convenience of assigning just the 
thread group leader's pid; the rest of the thread group members go with the 
leader.

If having no "procs" is a real inconvenience, we may support "procs" in the future.

One way to implement this is to extend the current interface to accept a 
resctrl file system mount parameter that switches between "procs" and "tasks" 
at mount time. The file system then has either "procs" or "tasks" at run time; 
I don't think it's right to have both of them in the file system at the same time.
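
A rough sketch of the mount-time switch (the "procs" option string and the
rdtgrp_use_procs flag are hypothetical):

	/* In rdt_mount(): "-o procs" selects the per-process interface. */
	if (data && strstr(data, "procs"))
		rdtgrp_use_procs = true;	/* create "procs" instead of "tasks" */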

Is this the right way to go?

Thanks.

-Fenghua


RE: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection

2016-09-08 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> On Thu, Sep 08, 2016 at 02:57:01AM -0700, Fenghua Yu wrote:
> > From: Vikas Shivappa <vikas.shiva...@linux.intel.com>
> >
> > This patch includes CPUID enumeration routines for Cache allocation
> > and new values to track resources to the cpuinfo_x86 structure.
> >
> > Cache allocation provides a way for the Software (OS/VMM) to restrict
> > cache allocation to a defined 'subset' of cache which may be
> > overlapping with other 'subsets'. This feature is used when allocating
> > a line in cache ie when pulling new data into the cache. The
> > programming of the hardware is done via programming MSRs (model
> specific registers).
> >
> > Signed-off-by: Vikas Shivappa <vikas.shiva...@linux.intel.com>
> > Signed-off-by: Fenghua Yu <fenghua...@intel.com>
> > Reviewed-by: Tony Luck <tony.l...@intel.com>
> > ---
> 
> ...
> 
> > diff --git a/arch/x86/include/asm/cpufeatures.h
> > b/arch/x86/include/asm/cpufeatures.h
> > index 92a8308..62d979b9 100644
> > --- a/arch/x86/include/asm/cpufeatures.h
> > +++ b/arch/x86/include/asm/cpufeatures.h
> > @@ -12,7 +12,7 @@
> >  /*
> >   * Defines x86 CPU feature bits
> >   */
> > -#define NCAPINTS   18  /* N 32-bit words worth of info */
> > +#define NCAPINTS   19  /* N 32-bit words worth of info */
> >  #define NBUGINTS   1   /* N 32-bit bug flags */
> >
> >  /*
> > @@ -220,6 +220,7 @@
> >  #define X86_FEATURE_RTM( 9*32+11) /* Restricted Transactional
> Memory */
> >  #define X86_FEATURE_CQM( 9*32+12) /* Cache QoS Monitoring
> */
> >  #define X86_FEATURE_MPX( 9*32+14) /* Memory Protection
> Extension */
> > +#define X86_FEATURE_RDT( 9*32+15) /* Resource Director
> Technology */
> >  #define X86_FEATURE_AVX512F( 9*32+16) /* AVX-512 Foundation */
> >  #define X86_FEATURE_AVX512DQ   ( 9*32+17) /* AVX-512 DQ
> (Double/Quad granular) Instructions */
> >  #define X86_FEATURE_RDSEED ( 9*32+18) /* The RDSEED instruction
> */
> > @@ -286,6 +287,9 @@
> >  #define X86_FEATURE_SUCCOR (17*32+1) /* Uncorrectable error
> containment and recovery */
> >  #define X86_FEATURE_SMCA   (17*32+3) /* Scalable MCA */
> >
> > +/* Intel-defined CPU features, CPUID level 0x0010:0 (ebx), word
> > +18 */
> 
> Seems like this leaf is dedicated to CAT and has only 2 feature bits defined 
> in
> the SDM. Please use init_scattered_cpuid_features() instead of adding a
> whole CAP word.

Actually this leaf will be extended with more bits for allocating more 
resources.

Thanks.

-Fenghua


RE: [PATCH v2 07/33] x86/intel_rdt: Add support for Cache Allocation detection

2016-09-08 Thread Yu, Fenghua
> On Thu, 8 Sep 2016, Fenghua Yu wrote:
> > +   cpuid_count(0x0010, 1, &eax, &ebx, &ecx, &edx);
> > +   c->x86_l3_max_closid = edx + 1;
> > +   c->x86_l3_max_cbm_len = eax + 1;
> 
> According to the SDM:
> 
> EAX Bits  4:0:  Length of the capacity bit mask for the corresponding 
> ResID.
> Bits 31:05: Reserved
> 
> EDX   Bits 15:0:  Highest COS number supported for this ResID.
>   Bits 31:16: Reserved
> 
> So why are we assuming that bits 31-5 of EAX and 16-31 of EDX are going to
> be zero forever and if not that they are just extending the existing bits?
> If that's the case then we don't need to mask out the upper bits, but the
> code wants a proper comment about this.

You are right. We cannot assume the upper bits are always zero. I fixed the
issue by masking out the upper bits.
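
Concretely, the masked version would be something like this (a sketch; the
field widths follow the SDM excerpt quoted above):

	unsigned int eax, ebx, ecx, edx;

	cpuid_count(0x0010, 1, &eax, &ebx, &ecx, &edx);
	c->x86_l3_max_cbm_len = (eax & 0x1f) + 1;	/* EAX bits 4:0 */
	c->x86_l3_max_closid  = (edx & 0xffff) + 1;	/* EDX bits 15:0 */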

Thanks.

-Fenghua 


RE: [PATCH 13/32] Documentation, x86: Documentation for Intel resource allocation user interface

2016-08-04 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Tuesday, July 19, 2016 5:32 AM
> On Thu, 14 Jul 2016, Luck, Tony wrote:
> > So the core part of __intel_rdt_sched_in() will look like:
> >
> > rdtgrp = root_rdtgroup;
> That can be done simpler. The default cpu_rdtgroup should be
> root_rdtgroup. So you spare one conditional.
> 
> Thanks
> 
>   tglx

Hi, Thomas et al,

Do we need to consider using a generic schemas format instead of the current 
architecture-specific schemas format?

Currently we use CBM (Cache Bit Mask) in the "schemas". This is an 
architecture-specific format. Feedback I got is that this may not extend to 
hypothetical future architectures, that users/sysadmins may not have knowledge 
of CBM, and that it makes VM migration difficult between machines which have 
different CBM lengths.

Our current answer to this feedback is that a user tool/knowledge is needed to 
set up the schemas. The kernel only does CBM or architecture-level schemas. 
User management tools can be designed to hide all kernel CBM details and allow 
users to allocate cache with high-level knowledge (e.g. % of cache or size of 
cache, etc.). I believe a user-level tool can handle this and the kernel only 
needs to handle the minimal CBM level. This is how we have been designing this 
user interface.

But on the other hand, we could handle high-level schemas info in the kernel 
user interface as well. We could introduce allocation policies to the user 
interface: the user specifies an allocation policy at resctrl file system 
mount time, and each policy has its own driver in the kernel. The default 
policy uses the current CBM schemas implemented in this patch set. We could 
implement other drivers, and the schemas format may differ per driver: for 
example, a % policy driver which specifies the % of L3 allocated in the 
schemas, or a size policy driver which specifies the size of L3 in the 
schemas. People can write other creative policies in the future. Each driver 
has its own CONFIG.
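
To make that concrete, a rough sketch of a per-policy driver hook (all names
are hypothetical):

struct schema_policy {
	const char *name;	/* "cbm" (default), "percent", "size", ... */
	/* Parse one schemas line written by the user for this rdtgroup. */
	int (*parse_schema)(char *line, struct rdtgroup *rdtgrp);
	/* Format the group's current allocation back into schemas syntax. */
	int (*show_schema)(struct seq_file *sf, struct rdtgroup *rdtgrp);
};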

If we do this, the updated user interface can address the above concerns at 
the kernel level, plus it can handle cases (example??) that a user tool cannot 
handle in user space. Should we update the current user interface to do this? 
The code and documentation changes should be about 50 more lines on top of the 
current patch set.

Just raise the question for open discussion.

Thanks.

-Fenghua


RE: [PATCH 06/32] x86/intel_rdt: Hot cpu support for Cache Allocation

2016-07-14 Thread Yu, Fenghua
> 
> > +static inline void intel_rdt_cpu_start(int cpu) {
> > +   struct intel_pqr_state *state = &per_cpu(pqr_state, cpu);
> > +
> > +   state->closid = 0;
> > +   mutex_lock(&rdt_group_mutex);
> > +   if (rdt_cpumask_update(cpu))
> > +   smp_call_function_single(cpu, cbm_update_msrs, NULL, 1);
> > +   mutex_unlock(&rdt_group_mutex);
> 
> what happens if cpu's with a cache_id not available at boot comes online?

For L3, that case happens when a new socket is hot plugged into the platform.
We don't handle that right now because that needs platform support and I don't
have that kind of platform to test.

But maybe I can add that support in the code and test it in a simulated mode.
Basically that will create a new domain for the new cache_id.
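
In rough pseudo-code, the hot-add path would then gain something like this
(the helper names are hypothetical):

	/* In intel_rdt_cpu_start(): handle a CPU that brings a new cache_id. */
	id = get_l3_cache_id(cpu);
	d = find_domain(id);
	if (!d)
		d = alloc_domain(id);	/* new socket: init its CBM MSRs to defaults */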

Thanks.

-Fenghua


RE: [PATCH 30/32] x86/intel_rdt_rdtgroup.c: Process schemas input from rscctrl interface

2016-07-14 Thread Yu, Fenghua
> From: David Carrillo-Cisneros [mailto:davi...@google.com]
> > +static int get_res_type(char **res, enum resource_type *res_type) {
> > +   char *tok;
> > +
> > +   tok = strsep(res, ":");
> > +   if (tok == NULL)
> > +   return -EINVAL;
> > +
> > +   if (!strcmp(tok, "L3")) {
> 
> Maybe use strstrip to allow a more readable input ? i.e. "L3 :  "


> 
> > +   *res_type = RESOURCE_L3;
> > +   return 0;
> > +   }
> > +
> > +   return -EINVAL;
> > +}
> > +
> > +static int divide_resources(char *buf, char *resources[RESOURCE_NUM])
> > +{
> > +   char *tok;
> > +   unsigned int resource_num = 0;
> > +   int ret = 0;
> > +   char *res;
> > +   char *res_block;
> > +   size_t size;
> > +   enum resource_type res_type;
> > +
> > +   size = strlen(buf) + 1;
> > +   res = kzalloc(size, GFP_KERNEL);
> > +   if (!res) {
> > +   ret = -ENOSPC;
> 
> -ENOMEM?

Will change to -ENOMEM.

> 
> > +
> > +   res_block = res;
> > +   ret = get_res_type(&res_block, &res_type);
> > +   if (ret) {
> > +   pr_info("Unknown resource type!");
> > +   goto out;
> > +   }
> 
> does this work if res_block doesn't have ":"? don't you need to check
> res_block?

get_res_type() checks for ":" and returns -EINVAL if there is no ":".

> 
> > +static int get_cache_schema(char *buf, struct cache_resource *l, int level,
> > +struct rdtgroup *rdtgrp) {
> > +   char *tok, *tok_cache_id;
> > +   int ret;
> > +   int domain_num;
> > +   int input_domain_num;
> > +   int len;
> > +   unsigned int input_cache_id;
> > +   unsigned int cid;
> > +   unsigned int leaf;
> > +
> > +   if (!cat_enabled(level) && strcmp(buf, ";")) {
> > +   pr_info("Disabled resource should have empty schema\n");
> > +   return -EINVAL;
> > +   }
> > +
> > +   len = strlen(buf);
> > +   /*
> > +* Translate cache id based cbm from one line string with format
> > +* "<resource>:<cache id0>=<cbm>;<cache id1>=<cbm>;..." for
> > +* disabled cdp.
> > +* Or
> > +* "<resource>:<cache id0>=x,x;<cache id1>=x,x;..."
> > +* for enabled cdp.
> > +*/
> > +   input_domain_num = 0;
> > +   while ((tok = strsep(&buf, ";")) != NULL) {
> > +   tok_cache_id = strsep(&tok, "=");
> > +   if (tok_cache_id == NULL)
> > +   goto cache_id_err;
> 
> what if no "=" ? , also would be nice to allow spaces around "="  .

Without "=", it reports an id error. Sure, I can strip the spaces around "=".

> 
> > +
> > +   ret = kstrtouint(tok_cache_id, 16, &input_cache_id);
> > +   if (ret)
> > +   goto cache_id_err;
> > +
> > +   leaf = level_to_leaf(level);
> 
> why is this in the loop?

Leaf is the cache index number, which is contiguous starting from 0. We need to 
save and get cache id info from the cache index.

A <level, cache index> pair uniquely identifies a cache.

Architecturally, level alone cannot be used to identify a cache. There could be 2 
caches in one level, i.e. icache and dcache.

So we need to translate from level to leaf (cache index).
> 
> > +   cid = cache_domains[leaf].shared_cache_id[input_domain_num];
> > +   if (input_cache_id != cid)
> > +   goto cache_id_err;
> 
> so schemata must be present for all cache_id's and sorted in increasing order
> of cache_id? what's the point of having the cache_id# then?

Cache_id may not be contiguous, so it cannot be used directly as an array index.
In the user interface, the user inputs a cache_id to identify a cache. Internally 
the kernel uses a domain number, which is contiguous and used as the index for 
the internally saved cbm. The kernel interface code does the mapping between 
cache_id and domain number.
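
A minimal sketch of that mapping (max_domain is an illustrative field name;
shared_cache_id[] is from the code quoted above):

/* Map a user-visible cache id to the internal, contiguous domain index. */
static int cache_id_to_domain(unsigned int leaf, unsigned int cache_id)
{
	int d;

	for (d = 0; d < cache_domains[leaf].max_domain; d++)
		if (cache_domains[leaf].shared_cache_id[d] == cache_id)
			return d;

	return -EINVAL;
}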

> 
> > +
> > +/*
> > + * Check if the reference counts are all ones in rdtgrp's domain.
> > + */
> > +static bool one_refcnt(struct rdtgroup *rdtgrp, int domain) {
> > +   int refcnt;
> > +   int closid;
> > +
> > +   closid = rdtgrp->resource.closid[domain];
> > +   if (cat_l3_enabled) {
> 
> if cat_l3_enabled == false, then reference counts are always one?

I can change it to return false if cat_l3_enabled == false.

> 
> > + * Go through all shared domains. Check if there is an existing
> > +closid
> > + * in all rdtgroups that matches l3 cbms in the shared
> > + * domain. If find one, reuse the closid. Otherwise, allocate a new one.
> > + */
> > +static int get_rdtgroup_resources(struct resources *resources_set,
> > + struct rdtgroup *rdtgrp) {
> > +   struct cache_resource *l3;
> > +   bool l3_cbm_found;
> > +   struct list_head *l;
> > +   struct rdtgroup *r;
> > +   u64 cbm;
> > +   int rdt_closid[MAX_CACHE_DOMAINS];
> > +   int rdt_closid_type[MAX_CACHE_DOMAINS];
> > +   int domain;
> > +   int closid;
> > +   int ret;
> > +
> > +   l3 = resources_set->l3;
> 
> l3 is NULL if 

RE: [PATCH 30/32] x86/intel_rdt_rdtgroup.c: Process schemas input from rscctrl interface

2016-07-14 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 11:11 PM
> On Wed, 13 Jul 2016, David Carrillo-Cisneros wrote:
> > > +static void free_cache_resource(struct cache_resource *l) {
> > > +   kfree(l->cbm);
> > > +   kfree(l->cbm2);
> > > +   kfree(l->closid);
> > > +   kfree(l->refcnt);
> >
> > this function is used to clean up alloc_cache_resource in the error
> > path of get_resources where it's not necessarily true that all of l's
> > members were allocated.
> 
> kfree handles kfree(NULL) nicely.

Yes, that's right. If I check the pointer before kfree(), checkpatch.pl will
report a warning and suggest that kfree(NULL) is safe and the code is shorter.

Thanks.

-Fenghua


RE: [PATCH 24/32] Task fork and exit for rdtgroup

2016-07-13 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 2:03 PM
> On Wed, 13 Jul 2016, Yu, Fenghua wrote:
> > On Wed, July 2016, Thomas Gleixner wrote
> > > On Tue, 12 Jul 2016, Fenghua Yu wrote:
> > > > +void rdtgroup_post_fork(struct task_struct *child) {
> > > > +   if (!use_rdtgroup_tasks)
> > > > +   return;
> > > > +
> > > > +   spin_lock_irq(&rdtgroup_task_lock);
> > > > +   if (list_empty(&child->rg_list)) {
> > >
> > > Why would the list be non empty after a fork?
> >
> > In this situation for a pid:
> > 1.rdtgroup_fork(): rg_list=null.
> > 2.setup_task_rg_lists(): rg_list is setup
> > 3.rdtgroup_fork(): rg_list is not empty
> 
> Why would rdtgroup_fork() be called twice for a given thread?
> 
> > This situation happens only during rscctrl mount time. Before mount,
> > post_fork() returns from !use_rdtgroup_tasks and doesn't set up
> > rg_list. After mount, rg_list() is always empty in post_fork(). But we need
> to check rg_list for above situation.
> >
> > Does that make sense?
> 
> No, that does not make any sense at all.
> 
> > Any suggestion for a better solution?
> 
> The problem you have is:
> 
> fork
>   list_init(rg_list);
>   write_lock(tasklist_lock);
> 
> task becomes visible
> 
>   write_unlock(tasklist_lock);
> 
>   rdtgroup_post_fork();
>   if (!use_rdtgroup_tasks)
>  return;
> 
>   spin_lock_irq(&rdtgroup_task_lock);
>   list_add();
>   spin_unlock_irq(&rdtgroup_task_lock);
> 
> I have no idea why this lock must be taken with _irq, but that's another story.
> Let's look at the mount side:
> 
>spin_lock_irq(&rdtgroup_task_lock);
>read_lock(&tasklist_lock);
> 
>do_each_thread(g, p) {
>  WARN_ON(More magic crap happening there)
> 
>  spin_lock_irq(&p->sighand->siglock);
>  list_add();
>  spin_unlock_irq(&p->sighand->siglock);
> 
> Great: You undo the irq disable of (&rdtgroup_task_lock) above! Oh well
> 
>read_unlock(&tasklist_lock);
>spin_unlock_irq(&rdtgroup_task_lock);
> 
> So you need all this magic in rdtgroup_post_fork() and setup_task_rg_lists()
> just because you blindly positioned rdtgroup_post_fork() at the point where
> the cgroup_post_fork() stuff is. But you did not think a second about the
> locking rules here otherwise they would be documented somewhere.
> 
> You need a read_lock(&tasklist_lock) for the mount part anyway. So why
> don't you do the obvious:
> 
> fork
>   list_init(rg_list);
>   write_lock(tasklist_lock);
> 
>   rdtgroup_post_fork();
>   if (use_rdtgroup_tasks)
>   spin_lock(&rdtgroup_task_lock);
>   list_add();
>   spin_unlock(&rdtgroup_task_lock);
> 
> task becomes visible
> 
>   write_unlock(tasklist_lock);
> 
> And reorder the lock ordering in the mount path:
> 
>read_lock(&tasklist_lock);
>spin_lock(&rdtgroup_task_lock);
> 
> Now using rdtgroup_task_lock to protect current->rdtgroup is horrible as
> well. You need task->sighand->siglock in the mount path anyway to prevent
> exit races. So you can simplify the whole magic to:
> 
> fork
>   list_init(rg_list);
>   write_lock(tasklist_lock);
> 
> spin_lock(&current->sighand->siglock);
> 
>   rdtgroup_post_fork();
>   if (use_rdtgroup_tasks)
>   list_add();
> 
>   spin_unlock(&current->sighand->siglock);
>   write_unlock(tasklist_lock);
> 
> That removes an extra lock/unlock operation from the fork path because
> current->sighand->siglock is taken inside of the tasklist_lock write
> locked section already.
> 
> So you need protection for use_rdtgroup_task, which is a complete
> misnomer btw. (rdtgroup_active would be too obvious, right?). That
> protection is simple because you can set that flag with tasklist_lock read
> locked which you hold anyway for iterating all threads in the mount path.
> 
> Aside of that you need to take tsk->sighand->siglock when you change
> tsk->rdtgroup, but that's a no-brainer and it gives you the extra benefit that
> you can protect such an operation against exit of the task that way by
> checking PF_EXITING under the lock. I don't see any protection against exit in
> your current implementation when a task is moved to a different partition.
> 
> Please sit down and describe the complete locking and protection scheme of
> this stuff. I'm not going to figure this out from the obscure code another 
> time.
> 
> Thanks,
> 
>   tglx

Sure, I'll rethink the locking and protection scheme for the tasks.

Thanks.

-Fenghua


RE: [PATCH 08/32] Define CONFIG_INTEL_RDT

2016-07-13 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 2:10 PM
> On Wed, 13 Jul 2016, Yu, Fenghua wrote:
> > Here is why this patch behaves like this:
> >
> > This patch, and in fact the first 12 patches, come directly from last
> > year's cgroup-based CAT patch series. That series had already gone
> > through 16 versions. Because the first 12 patches have been reviewed many
> > times, we kept them untouched (except for removing cgroup code in patch 8
> > and some unused CDP code in patch 11) and released the other patches on
> > top of the first 12.
> 
> Which is not making the review any simpler. In order to understand the
> modifications I have to go back and page in the original stuff from last year
> once again. So I have to read the original patch first to understand the
> modifications and then get the overall picture of the new stuff. Please fold
> stuff back to the proper places so I can start reviewing this thing under the
> new design idea instead of twisting my brain around two designs.

Ok. I will do that.

> 
> > I fully agree this patch should be split if we want to have a good
> > overall patch series.
> 
> Good.
> 
> Thanks,
> 
>   tglx


RE: [PATCH 08/32] Define CONFIG_INTEL_RDT

2016-07-13 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 3:26 AM
> On Tue, 12 Jul 2016, Fenghua Yu wrote:
> 
> Subject: Define CONFIG_INTEL_RDT
> 
> That does not qualify as a proper patch subject
> 
> > From: Vikas Shivappa 
> >
> > CONFIG_INTEL_RDT is defined.
> 
> That tells us what?
> 
> > --- a/arch/x86/include/asm/intel_rdt.h
> > +++ b/arch/x86/include/asm/intel_rdt.h
> > @@ -24,8 +24,16 @@ struct clos_cbm_table {
> >   * on scheduler hot path:
> >   * - This will stay as no-op unless we are running on an Intel SKU
> >   * which supports L3 cache allocation.
> > + * - When support is present and enabled, does not do any
> > + * IA32_PQR_MSR writes until the user starts really using the feature
> > + * ie creates a rdtgroup directory and assigns a cache_mask thats
> > + * different from the root rdtgroup's cache_mask.
> >   * - Caches the per cpu CLOSid values and does the MSR write only
> > - * when a task with a different CLOSid is scheduled in.
> > + * when a task with a different CLOSid is scheduled in. That
> > + * means the task belongs to a different rdtgroup.
> > + * - Closids are allocated so that different rdtgroup directories
> > + * with same cache_mask gets the same CLOSid. This minimizes CLOSids
> > + * used and reduces MSR write frequency.
> 
> How is this and the following changes related to $subject ?

No, this piece of code is not related to $subject.

Here is why this patch behaves like this:

This patch, and in fact the first 12 patches, come directly from last year's
cgroup-based CAT patch series. That series had already gone through 16
versions. Because the first 12 patches have been reviewed many times, we kept
them untouched (except for removing cgroup code in patch 8 and some unused CDP
code in patch 11) and released the other patches on top of the first 12.

I fully agree this patch should be split if we want to have a good overall
patch series.

Thanks.

-Fenghua
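
A hedged sketch of the per-CPU CLOSid caching the quoted comment describes:
MSR_IA32_PQR_ASSOC is the real MSR, while task_closid() and the per-CPU
variable are illustrative assumptions.

#include <linux/percpu.h>
#include <linux/sched.h>
#include <asm/msr.h>

static u32 task_closid(struct task_struct *t);	/* hypothetical accessor */

static DEFINE_PER_CPU(u32, pqr_cached_closid);

static inline void rdt_sched_in(struct task_struct *next)
{
	u32 closid = task_closid(next);

	/* Stays a no-op unless the incoming task's CLOSid differs. */
	if (closid != this_cpu_read(pqr_cached_closid)) {
		this_cpu_write(pqr_cached_closid, closid);
		/* The CLOSid lives in the upper half (EDX) of IA32_PQR_ASSOC. */
		wrmsr(MSR_IA32_PQR_ASSOC, 0, closid);
	}
}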


RE: [PATCH 19/32] sched.h: Add rg_list and rdtgroup in task_struct

2016-07-13 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 5:56 AM
> On Tue, 12 Jul 2016, Fenghua Yu wrote:
> > From: Fenghua Yu <fenghua...@intel.com>
> >
> > rg_list is linked list to connect to other tasks in a rdtgroup.
> 
> Can you please prefix that member proper, e.g. rdt_group_list

There is a similar name in task_struct already: cg_list, for the cgroup list.
I just followed that convention and used rg_list for the rdtgroup list.

If you think rdt_group_list is a better name, I will certainly change rg_list
to rdt_group_list.

> 
> > The point of rdtgroup allows the task to access its own rdtgroup directly.
> 
> 'pointer' perhaps? Also please use a proper prefix: rdt_group
> 
> A proper description might be:
> 
>   rdt_group points to the rdt group to which the task belongs.

In the patch set, I use the name "rdtgroup" to represent a rdt group.

Should I change the name "rdtgroup" to "rdt_group" in all patches,
including descriptions and code?

Thanks.

-Fenghua
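
For illustration, the two task_struct additions under discussion might look
like the sketch below; the final field names were still being debated above.

struct task_struct {
	/* ... existing fields ... */
#ifdef CONFIG_INTEL_RDT
	struct list_head rg_list;	/* links the task into its rdtgroup */
	struct rdtgroup *rdtgroup;	/* rdt group this task belongs to */
#endif
	/* ... */
};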


RE: [PATCH 23/32] x86/intel_rdt.c: Extend RDT to per cache and per resources

2016-07-13 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, July 13, 2016 6:07 AM
> To: Yu, Fenghua <fenghua...@intel.com>
> Cc: Ingo Molnar <mi...@elte.hu>; Anvin, H Peter
> <h.peter.an...@intel.com>; Luck, Tony <tony.l...@intel.com>; Tejun Heo
> <t...@kernel.org>; Borislav Petkov <b...@suse.de>; Stephane Eranian
> <eran...@google.com>; Peter Zijlstra <pet...@infradead.org>; Marcelo
> Tosatti <mtosa...@redhat.com>; David Carrillo-Cisneros
> <davi...@google.com>; Shankar, Ravi V <ravi.v.shan...@intel.com>; Vikas
> Shivappa <vikas.shiva...@linux.intel.com>; Prakhya, Sai Praneeth
> <sai.praneeth.prak...@intel.com>; linux-kernel <linux-ker...@vger.kernel.org>; x86 <x...@kernel.org>
> Subject: Re: [PATCH 23/32] x86/intel_rdt.c: Extend RDT to per cache and per
> resources
> 
> On Tue, 12 Jul 2016, Fenghua Yu wrote:
> 
> > From: Fenghua Yu <fenghua...@intel.com>
> >
> > QoS mask MSRs array is per cache. We need to allocate CLOSID per cache
> > instead global CLOSID.
> >
> > A few different resources can share same QoS mask MSRs array. For
> > example, one L2 cache can share QoS MSRs with its next level
> > L3 cache. A domain number represents the L2 cache, the L3 cache, the
> > L2 cache's shared cpumask, and the L3 cache's shared cpumask.
> >
> > cctable is extended to be indexed by domain number so that each cache
> > has its own control table.
> >
> > shared_domain is introduced to cover multiple resources sharing
> > CLOSID.
> 
> This patch does a dozen different things at once. Can you please split the
> cleanup parts, the parts where statics are removed and the actual changes in
> the representation model apart so this can be reviewed?
> 
> And while at this, please fold back these cleanups into the original patches.
> There is no point to have these changes seperate.

Sure. Will do this.

Thanks.

-Fenghua
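
A sketch of the domain-indexed control table described in the quoted commit
message; the names come from the series, the layout is an assumption.

struct clos_cbm_table {
	unsigned long cbm;		/* cache bit mask for this CLOSid */
	unsigned int clos_refcnt;	/* rdtgroups sharing this CLOSid */
};

/*
 * cctable[domain][closid]: each cache domain has its own table, so
 * CLOSIDs are allocated per cache instead of globally.
 */
static struct clos_cbm_table **cctable;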


RE: [PATCH v2 2/3] Documentation, ABI: Add a document entry for cache id

2016-07-08 Thread Yu, Fenghua
> From: Ingo Molnar [mailto:mingo.kernel@gmail.com] On Behalf Of Ingo
> Molnar
> Sent: Friday, July 08, 2016 1:42 AM
> * Fenghua Yu <fenghua...@intel.com> wrote:
> 
> > From: Fenghua Yu <fenghua...@intel.com>
> >
> > Add an ABI document entry for
> /sys/devices/system/cpu/cpu*/cache/index*/id.
> >
> > Signed-off-by: Fenghua Yu <fenghua...@intel.com>
> > ---
> >  Documentation/ABI/testing/sysfs-devices-system-cpu | 13
> +
> >  1 file changed, 13 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-devices-system-cpu
> b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > index 1650133..cc62034 100644
> > --- a/Documentation/ABI/testing/sysfs-devices-system-cpu
> > +++ b/Documentation/ABI/testing/sysfs-devices-system-cpu
> > @@ -272,6 +272,19 @@ Description:   Parameters for the CPU cache
> attributes
> >  the modified cache line is written to main
> >  memory only when it is replaced
> >
> > +
> > +What:  /sys/devices/system/cpu/cpu*/cache/index*/id
> > +Date:  July 2016
> > +Contact:   Linux kernel mailing list <linux-kernel@vger.kernel.org>
> > +Description:   Cache id
> > +
> > +   The id identifies a cache in the platform. In same index, the id
> > +   is unique across the platform.
> 
> What does 'In same index' mean?

It means one cache's id is unique among all caches with the same cache index 
number. For example, among all caches with index 3 (i.e. level 3), cache id 0 
uniquely identifies one L3 cache. But among caches with index 0 (i.e. level 0), 
there is also a cache id 0. So a cache id is unique within one index, but not 
across two different indexes.

Does that make sense? I hope I express that correctly.

Thanks.

-Fenghua
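
To make the point above concrete: a cache is uniquely named platform-wide
only by the pair (index level, id). A tiny C sketch of that compound key:

/* Sketch: (level, id) is the platform-wide unique name of a cache. */
struct cache_key {
	unsigned int level;	/* e.g. 3 for an L3 (index3) */
	unsigned int id;	/* unique only among caches of this level */
};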


RE: [PATCH v2 0/3] Cache id

2016-07-07 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Thursday, July 07, 2016 9:21 AM
> On Wed, Jul 06, 2016 at 03:07:15PM -0700, Fenghua Yu wrote:
> > From: Fenghua Yu <fenghua...@intel.com>
> >
> > This patch set introduces cache id to identify a cache in platform. It
> > can be useful in such areas as Cache Allocation Technology (CAT) where
> > user needs to specify how much cache is allocated on which cache.
> > Cache id provides a concise way to identify the cache. CAT patches
> > will be released separately.
> >
> > Changes:
> > v2: Split one patch into three patches and add ABI documentation.
> >
> > Fenghua Yu (3):
> >   cacheinfo: Introduce cache id
> >   Documentation, ABI: Add a document entry for cache id
> >   x86, intel_cacheinfo: Enable cache id in x86
> >
> >  Documentation/ABI/testing/sysfs-devices-system-cpu | 13
> +
> >  arch/x86/kernel/cpu/intel_cacheinfo.c  | 20
> 
> >  drivers/base/cacheinfo.c   |  5 +
> >  include/linux/cacheinfo.h  |  3 +++
> >  4 files changed, 41 insertions(+)
> 
> All 3:
> 
> Acked-by: Borislav Petkov <b...@suse.de>

That's great!

Is it possible to merge the patches to 4.8? Then I don't need to carry these 
patches with upcoming CAT enabling patches:)

Thanks.

-Fenghua


RE: [PATCH] cacheinfo: Introduce cache id

2016-07-01 Thread Yu, Fenghua

> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Friday, July 01, 2016 11:36 AM
> To: Yu, Fenghua 
> Cc: Luck, Tony ; Thomas Gleixner
> ; Ingo Molnar ; Anvin, H Peter
> ; Peter Zijlstra ;
> Stephane Eranian ; Shankar, Ravi V
> ; Vikas Shivappa
> ; linux-kernel <linux-ker...@vger.kernel.org>; x86 
> Subject: Re: [PATCH] cacheinfo: Introduce cache id
> 
> On Fri, Jul 01, 2016 at 06:01:06PM +, Yu, Fenghua wrote:
> > Cache id is unique on the same level of cache across platform.
> 
> Btw, I forgot to ask in the reply to Tony: how are those cache IDs exactly
> going to be used? An example please...
> 
> > Could you check cpu#/cache/index#/shared_cpu_map or shared_cpu_list.
> 
> Bah, my kvm is on an AMD guest - forget what it says :-)

I haven't tested on AMD. But I think AMD should have the same code.

Could you please check if 
/sys/device/system/cpu/cpu#/cache/index#/shared_cpu_map contains only the cpu 
itself? From the cache ids you dumped on KVM, it appears there are 8 of cache 
leaf 0, 8 of cache leaf 1, and 8 of cache leaf 2. That means each CPU has its 
own 3 caches (I can't tell from the cache id which level they are; 
cache/index#/level will tell that).

If that is not the case, maybe cache id doesn't work on AMD, or maybe I didn't 
enable cache id for AMD.

> 
> > Cache id is unique on the same level across platform.
> >
> > #find /sys/device/system/cpu/. -name id|xargs cat
> 
> Btw, you could do
> 
> grep . /sys/devices/system/cpu/cpu*/cache/index*/id
> 
> and it'll give you the absolute filepaths too, so that the output is parseable
> easier.

Thanks for the command. Here is nicer info:)

From the info, cache id on one level is unique on that level across the board.

/sys/devices/system/cpu/cpu0/cache/index0/id:0
/sys/devices/system/cpu/cpu0/cache/index1/id:0
/sys/devices/system/cpu/cpu0/cache/index2/id:0
/sys/devices/system/cpu/cpu0/cache/index3/id:0
/sys/devices/system/cpu/cpu10/cache/index0/id:17
/sys/devices/system/cpu/cpu10/cache/index1/id:17
/sys/devices/system/cpu/cpu10/cache/index2/id:17
/sys/devices/system/cpu/cpu10/cache/index3/id:0
/sys/devices/system/cpu/cpu11/cache/index0/id:18
/sys/devices/system/cpu/cpu11/cache/index1/id:18
/sys/devices/system/cpu/cpu11/cache/index2/id:18
/sys/devices/system/cpu/cpu11/cache/index3/id:0
/sys/devices/system/cpu/cpu12/cache/index0/id:19
/sys/devices/system/cpu/cpu12/cache/index1/id:19
/sys/devices/system/cpu/cpu12/cache/index2/id:19
/sys/devices/system/cpu/cpu12/cache/index3/id:0
/sys/devices/system/cpu/cpu13/cache/index0/id:20
/sys/devices/system/cpu/cpu13/cache/index1/id:20
/sys/devices/system/cpu/cpu13/cache/index2/id:20
/sys/devices/system/cpu/cpu13/cache/index3/id:0
/sys/devices/system/cpu/cpu14/cache/index0/id:24
/sys/devices/system/cpu/cpu14/cache/index1/id:24
/sys/devices/system/cpu/cpu14/cache/index2/id:24
/sys/devices/system/cpu/cpu14/cache/index3/id:0
/sys/devices/system/cpu/cpu15/cache/index0/id:25
/sys/devices/system/cpu/cpu15/cache/index1/id:25
/sys/devices/system/cpu/cpu15/cache/index2/id:25
/sys/devices/system/cpu/cpu15/cache/index3/id:0
/sys/devices/system/cpu/cpu16/cache/index0/id:26
/sys/devices/system/cpu/cpu16/cache/index1/id:26
/sys/devices/system/cpu/cpu16/cache/index2/id:26
/sys/devices/system/cpu/cpu16/cache/index3/id:0
/sys/devices/system/cpu/cpu17/cache/index0/id:27
/sys/devices/system/cpu/cpu17/cache/index1/id:27
/sys/devices/system/cpu/cpu17/cache/index2/id:27
/sys/devices/system/cpu/cpu17/cache/index3/id:0
/sys/devices/system/cpu/cpu18/cache/index0/id:32
/sys/devices/system/cpu/cpu18/cache/index1/id:32
/sys/devices/system/cpu/cpu18/cache/index2/id:32
/sys/devices/system/cpu/cpu18/cache/index3/id:1
/sys/devices/system/cpu/cpu19/cache/index0/id:33
/sys/devices/system/cpu/cpu19/cache/index1/id:33
/sys/devices/system/cpu/cpu19/cache/index2/id:33
/sys/devices/system/cpu/cpu19/cache/index3/id:1
/sys/devices/system/cpu/cpu1/cache/index0/id:1
/sys/devices/system/cpu/cpu1/cache/index1/id:1
/sys/devices/system/cpu/cpu1/cache/index2/id:1
/sys/devices/system/cpu/cpu1/cache/index3/id:0
/sys/devices/system/cpu/cpu20/cache/index0/id:34
/sys/devices/system/cpu/cpu20/cache/index1/id:34
/sys/devices/system/cpu/cpu20/cache/index2/id:34
/sys/devices/system/cpu/cpu20/cache/index3/id:1
/sys/devices/system/cpu/cpu21/cache/index0/id:35
/sys/devices/system/cpu/cpu21/cache/index1/id:35
/sys/devices/system/cpu/cpu21/cache/index2/id:35
/sys/devices/system/cpu/cpu21/cache/index3/id:1
/sys/devices/system/cpu/cpu22/cache/index0/id:36
/sys/devices/system/cpu/cpu22/cache/index1/id:36
/sys/devices/system/cpu/cpu22/cache/index2/id:36
/sys/devices/system/cpu/cpu22/cache/index3/id:1
/sys/devices/system/cpu/cpu23/cache/index0/id:40
/sys/devices/system/cpu/cpu23/cache/index1/id:40
/sys/devices/system/cpu/cpu23/cache/index2/id:40
/sys/devices/system/cpu/cpu23/cache/index3/id:1
/sys/devices/system/cpu/cpu24/cache/index0/id:41
/sys/devices/system/cpu/cpu24/ca

RE: [PATCH] cacheinfo: Introduce cache id

2016-07-01 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Friday, July 01, 2016 11:40 AM
> On Fri, Jul 01, 2016 at 06:32:19PM +0000, Yu, Fenghua wrote:
> > We have the prefix "L3" or "L2" in the syntax; the id is for that level in each 
> > line.
> > line.b
> 
> Give me a full example, please, how the id is going to be used.

Intel Cache Allocation Technology allows an app to use only a portion of the L2 
or L3 cache. Each cache has its own array of allocation MSR registers, which 
contain CBMs (Cache Bit Masks) specifying which part of the cache is allocated.

We have a user interface that lets the user write CBM bits to a specific 
cache's MSR register array. In the current kernel, there is no way to identify 
a cache, so we propose the cache id for that purpose.

The CAT user interface is called schemas and has multiple lines. Each line 
describes one cache allocation CBM (or schema).

For an L3 schema on a two-socket system with two L3 caches, the user inputs:
L3:0=3;1=f
That means the user wants to allocate CBM 3 on L3 id 0 and CBM f on L3 id 1. 
The kernel takes this info, writes CBM 3 to one MSR register on the L3 cache 
with cache id 0, and writes CBM f to one MSR register on the other L3 cache 
with cache id 1.

The user can assign a task to this schema. When the task runs on a CPU that 
shares the L3 cache with cache id 0, CBM 3 is used (via another PQR_MSR 
register pointing to the CBM): a portion of that L3 is allocated to the task, 
and the task can only use that portion. When the task is scheduled on a CPU 
that shares the L3 cache with cache id 1, CBM f is used and a portion of that 
L3 is allocated to the task.

The current kernel stores cache info in the cacheinfo structure, and each 
cacheinfo is eventually exported in /sys/device/system/cpu/cpu#/cache/index#/. 
But there is no existing identification method that lets the user specify 
a cache. CAT wants to let the user specify a cache with a single id. That's 
why we propose the cache id in this patch.

Please note that socket#, core#, and combinations of them cannot be used to 
identify a cache architecturally. A lot of machines have one L3 per socket and 
one L2 per core, but architecturally one socket can have more than one L3, and 
an L2 can be shared by more than one core.

Thanks.

-Fenghua
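
A small user-space C sketch (not kernel code) of parsing one schema line like
"L3:0=3;1=f" into (cache id, CBM) pairs, as described above:

#include <stdio.h>
#include <string.h>

static void parse_schema_line(const char *line)
{
	char buf[256];
	char *tok, *save;

	strncpy(buf, line, sizeof(buf) - 1);
	buf[sizeof(buf) - 1] = '\0';

	char *res = strtok_r(buf, ":", &save);	/* resource name, e.g. "L3" */
	printf("resource %s\n", res);

	/* Each remaining "id=mask" token names one cache of that level. */
	while ((tok = strtok_r(NULL, ";", &save))) {
		unsigned int id;
		unsigned long cbm;

		if (sscanf(tok, "%u=%lx", &id, &cbm) == 2)
			printf("  cache id %u -> CBM 0x%lx\n", id, cbm);
	}
}

int main(void)
{
	parse_schema_line("L3:0=3;1=f");
	return 0;
}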


RE: [PATCH] cacheinfo: Introduce cache id

2016-07-01 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Friday, July 01, 2016 11:29 AM
> To: Luck, Tony <tony.l...@intel.com>
> Cc: Yu, Fenghua <fenghua...@intel.com>; Thomas Gleixner
> <t...@linutronix.de>; Ingo Molnar <mi...@elte.hu>; Anvin, H Peter
> <h.peter.an...@intel.com>; Peter Zijlstra <pet...@infradead.org>;
> Stephane Eranian <eran...@google.com>; Shankar, Ravi V
> <ravi.v.shan...@intel.com>; Vikas Shivappa
> <vikas.shiva...@linux.intel.com>; linux-kernel <linux-ker...@vger.kernel.org>; x86 <x...@kernel.org>
> Subject: Re: [PATCH] cacheinfo: Introduce cache id
> 
> On Fri, Jul 01, 2016 at 11:00:35AM -0700, Luck, Tony wrote:
> > For CAT we only need the IDs to be unique at each level. Our tentative
> > syntax for the schema file for CAT looks like this (for a theoretical
> > system supporting CAT in both L2 and L3 with two L3 caches and eight
> > L2 caches)
> >
> > L3:id0=fff;id1=ff0
> > L2:id0=3;id1=c;id2=30;id3=c0;id4=3;id5=c;id6=30;id7=c0
> 
> So wouldn't it be straightforward and natural to do the following
> nomenclature (which basically suggests itself):
> 
> ID.

We have the prefix "L3" or "L2" in the syntax; the id is for that level in each line.

> 
> ?
> 
> So that the ID hierarchy above is:
> 
> ID3.0 ID3.1
> ID2.0 ID2.1 ID2.2 ID2.3 ... ID2.7
> 
> I don't know if that's useful though.
> 
> I mean, we have that info in the path anyway:
> 
> /sys/devices/system/cpu/cpu0/cache/index3/id = 0xfff ...
> 
> and so on.
> 
> Whatever you do, as long as the nomenclature is documented somewhere,
> say Documentation/x86/topology.txt, for example, we should be fine.
> 
> --
> Regards/Gruss,
> Boris.
> 
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
> --


RE: [PATCH] cacheinfo: Introduce cache id

2016-07-01 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Friday, July 01, 2016 10:28 AM
> To: Luck, Tony <tony.l...@intel.com>
> Cc: Yu, Fenghua <fenghua...@intel.com>; Thomas Gleixner
> <t...@linutronix.de>; Ingo Molnar <mi...@elte.hu>; Anvin, H Peter
> <h.peter.an...@intel.com>; Peter Zijlstra <pet...@infradead.org>;
> Stephane Eranian <eran...@google.com>; Shankar, Ravi V
> <ravi.v.shan...@intel.com>; Vikas Shivappa
> <vikas.shiva...@linux.intel.com>; linux-kernel <linux-ker...@vger.kernel.org>; x86 <x...@kernel.org>
> Subject: Re: [PATCH] cacheinfo: Introduce cache id
> 
> On Fri, Jul 01, 2016 at 09:50:41AM -0700, Luck, Tony wrote:
> > Here's the situation.  We have lots of (mostly) independent caches on a
> system.
> > The L3 cache (also called LLC - Last Level Cache - in some
> > documentation) is usually shared by all cpus on a physical socket. The
> > L1 and L2 caches are typically local to a core, so only shared by the
> hyperthreads on a core.
> > But I say "usually" and "typically" because the architecture doesn't
> > require that. We could have multiple separate L3 caches on a socket
> > with a subset of cpus assigned to each of them. We could have an L2
> > cache that is shared by two or more cores.
> 
> Right, so I'm presuming we don't want to make it special to the LLC but be
> generic and have any cache level have an ID.
> 
> And, since we simply call it cache, how about we drop the "node" thing and
> simply talk about cache and it having an ID. We say it is unique on the system
> and the cache with ID X is local to only thread X and the cache with ID Y is
> shared by threads Y_0, ... Y_k. And so on...
> 
> How does that sound?

That's good for me.

> 
> Struct doc will have then:
> 
>  /**
>   * struct cacheinfo - represent a cache leaf node
> + * @id: This cache's ID. ID is unique on the platform.
> 
> ?
> 
> which begs the question, does get_cache_id() give unique IDs for all caches
> in the hierarchy on the system?

Cache id is unique on the same level of cache across the platform.

> 
> Because I tried this patch quickly in kvm and it was of course wrong but it
> hints at that important question:
> 
> cpu0/cache/index0/id:0
> cpu0/cache/index1/id:0
> cpu0/cache/index2/id:0
> cpu1/cache/index0/id:1
> cpu1/cache/index1/id:1
> cpu1/cache/index2/id:1
> cpu2/cache/index0/id:2
> cpu2/cache/index1/id:2
> cpu2/cache/index2/id:2
> cpu3/cache/index0/id:3
> cpu3/cache/index1/id:3
> cpu3/cache/index2/id:3
> cpu4/cache/index0/id:4
> cpu4/cache/index1/id:4
> cpu4/cache/index2/id:4
> cpu5/cache/index0/id:5
> cpu5/cache/index1/id:5
> cpu5/cache/index2/id:5
> cpu6/cache/index0/id:6
> cpu6/cache/index1/id:6
> cpu6/cache/index2/id:6
> cpu7/cache/index0/id:7
> cpu7/cache/index1/id:7
> cpu7/cache/index2/id:7
> 
> Basically all cache indices carry the APIC ID of the core, so L1D on
> CPU0 has ID 0 and then L1I has ID 0 too and then L2 has also the same ID.

The cache id should reflect the CPUs sharing a cache at one level, according to 
the SDM.

Could you check cpu#/cache/index#/shared_cpu_map or shared_cpu_list?
According to the cache ids, there is NO shared CPU at any level. The 
shared_cpu_map in your KVM should show that as well, I think. I would guess 
every shared_cpu_map at each cache level on each CPU contains only the CPU 
itself on your KVM.

> 
> How does that look on a CAT system? Do all the different cache levels get
> different IDs?

On my Broadwell EP, the cache id info is as follows:

From index3/id, there are 2 cache ids for L3.
L3 cache id 0 has shared_cpu_map: 00,0030,0003
L3 cache id 1 has shared_cpu_map: ff,ffcf,fffc
Similar for L2 and L1.

Cache id is unique on the same level across the platform.

#find /sys/device/system/cpu/. -name id|xargs cat
0
0
0
0
1
1
1
0
2
2
2
0
3
3
3
0
4
4
4
0
8
8
8
0
9
9
9
0
10
10
10
0
11
11
11
0
16
16
16
0
17
17
17
0
18
18
18
0
19
19
19
0
20
20
20
0
24
24
24
0
25
25
25
0
26
26
26
0
27
27
27
0
32
32
32
1
33
33
33
1
34
34
34
1
35
35
35
1
36
36
36
1
40
40
40
1
41
41
41
1
42
42
42
1
43
43
43
1
48
48
48
1
49
49
49
1
50
50
50
1
51
51
51
1
52
52
52
1
56
56
56
1
57
57
57
1
58
58
58
1
59
59
59
1
0
0
0
0
1
1
1
0
2
2
2
0
3
3
3
0
4
4
4
0
8
8
8
0
9
9
9
0
10
10
10
0
11
11
11
0
16
16
16
0
17
17
17
0
18
18
18
0
19
19
19
0
20
20
20
0
24
24
24
0
25
25
25
0
26
26
26
0
27
27
27
0
32
32
32
1
33
33
33
1
34
34
34
1
35
35
35
1
36
36
36
1
40
40
40
1
41
41
41
1
42
42
42
1
43
43
43
1
48
48
48
1
49
49
49
1
50
50
50
1
51
51
51
1
52
52
52
1
56
56
56
1
57
57
57
1
58
58
58
1
59
59
59
1


RE: [PATCH 2/7] crypto : async implementation for sha1-mb

2016-05-26 Thread Yu, Fenghua
> From: Dey, Megha
> Sent: Thursday, May 19, 2016 5:43 PM
> To: herb...@gondor.apana.org.au
> Cc: tim.c.c...@linux.intel.com; da...@davemloft.net; linux-
> cry...@vger.kernel.org; linux-kernel@vger.kernel.org; Dey, Megha
> <megha@intel.com>; Yu, Fenghua <fenghua...@intel.com>; Megha
> Dey <megha@linux.intel.com>
> Subject: [PATCH 2/7] crypto : async implementation for sha1-mb
> 
> From: Megha Dey <megha@linux.intel.com>
> 
> Herbert wants the sha1-mb algorithm to have an async implementation:
> https://lkml.org/lkml/2016/4/5/286.
> Currently, sha1-mb uses an async interface for the outer algorithm and a sync
> interface for the inner algorithm. This patch introduces an async interface for
> even the inner algorithm.
> 
> Signed-off-by: Megha Dey <megha@linux.intel.com>
> Signed-off-by: Tim Chen <tim.c.c...@linux.intel.com>
> ---
>  arch/x86/crypto/sha-mb/sha1_mb.c | 190 ++--
> ---
>  crypto/ahash.c   |   6 --
>  crypto/mcryptd.c | 117 +---
>  include/crypto/hash.h|   6 ++
>  include/crypto/internal/hash.h   |   8 +-
>  include/crypto/mcryptd.h |   8 +-
>  6 files changed, 184 insertions(+), 151 deletions(-)
> 

Hi, Herbert,

Any comment/feedback on this patch set? Is there any plan to push it 
upstream?

Thanks.

-Fenghua


RE: [PATCH v6 01/13] x86/xsaves: Define and use fpu_user_xstate_size

2016-05-11 Thread Yu, Fenghua
> From: Borislav Petkov [mailto:b...@suse.de]
> Sent: Wednesday, May 11, 2016 10:21 AM
> To: Yu, Yu-cheng <yu-cheng...@intel.com>
> Cc: linux-kernel@vger.kernel.org; x...@kernel.org; H. Peter Anvin
> <h...@zytor.com>; Thomas Gleixner <t...@linutronix.de>; Ingo Molnar
> <mi...@redhat.com>; Dave Hansen <dave.han...@linux.intel.com>; Andy
> Lutomirski <l...@kernel.org>; Prakhya, Sai Praneeth
> <sai.praneeth.prak...@intel.com>; Shankar, Ravi V
> <ravi.v.shan...@intel.com>; Yu, Fenghua <fenghua...@intel.com>
> Subject: Re: [PATCH v6 01/13] x86/xsaves: Define and use
> fpu_user_xstate_size
> 
> On Tue, May 10, 2016 at 04:29:53PM -0700, Yu-cheng Yu wrote:
> > The XSAVE area of kernel can be in standard or compacted format;
> 
> "The kernel xstate area... "
> 
> and can we call it the xstate area as there are a bunch of XSAVE* insns
> touching it. The file which deals with it is even called that:
> arch/x86/kernel/fpu/xstate.c
> 
> > it is always in standard format for user mode. When XSAVES is enabled,
> > the kernel uses the compacted format and it is necessary to use a
> > separate fpu_user_xstate_size for signal/ptrace frames.
> >
> > Based on an earlier patch from Fenghua Yu <fenghua...@intel.com>
> >
> > Signed-off-by: Fenghua Yu <fenghua...@intel.com>
> > [yu-cheng...@intel.com: rebase to current, rename to
> > fpu_user_xstate_size]
> > Signed-off-by: Yu-cheng Yu <yu-cheng...@intel.com>
> > Reviewed-by: Dave Hansen <dave.han...@intel.com>
> 
> Maybe I wasn't as clear as I hoped to be. Let me be more specific:
> 
> So you either need to do:
> 
> ---
> From: Fenghua
> 
> ...
> 
> Signed-off-by: Fenghua
> Signed-off-by: You
> ...
> ---
> 
> or
> 
> ---
> 
> Based on an earlier patch from Fenghua Yu <fenghua...@intel.com>.
> 
> Signed-off-by: You
> 
> ---
> 
> with the second variant making you the author implicitly because you're the
> sender.
> 
> Makes more sense this way?

Is it possible to have a third one?

From: Yu-cheng

Signed-off-by: Yu-cheng
Signed-off-by: Fenghua

Thanks.

-Fenghua
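
A sketch of the split the quoted commit message describes: fpu_user_xstate_size
is the variable the patch introduces, while the kernel-side counterpart shown
here for contrast is an assumption.

/*
 * The kernel's xstate area may be in compacted format (XSAVES), but
 * signal and ptrace frames always use the standard format, so user
 * frames need their own size.
 */
unsigned int fpu_kernel_xstate_size;	/* compacted when XSAVES is in use */
unsigned int fpu_user_xstate_size;	/* standard format, signal/ptrace frames */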


RE: [PATCH 2/7] crypto: sha256-mb - SHA256 multibuffer job manager and glue code

2016-04-08 Thread Yu, Fenghua
> From: Herbert Xu [mailto:herb...@gondor.apana.org.au]
> Sent: Tuesday, April 05, 2016 5:04 AM
> To: megha@linux.intel.com
> Cc: da...@davemloft.net; linux-cry...@vger.kernel.org; linux-
> ker...@vger.kernel.org; tim.c.c...@linux.intel.com; Yu, Fenghua
> <fenghua...@intel.com>; Dey, Megha <megha@intel.com>
> Subject: Re: [PATCH 2/7] crypto: sha256-mb - SHA256 multibuffer job
> manager and glue code
> 
> On Thu, Mar 24, 2016 at 01:25:58PM -0700, megha@linux.intel.com
> wrote:
> > From: Megha Dey <megha@intel.com>
> >
> > This patch introduces the multi-buffer job manager which is
> > responsible for submitting scatter-gather buffers from several SHA256
> > jobs to the multi-buffer algorithm. It also contains the flush routine
> > that's called by the crypto daemon to complete the job when no new
> > jobs arrive before the deadline of maximum latency of a SHA256 crypto job.
> >
> > The SHA256 multi-buffer crypto algorithm is defined and initialized in
> > this patch.
> >
> > Signed-off-by: Megha Dey <megha@linux.intel.com>
> > Reviewed-by: Fenghua Yu <fenghua...@intel.com>
> > Reviewed-by: Tim Chen <tim.c.c...@linux.intel.com>
> 
> sha1-mb still has the same issues that I complained about for the aes-mb
> submission.  In particular, I don't like the use of shash to handle what is 
> really
> an async implementation.
> 
> So I'd like to see this fixed first before we add any more copies of this 
> code.

We will update the patches and re-send the sha1/256/512 AVX2 mb algos using an 
async implementation.

Thanks.

-Fenghua


RE: [PATCH 0/7] crypto: SHA256 multibuffer implementation

2016-04-04 Thread Yu, Fenghua
> From: megha@linux.intel.com [mailto:megha@linux.intel.com]
> Sent: Thursday, March 24, 2016 1:26 PM
> To: herb...@gondor.apana.org.au; da...@davemloft.net
> Cc: linux-cry...@vger.kernel.org; linux-kernel@vger.kernel.org;
> tim.c.c...@linux.intel.com; Yu, Fenghua <fenghua...@intel.com>; Megha
> Dey <megha@linux.intel.com>
> Subject: [PATCH 0/7] crypto: SHA256 multibuffer implementation
> 
> From: Megha Dey <megha@linux.intel.com>
> 
> In this patch series, we introduce the multi-buffer crypto algorithm on
> x86_64 and apply it to SHA256 hash computation.  The multi-buffer technique
> takes advantage of the 8 data lanes in the AVX2 registers and allows
> computation to be performed on data from multiple jobs in parallel.
> This allows us to parallelize computations when data inter-dependency in a
> single crypto job prevents us from fully parallelizing our computations.
> The algorithm can be extended to other hashing and encryption schemes in
> the future.

Hi, Herbert and Dave,

Any comment on the patch set and the SHA512 patch set?

Are you planning on merging the patch sets into crypto tree and upstream?

Thanks.

-Fenghua


RE: [PATCH V16 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation

2016-01-04 Thread Yu, Fenghua
> From: Richard Weinberger [mailto:richard.weinber...@gmail.com]
> Sent: Saturday, January 02, 2016 2:54 PM
> On Mon, Dec 21, 2015 at 6:05 PM, Marcelo Tosatti 
> wrote:
> > OK cool hopefully that makes it clear to Fenghua Yu what must be
> > changed in the patchset.
> >
> >>  - http://lkml.iu.edu/hypermail/linux/kernel/1511.0/02375.html
> 
> ...beating a dead horse but this patch is in -next and breaks the ARCH=um
> build when enabled.
> 
>   UPD include/generated/compile.h
>   CC  init/version.o
>   LD  init/built-in.o
> kernel/built-in.o:(.rodata+0x2920): undefined reference to
> `intel_rdt_cgrp_subsys'
> collect2: error: ld returned 1 exit status
> make: *** [vmlinux] Error 1

Do you set CONFIG_INTEL_RDT=y?
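
(For context: with CONFIG_INTEL_RDT=y the cgroup core picks the controller up
via include/linux/cgroup_subsys.h, roughly:

	#if IS_ENABLED(CONFIG_INTEL_RDT)
	SUBSYS(intel_rdt)
	#endif

kernel/cgroup.c then references intel_rdt_cgrp_subsys from its subsystem
table, which is exactly the symbol the ARCH=um link above cannot resolve,
since the x86 file defining it is not built there.)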

Thanks.

-Fenghua


RE: [RFD] CAT user space interface revisited

2015-12-22 Thread Yu, Fenghua
> From: Thomas Gleixner [mailto:t...@linutronix.de]
> Sent: Wednesday, November 18, 2015 10:25 AM
> Folks!
> 
> After rereading the mail flood on CAT and staring into the SDM for a while, I
> think we all should sit back and look at it from scratch again w/o our
> preconceptions - I certainly had to put my own away.
> 
> Let's look at the properties of CAT again:
> 
>- It's a per socket facility
> 
>- CAT slots can be associated to external hardware. This
>  association is per socket as well, so different sockets can have
>  different behaviour. I missed that detail when staring the first
>  time, thanks for the pointer!
> 
>- The association itself is per cpu. The COS selection happens on a
>  CPU while the set of masks which are selected via COS are shared
>  by all CPUs on a socket.
> 
> There are restrictions which CAT imposes in terms of configurability:
> 
>- The bits which select a cache partition need to be consecutive
> 
>- The number of possible cache association masks is limited
> 
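(Side note: that consecutive-bits restriction is cheap to validate at the
interface; a minimal sketch, assuming the mask fits in a single unsigned long:

	static int cbm_is_contiguous(unsigned long cbm)
	{
		if (!cbm)
			return 0;			/* empty mask is invalid */
		cbm >>= __builtin_ctzl(cbm);		/* strip trailing zero bits */
		return (cbm & (cbm + 1)) == 0;		/* one contiguous run of 1s? */
	}

shifting off the trailing zeros must leave a value of the form 2^k - 1.)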
> Let's look at the configurations (CDP omitted and size restricted)
> 
> Default:    1 1 1 1 1 1 1 1
>             1 1 1 1 1 1 1 1
>             1 1 1 1 1 1 1 1
>             1 1 1 1 1 1 1 1
> 
> Shared:     1 1 1 1 1 1 1 1
>             0 0 1 1 1 1 1 1
>             0 0 0 0 1 1 1 1
>             0 0 0 0 0 0 1 1
> 
> Isolated:   1 1 1 1 0 0 0 0
>             0 0 0 0 1 1 0 0
>             0 0 0 0 0 0 1 0
>             0 0 0 0 0 0 0 1
> 
> Or any combination thereof. Surely some combinations will not make any
> sense, but we really should not make any restrictions on the stupidity of a
> sysadmin. The worst outcome might be L3 disabled for everything, so what?
> 
> Now that gets even more convoluted if CDP comes into play and we really
> need to look at CDP right now. We might end up with something which looks
> like this:
> 
>  1 1 1 1 0 0 0 0  Code
>  1 1 1 1 0 0 0 0  Data
>  0 0 0 0 0 0 1 0  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
> or
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 1 1 0 0  Data
>  0 0 0 0 0 0 0 1  Code
>  0 0 0 0 0 1 1 0  Data
> 
> Let's look at partitioning itself. We have two options:
> 
>1) Per task partitioning
> 
>2) Per CPU partitioning
> 
> So far we only talked about #1, but I think that #2 has a value as well.
> Let me give you a simple example.
> 
> Assume that you have isolated a CPU and run your important task on it. You
> give that task a slice of cache. Now that task needs kernel services which run
> in kernel threads on that CPU. We really don't want to (and cannot) hunt
> down random kernel threads (think cpu bound worker threads, softirq
> threads ) and give them another slice of cache. What we really want is:
> 
>    1 1 1 1 0 0 0 0   <- Default cache
>    0 0 0 0 1 1 1 0   <- Cache for important task
>    0 0 0 0 0 0 0 1   <- Cache for CPU of important task
> 
> It would even be sufficient for particular use cases to just associate a
> piece of cache to a given CPU and not bother with tasks at all.
> 
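(Mechanically, both per-task and per-CPU partitioning come down to which path
writes the per-CPU IA32_PQR_ASSOC MSR; a sketch, assuming the documented
layout with the RMID in the low 32 bits and the CLOSid in the high 32 bits:

	#include <asm/msr.h>

	#define MSR_IA32_PQR_ASSOC	0x0c8f

	/* Select a COS on the current CPU: lo = RMID, hi = CLOSid. */
	static void pqr_set_closid(u32 rmid, u32 closid)
	{
		wrmsr(MSR_IA32_PQR_ASSOC, rmid, closid);
	}

per-task would call this from the context-switch path, per-CPU would program
it once when the CPU's partition is configured.)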
> We really need to make this as configurable as possible from userspace
> without imposing random restrictions to it. I played around with it on my new
> intel toy and the restriction to 16 COS ids (that's 8 with CDP
> enabled) makes it really useless if we force the ids to have the same meaning
> on all sockets and restrict it to per task partitioning.
> 
> Even if next generation systems will have more COS ids available, there are
> not going to be enough to have a system wide consistent view unless we
> have COS ids > nr_cpus.
> 
> Aside of that I don't think that a system wide consistent view is useful at all.
> 
>  - If a task migrates between sockets, it's going to suffer anyway.
>Real sensitive applications will simply pin tasks on a socket to
>avoid that in the first place. If we make the whole thing
>configurable enough then the sysadmin can set it up to support
>even the nonsensical case of identical cache partitions on all
>sockets and let tasks use the corresponding partitions when
>migrating.
> 
>  - The number of cache slices is going to be limited no matter what,
>so one still has to come up with a sensible partitioning scheme.
> 
>  - Even if we have enough cos ids the system wide view will not make
>the configuration problem any simpler as it remains per socket.
> 
> It's hard. Policies are hard by definition, but this one is harder than most
> other policies due to the inherent limitations.
> 
> So now to the interface part. Unfortunately we need to expose this very
> close to the hardware implementation as there are really no abstractions
> which allow us to express the various bitmap combinations. Any abstraction I
> tried to come up with renders that thing completely useless.
> 
> I was not able to identify any existing 

RE: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation

2015-12-16 Thread Yu, Fenghua
> From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
> Sent: Wednesday, November 18, 2015 1:27 PM
> To: Yu, Fenghua 
> Cc: H Peter Anvin ; Ingo Molnar ;
> Thomas Gleixner ; Peter Zijlstra
> ; linux-kernel ; x86
> ; Vikas Shivappa 
> Subject: Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface
> to manage Intel cache allocation
> 
> On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> > Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> + /*
> +  * Try to get a reference for a different CLOSid and release the
> +  * reference to the current CLOSid.
> +  * Need to put down the reference here and get it back in case we
> +  * run out of closids. Otherwise we run into a problem when
> +  * we could be using the last closid that could have been available.
> +  */
> + closid_put(ir->closid);
> + if (cbm_search(cbmvalue, &closid)) {
> 
> Can't you move closid_put here?

No. This cannot be moved here.

If it's moved here, it won't work in the case where the current rdt group is
the only user of the closid. In that case, the closid will be released from
the cbm table and a new cbm will be allocated. So the closid_put() is in the
right place and can handle both the sole-usage case and the recycling case,
I think.

> 
> + ir->closid = closid;
> + closid_get(closid);
> + } else {
> + closid = ir->closid;
> 
> Variable unused.

You are right. I'll remove this statement.

> 
> + err = closid_alloc(&ir->closid);
> + if (err) {
> + closid_get(ir->closid);
> + goto out;
> + }
> 
> This makes you cycle closid when changing the cbm, not necessary.
> (not very important, but closid_put is nerving because it can possibly set
> l3_cbm to zero).

I think the current code is ok. If closid_put() sets l3_cbm to zero (i.e. the
closid only has this one usage), a new closid allocation will be started to
get a new closid.
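
Put differently, the intended flow is roughly this (a simplified sketch of
the logic, not the literal patch code):

	/*
	 * Drop our reference first so that, if we held the last one, the
	 * CLOSid (and its cbm slot) becomes reusable by the search/alloc
	 * below; reacquire it on the failure path.
	 */
	closid_put(ir->closid);
	if (cbm_search(cbmvalue, &closid)) {
		ir->closid = closid;		/* share an existing CLOSid */
		closid_get(closid);
	} else {
		err = closid_alloc(&ir->closid);
		if (err) {
			closid_get(ir->closid);	/* restore the dropped ref */
			goto out;
		}
		/* program cbmvalue for the new CLOSid here */
	}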

Thanks.

-Fenghua


RE: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface to manage Intel cache allocation

2015-12-14 Thread Yu, Fenghua
> From: Marcelo Tosatti [mailto:mtosa...@redhat.com]
> Sent: Wednesday, November 18, 2015 2:15 PM
> To: Yu, Fenghua 
> Cc: H Peter Anvin ; Ingo Molnar ;
> Thomas Gleixner ; Peter Zijlstra
> ; linux-kernel ; x86
> ; Vikas Shivappa 
> Subject: Re: [PATCH V15 11/11] x86,cgroup/intel_rdt : Add a cgroup interface
> to manage Intel cache allocation
> 
> On Thu, Oct 01, 2015 at 11:09:45PM -0700, Fenghua Yu wrote:
> > Add a new cgroup 'intel_rdt' to manage cache allocation. Each cgroup
> > directory is associated with a class of service id(closid). To map a
> > task with closid during scheduling, this patch removes the closid field
> > from task_struct and uses the already existing 'cgroups' field in
> > task_struct.
> >
> > +
> > +/*
> >   * intel_rdt_sched_in() - Writes the task's CLOSid to IA32_PQR_MSR
> >   *
> >   * Following considerations are made so that this has minimal impact
> >   * on scheduler hot path:
> >   * - This will stay as no-op unless we are running on an Intel SKU
> >   * which supports L3 cache allocation.
> > + * - When support is present and enabled, does not do any
> > + * IA32_PQR_MSR writes until the user starts really using the feature
> > + * i.e. creates a rdt cgroup directory and assigns a cache_mask that's
> > + * different from the root cgroup's cache_mask.
> >   * - Caches the per cpu CLOSid values and does the MSR write only
> > - * when a task with a different CLOSid is scheduled in.
> 
> Why is this even allowed?
> 
>   socket CBM bits:
> 
>    0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
>   [ | | | | | | | | | |  |  |  |  |  ]
> 
>    x x x x x x x                        (cgroupA)
>                        x  x  x  x  x    (cgroupB)
> 
>    x x x x x                            (cgroupA-A)
> 
> cgroupA.bits = [ 0 - 6 ], cgroupB.bits = [ 10 - 14 ]  (level 1)
> cgroupA-A.bits = [ 0 - 4 ]                            (level 2)
> 
> Two ways to create a cgroup with bits [ 0 - 4] set:
> 
> 1) Create a cgroup C in level 1 with a different name.
> Useful to have same cgroup with two different names.
> 
> 2) Create a cgroup A-B under cgroup-A with bits [0 - 4].
> 
> It just creates confusion, having two or more cgroups under
> different levels of the hierarchy with the same bits set.
> (can't see any organizational benefit).
> 
> Why not return -EINVAL ? Ah, cgroups are hierarchical, right.

I would let that situation be handled by the user space management tool; the
kernel should handle only the minimum. The management tool has more knowledge
about how to create CLOSIDs; the kernel only passes that info to the hardware.

Thanks.

-Fenghua
  1   2   3   >