Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-12 Thread Andi Kleen
> Isn't a thread likely to clear the AVX registers at the end of a function
> that uses them.
> In particular this removes the massive overhead on certain cpus of
> switching between two AVX modes.
> So it is actually unlikely that XSAVE will need to save them at all?

Only if context switches only happened on function boundaries, which
is obviously not the case.

Yes the detection mechanism is not 100% accurate, but if AVX
is used significantly it should eventually detect it. 
Think of it as a statistical sampling heuristic.

> 
> As I've also said before the registers are caller saved and since
> systems calls are normal function calls the application code
> would have to save them across a system call.
> This allows the kernel to zero the registers on system call entry
> again meaning that XSAVE won't normally have to save them.

While I agree this would be nice, the Linux system call ABI 
wasn't defined like this, so it cannot be done at this point.

-Andi



RE: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-12 Thread David Laight
From: Aubrey Li
> Sent: 11 December 2018 00:25
> 
> User space tools which do automated task placement need information
> about AVX-512 usage of tasks, because AVX-512 usage could cause core
> turbo frequency drop and impact the running task on the sibling CPU.
> 
> The XSAVE hardware structure has bits that indicate when valid state
> is present in registers unique to AVX-512 use.  Use these bits to
> indicate when AVX-512 has been in use and add per-task AVX-512 state
> tracking to context switch.

Isn't a thread likely to clear the AVX registers at the end of a function
that uses them.
In particular this removes the massive overhead on certain cpus of
switching between two AVX modes.
So it is actually unlikely that XSAVE will need to save them at all?

As I've also said before the registers are caller saved and since
systems calls are normal function calls the application code
would have to save them across a system call.
This allows the kernel to zero the registers on system call entry
again meaning that XSAVE won't normally have to save them.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, 
UK
Registration No: 1397386 (Wales)



Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Dave Hansen
On 12/11/18 4:59 PM, Li, Aubrey wrote:
>> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
>> time we found the task to use avx? (need to find a good time unit)
>>
>>
> Are you suggesting kernel does not do any translation, just provide a fact
> to the user space tool and let user space tool to decide how to use this info?
> 
> So how does user space tool use this timestamp in your mind?

Couple of things...

If we have a timestamp, we don't need a decay.  That means less tuning
and also less work at runtime in the common case.

Let's say we report milliseconds.  The app itself looks at the number
and could now say the following:

1. task A used AVX512 1ms ago, I might need to segregate it
2. task B used AVX512 1ms ago, time to stop segregating it
3. task C used AVX512 1ms ago, and was using it 100ms ago (during the
   last scan), it's a regular AVX512 user.

That way, you don't have to *catch* tasks in the window between use and
the end of the decay period.


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Li, Aubrey
On 2018/12/12 8:14, Arjan van de Ven wrote:
> On 12/11/2018 3:46 PM, Li, Aubrey wrote:
>> On 2018/12/12 1:18, Dave Hansen wrote:
>>> On 12/10/18 4:24 PM, Aubrey Li wrote:
 The tracking turns on the usage flag at the next context switch of
 the task, but requires 3 consecutive context switches with no usage
 to clear it. This decay is required because well-written AVX-512
 applications are expected to clear this state when not actively using
 AVX-512 registers.
>>>
>>> One concern about this:  Given a HZ=1000 system, this means that the
>>> flag needs to get scanned every ~3ms.  That's a pretty good amount of
>>> scanning on a system with hundreds or thousands of tasks running around.
>>>
>>> How many tasks does this scale to until you're eating up an entire CPU
>>> or two just scanning /proc?
>>>
>>
>> Do we have a real requirement to do this in practical environment?
>> AFAIK, 1s or even 5s is good enough in some customers environment.
> 
> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
> time we found the task to use avx? (need to find a good time unit)
> 
> 
Are you suggesting kernel does not do any translation, just provide a fact
to the user space tool and let user space tool to decide how to use this info?

So how does user space tool use this timestamp in your mind?

Thanks,
-Aubrey


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Dave Hansen
On 12/11/18 4:34 PM, Li, Aubrey wrote:
>> Is there a reason we shouldn't do:
>>
>>  if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
>>  update_avx512_state(fpu);
>>
>> ?
>>
> Why _!_ ?

Sorry, got it backwards.  I think I was considering having you do a

if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
return;

inside of update_avx512_state(), but I got the state mixed up in my head.

You don't need the '!'.


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Li, Aubrey
On 2018/12/12 1:20, Dave Hansen wrote:
>  to update AVX512 state
>> + */
>> +static inline void update_avx512_state(struct fpu *fpu)
>> +{
>> +/*
>> + * AVX512 state is tracked here because its use is known to slow
>> + * the max clock speed of the core.
>> + *
>> + * However, AVX512-using tasks are expected to clear this state when
>> + * not actively using these registers. Thus, this tracking mechanism
>> + * can miss. To ensure that false-negatives do not immediately show
>> + * up, decay the usage count over time.
>> + */
>> +if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
>> +fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
>> +else if (fpu->avx512_usage)
>> +fpu->avx512_usage--;
>> +}
>> +
>>  /*
>>   * This function is called only during boot time when x86 caps are not set
>>   * up and alternative can not be used yet.
>> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
>>  {
>>  if (likely(use_xsave())) {
>>  copy_xregs_to_kernel(>state.xsave);
>> +update_avx512_state(fpu);
>>  return 1;
>>  }
> 
> 
> Is there a reason we shouldn't do:
> 
>   if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
>   update_avx512_state(fpu);
> 
> ?
> 

Why _!_ ?




Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Arjan van de Ven

On 12/11/2018 3:46 PM, Li, Aubrey wrote:

On 2018/12/12 1:18, Dave Hansen wrote:

On 12/10/18 4:24 PM, Aubrey Li wrote:

The tracking turns on the usage flag at the next context switch of
the task, but requires 3 consecutive context switches with no usage
to clear it. This decay is required because well-written AVX-512
applications are expected to clear this state when not actively using
AVX-512 registers.


One concern about this:  Given a HZ=1000 system, this means that the
flag needs to get scanned every ~3ms.  That's a pretty good amount of
scanning on a system with hundreds or thousands of tasks running around.

How many tasks does this scale to until you're eating up an entire CPU
or two just scanning /proc?



Do we have a real requirement to do this in practical environment?
AFAIK, 1s or even 5s is good enough in some customers environment.


maybe instead of a 1/0 bit, it's useful to store the timestamp of the last
time we found the task to use avx? (need to find a good time unit)



Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Li, Aubrey
On 2018/12/12 1:18, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
> 

Do we have a real requirement to do this in practical environment?
AFAIK, 1s or even 5s is good enough in some customers environment.

Thanks,
-Aubrey


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Andi Kleen
On Tue, Dec 11, 2018 at 09:18:25AM -0800, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
> > The tracking turns on the usage flag at the next context switch of
> > the task, but requires 3 consecutive context switches with no usage
> > to clear it. This decay is required because well-written AVX-512
> > applications are expected to clear this state when not actively using
> > AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?

Yes that's why we may need to propagate it to cgroups in the kernel,
because user daemons don't really want to track every TID.

But per pid is a start so that people can start experimenting with this.

Then with some experience fancier interfaces for it can be implemented.

-Andi


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Tim Chen



On 12/11/18 9:18 AM, Dave Hansen wrote:
> On 12/10/18 4:24 PM, Aubrey Li wrote:
>> The tracking turns on the usage flag at the next context switch of
>> the task, but requires 3 consecutive context switches with no usage
>> to clear it. This decay is required because well-written AVX-512
>> applications are expected to clear this state when not actively using
>> AVX-512 registers.
> 
> One concern about this:  Given a HZ=1000 system, this means that the
> flag needs to get scanned every ~3ms.  That's a pretty good amount of
> scanning on a system with hundreds or thousands of tasks running around.
> 
> How many tasks does this scale to until you're eating up an entire CPU
> or two just scanning /proc?
> 

The reason why I believe that an exponential decay average, or some
other kind of weighted decay of the AVX512 usage is better.

Tim



Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Dave Hansen
 to update AVX512 state
> + */
> +static inline void update_avx512_state(struct fpu *fpu)
> +{
> + /*
> +  * AVX512 state is tracked here because its use is known to slow
> +  * the max clock speed of the core.
> +  *
> +  * However, AVX512-using tasks are expected to clear this state when
> +  * not actively using these registers. Thus, this tracking mechanism
> +  * can miss. To ensure that false-negatives do not immediately show
> +  * up, decay the usage count over time.
> +  */
> + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
> + fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
> + else if (fpu->avx512_usage)
> + fpu->avx512_usage--;
> +}
> +
>  /*
>   * This function is called only during boot time when x86 caps are not set
>   * up and alternative can not be used yet.
> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
>  {
>   if (likely(use_xsave())) {
>   copy_xregs_to_kernel(>state.xsave);
> + update_avx512_state(fpu);
>   return 1;
>   }


Is there a reason we shouldn't do:

if (!cpu_feature_enabled(X86_FEATURE_AVX512F))
update_avx512_state(fpu);

?


Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-11 Thread Dave Hansen
On 12/10/18 4:24 PM, Aubrey Li wrote:
> The tracking turns on the usage flag at the next context switch of
> the task, but requires 3 consecutive context switches with no usage
> to clear it. This decay is required because well-written AVX-512
> applications are expected to clear this state when not actively using
> AVX-512 registers.

One concern about this:  Given a HZ=1000 system, this means that the
flag needs to get scanned every ~3ms.  That's a pretty good amount of
scanning on a system with hundreds or thousands of tasks running around.

How many tasks does this scale to until you're eating up an entire CPU
or two just scanning /proc?


[PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks

2018-12-10 Thread Aubrey Li
User space tools which do automated task placement need information
about AVX-512 usage of tasks, because AVX-512 usage could cause core
turbo frequency drop and impact the running task on the sibling CPU.

The XSAVE hardware structure has bits that indicate when valid state
is present in registers unique to AVX-512 use.  Use these bits to
indicate when AVX-512 has been in use and add per-task AVX-512 state
tracking to context switch.

The tracking turns on the usage flag at the next context switch of
the task, but requires 3 consecutive context switches with no usage
to clear it. This decay is required because well-written AVX-512
applications are expected to clear this state when not actively using
AVX-512 registers.

Although this mechanism is imprecise and can theoretically have both
false-positives and false-negatives, it has been measured to be precise
enough to be useful under real-world workloads like tensorflow and linpack.

If higher precision is required, suggest user space tools to use the
PMU-based mechanisms in combination.

Signed-off-by: Aubrey Li 
Cc: Peter Zijlstra 
Cc: Andi Kleen 
Cc: Tim Chen 
Cc: Dave Hansen 
Cc: Arjan van de Ven 
---
 arch/x86/include/asm/fpu/internal.h | 22 ++
 arch/x86/include/asm/fpu/types.h|  8 
 2 files changed, 30 insertions(+)

diff --git a/arch/x86/include/asm/fpu/internal.h 
b/arch/x86/include/asm/fpu/internal.h
index a38bf5a1e37a..0da74d63ba14 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -275,6 +275,27 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)\
 : "memory")
 
+#defineAVX512_STATE_DECAY_COUNT3
+/*
+ * This function is called during context switch to update AVX512 state
+ */
+static inline void update_avx512_state(struct fpu *fpu)
+{
+   /*
+* AVX512 state is tracked here because its use is known to slow
+* the max clock speed of the core.
+*
+* However, AVX512-using tasks are expected to clear this state when
+* not actively using these registers. Thus, this tracking mechanism
+* can miss. To ensure that false-negatives do not immediately show
+* up, decay the usage count over time.
+*/
+   if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+   fpu->avx512_usage = AVX512_STATE_DECAY_COUNT;
+   else if (fpu->avx512_usage)
+   fpu->avx512_usage--;
+}
+
 /*
  * This function is called only during boot time when x86 caps are not set
  * up and alternative can not be used yet.
@@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
if (likely(use_xsave())) {
copy_xregs_to_kernel(>state.xsave);
+   update_avx512_state(fpu);
return 1;
}
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 202c53918ecf..313b134d3ca3 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -302,6 +302,14 @@ struct fpu {
 */
unsigned char   initialized;
 
+   /*
+* @avx512_usage:
+*
+* Records the usage of AVX512 registers. A value of non-zero is used
+* to indicate whether these AVX512 registers recently had valid state.
+*/
+   unsigned char   avx512_usage;
+
/*
 * @state:
 *
-- 
2.17.1