Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
> Isn't a thread likely to clear the AVX registers at the end of a function > that uses them. > In particular this removes the massive overhead on certain cpus of > switching between two AVX modes. > So it is actually unlikely that XSAVE will need to save them at all? Only if context switches only happened on function boundaries, which is obviously not the case. Yes the detection mechanism is not 100% accurate, but if AVX is used significantly it should eventually detect it. Think of it as a statistical sampling heuristic. > > As I've also said before the registers are caller saved and since > systems calls are normal function calls the application code > would have to save them across a system call. > This allows the kernel to zero the registers on system call entry > again meaning that XSAVE won't normally have to save them. While I agree this would be nice, the Linux system call ABI wasn't defined like this, so it cannot be done at this point. -Andi
RE: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
From: Aubrey Li > Sent: 11 December 2018 00:25 > > User space tools which do automated task placement need information > about AVX-512 usage of tasks, because AVX-512 usage could cause core > turbo frequency drop and impact the running task on the sibling CPU. > > The XSAVE hardware structure has bits that indicate when valid state > is present in registers unique to AVX-512 use. Use these bits to > indicate when AVX-512 has been in use and add per-task AVX-512 state > tracking to context switch. Isn't a thread likely to clear the AVX registers at the end of a function that uses them. In particular this removes the massive overhead on certain cpus of switching between two AVX modes. So it is actually unlikely that XSAVE will need to save them at all? As I've also said before the registers are caller saved and since systems calls are normal function calls the application code would have to save them across a system call. This allows the kernel to zero the registers on system call entry again meaning that XSAVE won't normally have to save them. David - Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK Registration No: 1397386 (Wales)
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/11/18 4:59 PM, Li, Aubrey wrote: >> maybe instead of a 1/0 bit, it's useful to store the timestamp of the last >> time we found the task to use avx? (need to find a good time unit) >> >> > Are you suggesting kernel does not do any translation, just provide a fact > to the user space tool and let user space tool to decide how to use this info? > > So how does user space tool use this timestamp in your mind? Couple of things... If we have a timestamp, we don't need a decay. That means less tuning and also less work at runtime in the common case. Let's say we report milliseconds. The app itself looks at the number and could now say the following: 1. task A used AVX512 1ms ago, I might need to segregate it 2. task B used AVX512 1ms ago, time to stop segregating it 3. task C used AVX512 1ms ago, and was using it 100ms ago (during the last scan), it's a regular AVX512 user. That way, you don't have to *catch* tasks in the window between use and the end of the decay period.
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 8:14, Arjan van de Ven wrote: > On 12/11/2018 3:46 PM, Li, Aubrey wrote: >> On 2018/12/12 1:18, Dave Hansen wrote: >>> On 12/10/18 4:24 PM, Aubrey Li wrote: The tracking turns on the usage flag at the next context switch of the task, but requires 3 consecutive context switches with no usage to clear it. This decay is required because well-written AVX-512 applications are expected to clear this state when not actively using AVX-512 registers. >>> >>> One concern about this: Given a HZ=1000 system, this means that the >>> flag needs to get scanned every ~3ms. That's a pretty good amount of >>> scanning on a system with hundreds or thousands of tasks running around. >>> >>> How many tasks does this scale to until you're eating up an entire CPU >>> or two just scanning /proc? >>> >> >> Do we have a real requirement to do this in practical environment? >> AFAIK, 1s or even 5s is good enough in some customers environment. > > maybe instead of a 1/0 bit, it's useful to store the timestamp of the last > time we found the task to use avx? (need to find a good time unit) > > Are you suggesting kernel does not do any translation, just provide a fact to the user space tool and let user space tool to decide how to use this info? So how does user space tool use this timestamp in your mind? Thanks, -Aubrey
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/11/18 4:34 PM, Li, Aubrey wrote: >> Is there a reason we shouldn't do: >> >> if (!cpu_feature_enabled(X86_FEATURE_AVX512F)) >> update_avx512_state(fpu); >> >> ? >> > Why _!_ ? Sorry, got it backwards. I think I was considering having you do a if (!cpu_feature_enabled(X86_FEATURE_AVX512F)) return; inside of update_avx512_state(), but I got the state mixed up in my head. You don't need the '!'.
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 1:20, Dave Hansen wrote: > to update AVX512 state >> + */ >> +static inline void update_avx512_state(struct fpu *fpu) >> +{ >> +/* >> + * AVX512 state is tracked here because its use is known to slow >> + * the max clock speed of the core. >> + * >> + * However, AVX512-using tasks are expected to clear this state when >> + * not actively using these registers. Thus, this tracking mechanism >> + * can miss. To ensure that false-negatives do not immediately show >> + * up, decay the usage count over time. >> + */ >> +if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) >> +fpu->avx512_usage = AVX512_STATE_DECAY_COUNT; >> +else if (fpu->avx512_usage) >> +fpu->avx512_usage--; >> +} >> + >> /* >> * This function is called only during boot time when x86 caps are not set >> * up and alternative can not be used yet. >> @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu) >> { >> if (likely(use_xsave())) { >> copy_xregs_to_kernel(>state.xsave); >> +update_avx512_state(fpu); >> return 1; >> } > > > Is there a reason we shouldn't do: > > if (!cpu_feature_enabled(X86_FEATURE_AVX512F)) > update_avx512_state(fpu); > > ? > Why _!_ ?
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/11/2018 3:46 PM, Li, Aubrey wrote: On 2018/12/12 1:18, Dave Hansen wrote: On 12/10/18 4:24 PM, Aubrey Li wrote: The tracking turns on the usage flag at the next context switch of the task, but requires 3 consecutive context switches with no usage to clear it. This decay is required because well-written AVX-512 applications are expected to clear this state when not actively using AVX-512 registers. One concern about this: Given a HZ=1000 system, this means that the flag needs to get scanned every ~3ms. That's a pretty good amount of scanning on a system with hundreds or thousands of tasks running around. How many tasks does this scale to until you're eating up an entire CPU or two just scanning /proc? Do we have a real requirement to do this in practical environment? AFAIK, 1s or even 5s is good enough in some customers environment. maybe instead of a 1/0 bit, it's useful to store the timestamp of the last time we found the task to use avx? (need to find a good time unit)
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 2018/12/12 1:18, Dave Hansen wrote: > On 12/10/18 4:24 PM, Aubrey Li wrote: >> The tracking turns on the usage flag at the next context switch of >> the task, but requires 3 consecutive context switches with no usage >> to clear it. This decay is required because well-written AVX-512 >> applications are expected to clear this state when not actively using >> AVX-512 registers. > > One concern about this: Given a HZ=1000 system, this means that the > flag needs to get scanned every ~3ms. That's a pretty good amount of > scanning on a system with hundreds or thousands of tasks running around. > > How many tasks does this scale to until you're eating up an entire CPU > or two just scanning /proc? > Do we have a real requirement to do this in practical environment? AFAIK, 1s or even 5s is good enough in some customers environment. Thanks, -Aubrey
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On Tue, Dec 11, 2018 at 09:18:25AM -0800, Dave Hansen wrote: > On 12/10/18 4:24 PM, Aubrey Li wrote: > > The tracking turns on the usage flag at the next context switch of > > the task, but requires 3 consecutive context switches with no usage > > to clear it. This decay is required because well-written AVX-512 > > applications are expected to clear this state when not actively using > > AVX-512 registers. > > One concern about this: Given a HZ=1000 system, this means that the > flag needs to get scanned every ~3ms. That's a pretty good amount of > scanning on a system with hundreds or thousands of tasks running around. > > How many tasks does this scale to until you're eating up an entire CPU > or two just scanning /proc? Yes that's why we may need to propagate it to cgroups in the kernel, because user daemons don't really want to track every TID. But per pid is a start so that people can start experimenting with this. Then with some experience fancier interfaces for it can be implemented. -Andi
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/11/18 9:18 AM, Dave Hansen wrote: > On 12/10/18 4:24 PM, Aubrey Li wrote: >> The tracking turns on the usage flag at the next context switch of >> the task, but requires 3 consecutive context switches with no usage >> to clear it. This decay is required because well-written AVX-512 >> applications are expected to clear this state when not actively using >> AVX-512 registers. > > One concern about this: Given a HZ=1000 system, this means that the > flag needs to get scanned every ~3ms. That's a pretty good amount of > scanning on a system with hundreds or thousands of tasks running around. > > How many tasks does this scale to until you're eating up an entire CPU > or two just scanning /proc? > The reason why I believe that an exponential decay average, or some other kind of weighted decay of the AVX512 usage is better. Tim
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
to update AVX512 state > + */ > +static inline void update_avx512_state(struct fpu *fpu) > +{ > + /* > + * AVX512 state is tracked here because its use is known to slow > + * the max clock speed of the core. > + * > + * However, AVX512-using tasks are expected to clear this state when > + * not actively using these registers. Thus, this tracking mechanism > + * can miss. To ensure that false-negatives do not immediately show > + * up, decay the usage count over time. > + */ > + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) > + fpu->avx512_usage = AVX512_STATE_DECAY_COUNT; > + else if (fpu->avx512_usage) > + fpu->avx512_usage--; > +} > + > /* > * This function is called only during boot time when x86 caps are not set > * up and alternative can not be used yet. > @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu) > { > if (likely(use_xsave())) { > copy_xregs_to_kernel(>state.xsave); > + update_avx512_state(fpu); > return 1; > } Is there a reason we shouldn't do: if (!cpu_feature_enabled(X86_FEATURE_AVX512F)) update_avx512_state(fpu); ?
Re: [PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
On 12/10/18 4:24 PM, Aubrey Li wrote: > The tracking turns on the usage flag at the next context switch of > the task, but requires 3 consecutive context switches with no usage > to clear it. This decay is required because well-written AVX-512 > applications are expected to clear this state when not actively using > AVX-512 registers. One concern about this: Given a HZ=1000 system, this means that the flag needs to get scanned every ~3ms. That's a pretty good amount of scanning on a system with hundreds or thousands of tasks running around. How many tasks does this scale to until you're eating up an entire CPU or two just scanning /proc?
[PATCH v4 1/2] x86/fpu: track AVX-512 usage of tasks
User space tools which do automated task placement need information about AVX-512 usage of tasks, because AVX-512 usage could cause core turbo frequency drop and impact the running task on the sibling CPU. The XSAVE hardware structure has bits that indicate when valid state is present in registers unique to AVX-512 use. Use these bits to indicate when AVX-512 has been in use and add per-task AVX-512 state tracking to context switch. The tracking turns on the usage flag at the next context switch of the task, but requires 3 consecutive context switches with no usage to clear it. This decay is required because well-written AVX-512 applications are expected to clear this state when not actively using AVX-512 registers. Although this mechanism is imprecise and can theoretically have both false-positives and false-negatives, it has been measured to be precise enough to be useful under real-world workloads like tensorflow and linpack. If higher precision is required, suggest user space tools to use the PMU-based mechanisms in combination. Signed-off-by: Aubrey Li Cc: Peter Zijlstra Cc: Andi Kleen Cc: Tim Chen Cc: Dave Hansen Cc: Arjan van de Ven --- arch/x86/include/asm/fpu/internal.h | 22 ++ arch/x86/include/asm/fpu/types.h| 8 2 files changed, 30 insertions(+) diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h index a38bf5a1e37a..0da74d63ba14 100644 --- a/arch/x86/include/asm/fpu/internal.h +++ b/arch/x86/include/asm/fpu/internal.h @@ -275,6 +275,27 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu) : "D" (st), "m" (*st), "a" (lmask), "d" (hmask)\ : "memory") +#defineAVX512_STATE_DECAY_COUNT3 +/* + * This function is called during context switch to update AVX512 state + */ +static inline void update_avx512_state(struct fpu *fpu) +{ + /* +* AVX512 state is tracked here because its use is known to slow +* the max clock speed of the core. +* +* However, AVX512-using tasks are expected to clear this state when +* not actively using these registers. Thus, this tracking mechanism +* can miss. To ensure that false-negatives do not immediately show +* up, decay the usage count over time. +*/ + if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512) + fpu->avx512_usage = AVX512_STATE_DECAY_COUNT; + else if (fpu->avx512_usage) + fpu->avx512_usage--; +} + /* * This function is called only during boot time when x86 caps are not set * up and alternative can not be used yet. @@ -411,6 +432,7 @@ static inline int copy_fpregs_to_fpstate(struct fpu *fpu) { if (likely(use_xsave())) { copy_xregs_to_kernel(>state.xsave); + update_avx512_state(fpu); return 1; } diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h index 202c53918ecf..313b134d3ca3 100644 --- a/arch/x86/include/asm/fpu/types.h +++ b/arch/x86/include/asm/fpu/types.h @@ -302,6 +302,14 @@ struct fpu { */ unsigned char initialized; + /* +* @avx512_usage: +* +* Records the usage of AVX512 registers. A value of non-zero is used +* to indicate whether these AVX512 registers recently had valid state. +*/ + unsigned char avx512_usage; + /* * @state: * -- 2.17.1