Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Christophe, Christophe Leroy writes: > On 01/09/2020 02:05 PM, Thomas Gleixner wrote: >> The reason why this is implemented in this way is that >> __arch_get_hw_counter() needs a way to express that the clocksource of >> the moment is not suitable for VDSO so that the syscall fallback gets >> invoked. >> >> Sure we could have used a pointer for the value and a return value >> indicating the validity, but given the required uptime the resulting >> code overhead seemed to be not worth it. At least not for me as I'm not >> planning to be around 58 years from now :) >> > > I managed to get better code and better performance by splitting out the > validity check as follows. Would it be suitable for all arches ? A quick test on x86 shows only a minimal impact, but it's in the noise. So from my side that's fine, but I can't talk for ARM[64]/MIPS > diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h > b/arch/powerpc/include/asm/vdso/gettimeofday.h > index 689f51b0d8c9..11cdd6faa4ad 100644 > --- a/arch/powerpc/include/asm/vdso/gettimeofday.h > +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h > @@ -114,15 +114,17 @@ int clock_getres32_fallback(clockid_t _clkid, > struct old_timespec32 *_ts) > return ret; > } > > -static __always_inline u64 __arch_get_hw_counter(s32 clock_mode) > +static __always_inline bool __arch_is_hw_counter_valid(s32 clock_mode) > { > /* >* clock_mode == 0 implies that vDSO are enabled otherwise >* fallback on syscall. >*/ > - if (clock_mode) > - return ULLONG_MAX; > + return clock_mode ? false : true; I don't think we need an arch specific function here. I rather convert the boolean of ARM[64] to the x86/MIPS way of VCLOCK_* modes and let the generic code check for clock_mode == VCLOCK_NONE. There is some magic in ARM and MIPS which wants to be able to disable the whole thing at compile time, but that can be handled in the generic code with just an extra config switch. I'll have a stab at that tomorrow. Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Hi Thomas, On 01/09/2020 02:05 PM, Thomas Gleixner wrote: Christophe! Christophe Leroy writes: In do_hres(), I see: cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; if (unlikely((s64)cycles < 0)) return -1; __arch_get_hw_counter() returns a u64 values. On the PPC, this is read from the timebase which is a 64 bits counter. Why returning -1 if (s64)cycles < 0 ? Does it means we have to mask out the most significant bit when reading the HW counter ? Only if you expect the HW counter to reach a value which has bit 63 set. That'd require: uptime counter frequency ~292 years 1GHz ~ 58 years 5GHz assumed that the HW counter starts at 0 when the box is powered on. The reason why this is implemented in this way is that __arch_get_hw_counter() needs a way to express that the clocksource of the moment is not suitable for VDSO so that the syscall fallback gets invoked. Sure we could have used a pointer for the value and a return value indicating the validity, but given the required uptime the resulting code overhead seemed to be not worth it. At least not for me as I'm not planning to be around 58 years from now :) I managed to get better code and better performance by splitting out the validity check as follows. Would it be suitable for all arches ? diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h index 689f51b0d8c9..11cdd6faa4ad 100644 --- a/arch/powerpc/include/asm/vdso/gettimeofday.h +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h @@ -114,15 +114,17 @@ int clock_getres32_fallback(clockid_t _clkid, struct old_timespec32 *_ts) return ret; } -static __always_inline u64 __arch_get_hw_counter(s32 clock_mode) +static __always_inline bool __arch_is_hw_counter_valid(s32 clock_mode) { /* * clock_mode == 0 implies that vDSO are enabled otherwise * fallback on syscall. */ - if (clock_mode) - return ULLONG_MAX; + return clock_mode ? false : true; +} +static __always_inline u64 __arch_get_hw_counter(s32 clock_mode) +{ return get_tb(); } diff --git a/lib/vdso/gettimeofday.c b/lib/vdso/gettimeofday.c index ee9da52a3e02..90bb5dfd0db0 100644 --- a/lib/vdso/gettimeofday.c +++ b/lib/vdso/gettimeofday.c @@ -46,11 +46,12 @@ static inline int do_hres(const struct vdso_data *vd, clockid_t clk, do { seq = vdso_read_begin(vd); + if (!__arch_is_hw_counter_valid(vd->clock_mode)) + return -1; + cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; - if (unlikely((s64)cycles < 0)) - return -1; ns += vdso_calc_delta(cycles, last, vd->mask, vd->mult); ns >>= vd->shift; Thanks Christophe
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Christophe! Christophe Leroy writes: > In do_hres(), I see: > > cycles = __arch_get_hw_counter(vd->clock_mode); > ns = vdso_ts->nsec; > last = vd->cycle_last; > if (unlikely((s64)cycles < 0)) > return -1; > > __arch_get_hw_counter() returns a u64 values. On the PPC, this is read > from the timebase which is a 64 bits counter. > > Why returning -1 if (s64)cycles < 0 ? Does it means we have to mask out > the most significant bit when reading the HW counter ? Only if you expect the HW counter to reach a value which has bit 63 set. That'd require: uptime counter frequency ~292 years 1GHz ~ 58 years 5GHz assumed that the HW counter starts at 0 when the box is powered on. The reason why this is implemented in this way is that __arch_get_hw_counter() needs a way to express that the clocksource of the moment is not suitable for VDSO so that the syscall fallback gets invoked. Sure we could have used a pointer for the value and a return value indicating the validity, but given the required uptime the resulting code overhead seemed to be not worth it. At least not for me as I'm not planning to be around 58 years from now :) Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Hi Thomas, In do_hres(), I see: cycles = __arch_get_hw_counter(vd->clock_mode); ns = vdso_ts->nsec; last = vd->cycle_last; if (unlikely((s64)cycles < 0)) return -1; __arch_get_hw_counter() returns a u64 values. On the PPC, this is read from the timebase which is a 64 bits counter. Why returning -1 if (s64)cycles < 0 ? Does it means we have to mask out the most significant bit when reading the HW counter ? Christophe
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Sun, Oct 27, 2019 at 10:21:25AM +0100, Christophe Leroy wrote: > Le 27/10/2019 à 01:06, Segher Boessenkool a écrit : > >The hand-optimised asm code will pretty likely win handsomely, whatever > >you do. Especially on cores like the 885 (no branch prediction, single > >issue, small caches, etc.: every instruction counts). > > > >Is there any reason to replace this hand-optimised code? It was written > >for exacty this reason? These functions are critical and should be as > >fast as possible. > > Well, all this started with COARSE clocks not being supported by PPC32 > VDSO. I first submitted a series with a set of optimisations including > the implementation of COARSE clocks > (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=126779) > > Then after a comment received on patch 4 of the series from Santosh > Sivaraj asking for a common implementation of it for PPC32 and PPC64, I > started looking into making the whole VDSO source code common to PPC32 > and PPC64. Most functions are similar. Time functions are also rather > similar but unfortunately don't use the same registers. They also don't > cover all possible clocks. And getres() is also buggy, see series > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110321 That is all nice work :-) > So instead of reworking the existing time functions, I started > investigating whether we could plug powerpc to the generic > implementation. One drawback of PPC is that we need to setup an ASM > trampoline to handle the SO bit as it can't be handled from C directly, > can it ? There is no way to say what CR bits to return. The ABI requires some of those bits to be preserved, and some are volatile. System calls use a different ABI, one the compiler knows nothing about, so you cannot even show system calls as calls to the compiler. > How critical are these functions ? Although we have a slight degration > with the C implementation, they are still way faster than the > corresponding syscall. "Slight": With current powerpc/32 ASM VDSO: gettimeofday:vdso: 750 nsec/call clock-getres-realtime:vdso: 382 nsec/call clock-gettime-realtime:vdso: 928 nsec/call clock-getres-monotonic:vdso: 382 nsec/call clock-gettime-monotonic:vdso: 1033 nsec/call Once switched to C implementation: gettimeofday:vdso: 1533 nsec/call clock-getres-realtime:vdso: 853 nsec/call clock-gettime-realtime:vdso: 1570 nsec/call clock-getres-monotonic:vdso: 835 nsec/call clock-gettime-monotonic:vdso: 1605 nsec/call ---> Those that are not more than two times slower are almost that. <--- This also needs measurements on more representative PowerPC cores, say some G3 or G4; and on modern CPUs (Power7/8/9). It also needs context with those measurements: what CPU core is it? Running at what frequency clock? > Another thing I was wondering, is it worth using the 64 bit timebase on > PPC32 ? As far as I understand, the timebase is there to calculate a > linear date update since last VDSO datapage update. How often is the > VDSO datapage updated ? On the 885 clocked at 132Mhz, the timebase is at > 8.25 Mhz, which means it needs more than 8 minutes to loop over 32 bits. On most PowerPC cores the time base is incremented significantly faster. Usual speeds for older cores are 50MHz to 100MHz, and for newer cores ten times that. Recommended frequency is currently 512MHz, so you'll wrap the low 32 bits in 8s or so on those, and in about a minute on many powermac etc. machines already. How can you know this long hasn't passed since the last time you read the high half of the time base? Without reading that high part? The current (assembler) code already optimises converting this to some other scale quite well, better than a compiler can (see __do_get_tspec). Segher
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Le 27/10/2019 à 01:06, Segher Boessenkool a écrit : On Sat, Oct 26, 2019 at 08:48:27PM +0200, Thomas Gleixner wrote: On Sat, 26 Oct 2019, Christophe Leroy wrote: Let's look at the code: __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz) { const struct vdso_data *vd = __arch_get_vdso_data(); if (likely(tv != NULL)) { struct __kernel_timespec ts; if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts)) return gettimeofday_fallback(tv, tz); tv->tv_sec = ts.tv_sec; tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC; IIRC PPC did some magic math tricks to avoid that. Could you just for the fun of it replace this division with (u32)ts.tv_nsec >> 10; On this particular CPU (the 885, right?) a division by 1000 is just 9 cycles. On other CPUs it can be more, say 19 cycles like on the 750; not cheap at all, but not hugely expensive either, comparatively. (A 64/32->32 division is expensive on all 32-bit PowerPC: there is no hardware help for it at all, so it's all done in software.) Of course the compiler won't do a division by a constant with a division instruction at all, so it's somewhat cheaper even, 5 or 6 cycles or so. One thing which might be worth to try as well is to mark all functions in that file as inline. The speedup by the do_hres() inlining was impressive on PPC. The hand-optimised asm code will pretty likely win handsomely, whatever you do. Especially on cores like the 885 (no branch prediction, single issue, small caches, etc.: every instruction counts). Is there any reason to replace this hand-optimised code? It was written for exacty this reason? These functions are critical and should be as fast as possible. Well, all this started with COARSE clocks not being supported by PPC32 VDSO. I first submitted a series with a set of optimisations including the implementation of COARSE clocks (https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=126779) Then after a comment received on patch 4 of the series from Santosh Sivaraj asking for a common implementation of it for PPC32 and PPC64, I started looking into making the whole VDSO source code common to PPC32 and PPC64. Most functions are similar. Time functions are also rather similar but unfortunately don't use the same registers. They also don't cover all possible clocks. And getres() is also buggy, see series https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=110321 So instead of reworking the existing time functions, I started investigating whether we could plug powerpc to the generic implementation. One drawback of PPC is that we need to setup an ASM trampoline to handle the SO bit as it can't be handled from C directly, can it ? How critical are these functions ? Although we have a slight degration with the C implementation, they are still way faster than the corresponding syscall. Another thing I was wondering, is it worth using the 64 bit timebase on PPC32 ? As far as I understand, the timebase is there to calculate a linear date update since last VDSO datapage update. How often is the VDSO datapage updated ? On the 885 clocked at 132Mhz, the timebase is at 8.25 Mhz, which means it needs more than 8 minutes to loop over 32 bits. Christophe
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Sat, Oct 26, 2019 at 08:48:27PM +0200, Thomas Gleixner wrote: > On Sat, 26 Oct 2019, Christophe Leroy wrote: > Let's look at the code: > > __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz) > { > const struct vdso_data *vd = __arch_get_vdso_data(); > > if (likely(tv != NULL)) { > struct __kernel_timespec ts; > > if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts)) > return gettimeofday_fallback(tv, tz); > > tv->tv_sec = ts.tv_sec; > tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC; > > IIRC PPC did some magic math tricks to avoid that. Could you just for the > fun of it replace this division with > >(u32)ts.tv_nsec >> 10; On this particular CPU (the 885, right?) a division by 1000 is just 9 cycles. On other CPUs it can be more, say 19 cycles like on the 750; not cheap at all, but not hugely expensive either, comparatively. (A 64/32->32 division is expensive on all 32-bit PowerPC: there is no hardware help for it at all, so it's all done in software.) Of course the compiler won't do a division by a constant with a division instruction at all, so it's somewhat cheaper even, 5 or 6 cycles or so. > One thing which might be worth to try as well is to mark all functions in > that file as inline. The speedup by the do_hres() inlining was impressive > on PPC. The hand-optimised asm code will pretty likely win handsomely, whatever you do. Especially on cores like the 885 (no branch prediction, single issue, small caches, etc.: every instruction counts). Is there any reason to replace this hand-optimised code? It was written for exacty this reason? These functions are critical and should be as fast as possible. Segher
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Sat, 26 Oct 2019, Christophe Leroy wrote: > Le 26/10/2019 à 17:53, Thomas Gleixner a écrit : > > > > > > gettimeofday: vdso: 750 nsec/call > > > > > > > > > > > > gettimeofday: vdso: 1533 nsec/call > > > > > > > > Small improvement (3%) with the proposed change: > > > > > > > > gettimeofday: vdso: 1485 nsec/call > > > > > > By inlining do_hres() I get the following: > > > > > > gettimeofday:vdso: 1072 nsec/call > > > > What's the effect for clock_gettime()? > > > > gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra > > division, but clock_gettime() should be 1:1 comparable. > > > > Original PPC asm: > clock-gettime-realtime:vdso: 928 nsec/call > > My original RFC: > clock-gettime-realtime:vdso: 1570 nsec/call > > With your suggested vdso_calc_delta(): > clock-gettime-realtime:vdso: 1512 nsec/call > > With your vdso_calc_delta() and inlined do_hres(): > clock-gettime-realtime:vdso: 1302 nsec/call That does not make any sense at all. gettimeofday() is basically the same as clock_gettime(REALTIME) and does an extra division. So I would expect it to be slower. Let's look at the code: __cvdso_gettimeofday(struct __kernel_old_timeval *tv, struct timezone *tz) { const struct vdso_data *vd = __arch_get_vdso_data(); if (likely(tv != NULL)) { struct __kernel_timespec ts; if (do_hres(&vd[CS_HRES_COARSE], CLOCK_REALTIME, &ts)) return gettimeofday_fallback(tv, tz); tv->tv_sec = ts.tv_sec; tv->tv_usec = (u32)ts.tv_nsec / NSEC_PER_USEC; IIRC PPC did some magic math tricks to avoid that. Could you just for the fun of it replace this division with (u32)ts.tv_nsec >> 10; That's obviously incorrect, but it would tell us how heavy the division is. If that brings us close we might do something special for gettimeofday(). OTOH, last time I checked clock_gettime() was by far more used than gettimeofday() but that obviously depends on the use cases. } ... } and __cvdso_clock_gettime_common(clockid_t clock, struct __kernel_timespec *ts) { const struct vdso_data *vd = __arch_get_vdso_data(); u32 msk; /* Check for negative values or invalid clocks */ if (unlikely((u32) clock >= MAX_CLOCKS)) return -1; /* * Convert the clockid to a bitmask and use it to check which * clocks are handled in the VDSO directly. */ msk = 1U << clock; if (likely(msk & VDSO_HRES)) { return do_hres(&vd[CS_HRES_COARSE], clock, ts); So this is the extra code which is executed for clock_gettime(REAL) which is pure logic and certainly not as heavyweight as the division in the gettimeofday() code path. } static __maybe_unused int __cvdso_clock_gettime(clockid_t clock, struct __kernel_timespec *ts) { int ret = __cvdso_clock_gettime_common(clock, ts); if (unlikely(ret)) return clock_gettime_fallback(clock, ts); return 0; } One thing which might be worth to try as well is to mark all functions in that file as inline. The speedup by the do_hres() inlining was impressive on PPC. Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Le 26/10/2019 à 17:53, Thomas Gleixner a écrit : On Tue, 22 Oct 2019, Christophe Leroy wrote: Le 22/10/2019 à 11:01, Christophe Leroy a écrit : Le 21/10/2019 à 23:29, Thomas Gleixner a écrit : On Mon, 21 Oct 2019, Christophe Leroy wrote: This is a tentative to switch powerpc/32 vdso to generic C implementation. It will likely not work on 64 bits or even build properly at the moment. powerpc is a bit special for VDSO as well as system calls in the way that it requires setting CR SO bit which cannot be done in C. Therefore, entry/exit and fallback needs to be performed in ASM. To allow that, C fallbacks just return -1 and the ASM entry point performs the system call when the C function returns -1. The performance is rather disappoiting. That's most likely all calculation in the C implementation are based on 64 bits math and converted to 32 bits at the very end. I guess C implementation should use 32 bits math like the assembly VDSO does as of today. gettimeofday: vdso: 750 nsec/call gettimeofday: vdso: 1533 nsec/call Small improvement (3%) with the proposed change: gettimeofday: vdso: 1485 nsec/call By inlining do_hres() I get the following: gettimeofday:vdso: 1072 nsec/call What's the effect for clock_gettime()? gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra division, but clock_gettime() should be 1:1 comparable. Original PPC asm: clock-gettime-realtime:vdso: 928 nsec/call My original RFC: clock-gettime-realtime:vdso: 1570 nsec/call With your suggested vdso_calc_delta(): clock-gettime-realtime:vdso: 1512 nsec/call With your vdso_calc_delta() and inlined do_hres(): clock-gettime-realtime:vdso: 1302 nsec/call Christophe
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Le 26/10/2019 à 15:55, Andy Lutomirski a écrit : On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy wrote: The performance is rather disappoiting. That's most likely all calculation in the C implementation are based on 64 bits math and converted to 32 bits at the very end. I guess C implementation should use 32 bits math like the assembly VDSO does as of today. gettimeofday:vdso: 750 nsec/call gettimeofday:vdso: 1533 nsec/call Small improvement (3%) with the proposed change: gettimeofday:vdso: 1485 nsec/call By inlining do_hres() I get the following: gettimeofday:vdso: 1072 nsec/call A perf report might be informative. Not sure there is much to learn from perf report: With the original RFC: 51.83% test_vdso [vdso] [.] do_hres 24.86% test_vdso [vdso] [.] __c_kernel_gettimeofday 7.33% test_vdso [vdso] [.] __kernel_gettimeofday 5.77% test_vdso test_vdso [.] main 1.55% test_vdso [kernel.kallsyms] [k] copy_page 0.67% test_vdso libc-2.23.so [.] _dl_addr 0.55% test_vdso ld-2.23.so [.] do_lookup_x With vdso_calc_delta() optimised as suggested by Thomas + inlined do_hres(): 68.00% test_vdso [vdso] [.] __c_kernel_gettimeofday 12.59% test_vdso [vdso] [.] __kernel_gettimeofday 6.22% test_vdso test_vdso [.] main 2.07% test_vdso [kernel.kallsyms] [k] copy_page 1.04% test_vdso ld-2.23.so [.] _dl_relocate_object 0.89% test_vdso ld-2.23.so [.] do_lookup_x I've tried 'perf annotate', but I have not found how to tell perf to use vdso32.so.dbg file for annotate [vdso]. Test app: #include #include #include #include #include #include #include #include static int (*gettimeofday_vdso)(struct timeval *tv, struct timezone *tz); int main(int argc, char **argv) { void *handle = dlopen("linux-vdso32.so.1", RTLD_NOW | RTLD_GLOBAL); struct timeval tv; struct timezone tz; int i; (void)dlerror(); gettimeofday_vdso = dlsym(handle, "__kernel_gettimeofday"); if (dlerror()) gettimeofday_vdso = NULL; for (i = 0; i < 10; i++) gettimeofday_vdso(&tv, &tz); } Christophe
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Tue, 22 Oct 2019, Christophe Leroy wrote: > Le 22/10/2019 à 11:01, Christophe Leroy a écrit : > > Le 21/10/2019 à 23:29, Thomas Gleixner a écrit : > > > On Mon, 21 Oct 2019, Christophe Leroy wrote: > > > > > > > This is a tentative to switch powerpc/32 vdso to generic C > > > > implementation. > > > > It will likely not work on 64 bits or even build properly at the moment. > > > > > > > > powerpc is a bit special for VDSO as well as system calls in the > > > > way that it requires setting CR SO bit which cannot be done in C. > > > > Therefore, entry/exit and fallback needs to be performed in ASM. > > > > > > > > To allow that, C fallbacks just return -1 and the ASM entry point > > > > performs the system call when the C function returns -1. > > > > > > > > The performance is rather disappoiting. That's most likely all > > > > calculation in the C implementation are based on 64 bits math and > > > > converted to 32 bits at the very end. I guess C implementation should > > > > use 32 bits math like the assembly VDSO does as of today. > > > > > > > gettimeofday: vdso: 750 nsec/call > > > > > > > > gettimeofday: vdso: 1533 nsec/call > > > > Small improvement (3%) with the proposed change: > > > > gettimeofday: vdso: 1485 nsec/call > > By inlining do_hres() I get the following: > > gettimeofday:vdso: 1072 nsec/call What's the effect for clock_gettime()? gettimeofday() is suboptimal vs. the PPC ASM variant due to an extra division, but clock_gettime() should be 1:1 comparable. Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Tue, Oct 22, 2019 at 6:56 AM Christophe Leroy wrote: > > > >>> The performance is rather disappoiting. That's most likely all > >>> calculation in the C implementation are based on 64 bits math and > >>> converted to 32 bits at the very end. I guess C implementation should > >>> use 32 bits math like the assembly VDSO does as of today. > >> > >>> gettimeofday:vdso: 750 nsec/call > >>> > >>> gettimeofday:vdso: 1533 nsec/call > > > > Small improvement (3%) with the proposed change: > > > > gettimeofday:vdso: 1485 nsec/call > > By inlining do_hres() I get the following: > > gettimeofday:vdso: 1072 nsec/call > A perf report might be informative.
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Le 22/10/2019 à 11:01, Christophe Leroy a écrit : Le 21/10/2019 à 23:29, Thomas Gleixner a écrit : On Mon, 21 Oct 2019, Christophe Leroy wrote: This is a tentative to switch powerpc/32 vdso to generic C implementation. It will likely not work on 64 bits or even build properly at the moment. powerpc is a bit special for VDSO as well as system calls in the way that it requires setting CR SO bit which cannot be done in C. Therefore, entry/exit and fallback needs to be performed in ASM. To allow that, C fallbacks just return -1 and the ASM entry point performs the system call when the C function returns -1. The performance is rather disappoiting. That's most likely all calculation in the C implementation are based on 64 bits math and converted to 32 bits at the very end. I guess C implementation should use 32 bits math like the assembly VDSO does as of today. gettimeofday: vdso: 750 nsec/call gettimeofday: vdso: 1533 nsec/call Small improvement (3%) with the proposed change: gettimeofday: vdso: 1485 nsec/call By inlining do_hres() I get the following: gettimeofday:vdso: 1072 nsec/call Christophe Though still some way to go. Christophe The only real 64bit math which can matter is the 64bit * 32bit multiply, i.e. static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { return ((cycles - last) & mask) * mult; } Everything else is trivial add/sub/shift, which should be roughly the same in ASM. Can you try to replace that with: static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { u64 ret, delta = ((cycles - last) & mask); u32 dh, dl; dl = delta; dh = delta >> 32; res = mul_u32_u32(al, mul); if (ah) res += mul_u32_u32(ah, mul) << 32; return res; } That's pretty much what __do_get_tspec does in ASM. Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
Le 21/10/2019 à 23:29, Thomas Gleixner a écrit : On Mon, 21 Oct 2019, Christophe Leroy wrote: This is a tentative to switch powerpc/32 vdso to generic C implementation. It will likely not work on 64 bits or even build properly at the moment. powerpc is a bit special for VDSO as well as system calls in the way that it requires setting CR SO bit which cannot be done in C. Therefore, entry/exit and fallback needs to be performed in ASM. To allow that, C fallbacks just return -1 and the ASM entry point performs the system call when the C function returns -1. The performance is rather disappoiting. That's most likely all calculation in the C implementation are based on 64 bits math and converted to 32 bits at the very end. I guess C implementation should use 32 bits math like the assembly VDSO does as of today. gettimeofday:vdso: 750 nsec/call gettimeofday:vdso: 1533 nsec/call Small improvement (3%) with the proposed change: gettimeofday:vdso: 1485 nsec/call Though still some way to go. Christophe The only real 64bit math which can matter is the 64bit * 32bit multiply, i.e. static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { return ((cycles - last) & mask) * mult; } Everything else is trivial add/sub/shift, which should be roughly the same in ASM. Can you try to replace that with: static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { u64 ret, delta = ((cycles - last) & mask); u32 dh, dl; dl = delta; dh = delta >> 32; res = mul_u32_u32(al, mul); if (ah) res += mul_u32_u32(ah, mul) << 32; return res; } That's pretty much what __do_get_tspec does in ASM. Thanks, tglx
Re: [RFC PATCH] powerpc/32: Switch VDSO to C implementation.
On Mon, 21 Oct 2019, Christophe Leroy wrote: > This is a tentative to switch powerpc/32 vdso to generic C implementation. > It will likely not work on 64 bits or even build properly at the moment. > > powerpc is a bit special for VDSO as well as system calls in the > way that it requires setting CR SO bit which cannot be done in C. > Therefore, entry/exit and fallback needs to be performed in ASM. > > To allow that, C fallbacks just return -1 and the ASM entry point > performs the system call when the C function returns -1. > > The performance is rather disappoiting. That's most likely all > calculation in the C implementation are based on 64 bits math and > converted to 32 bits at the very end. I guess C implementation should > use 32 bits math like the assembly VDSO does as of today. > gettimeofday:vdso: 750 nsec/call > > gettimeofday:vdso: 1533 nsec/call The only real 64bit math which can matter is the 64bit * 32bit multiply, i.e. static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { return ((cycles - last) & mask) * mult; } Everything else is trivial add/sub/shift, which should be roughly the same in ASM. Can you try to replace that with: static __always_inline u64 vdso_calc_delta(u64 cycles, u64 last, u64 mask, u32 mult) { u64 ret, delta = ((cycles - last) & mask); u32 dh, dl; dl = delta; dh = delta >> 32; res = mul_u32_u32(al, mul); if (ah) res += mul_u32_u32(ah, mul) << 32; return res; } That's pretty much what __do_get_tspec does in ASM. Thanks, tglx
[RFC PATCH] powerpc/32: Switch VDSO to C implementation.
This is a tentative to switch powerpc/32 vdso to generic C implementation. It will likely not work on 64 bits or even build properly at the moment. powerpc is a bit special for VDSO as well as system calls in the way that it requires setting CR SO bit which cannot be done in C. Therefore, entry/exit and fallback needs to be performed in ASM. To allow that, C fallbacks just return -1 and the ASM entry point performs the system call when the C function returns -1. The performance is rather disappoiting. That's most likely all calculation in the C implementation are based on 64 bits math and converted to 32 bits at the very end. I guess C implementation should use 32 bits math like the assembly VDSO does as of today. With current powerpc/32 ASM VDSO: gettimeofday:vdso: 750 nsec/call clock-getres-realtime:vdso: 382 nsec/call clock-gettime-realtime:vdso: 928 nsec/call clock-getres-monotonic:vdso: 382 nsec/call clock-gettime-monotonic:vdso: 1033 nsec/call Once switched to C implementation: gettimeofday:vdso: 1533 nsec/call clock-getres-realtime:vdso: 853 nsec/call clock-gettime-realtime:vdso: 1570 nsec/call clock-getres-monotonic:vdso: 835 nsec/call clock-gettime-monotonic:vdso: 1605 nsec/call Signed-off-by: Christophe Leroy --- arch/powerpc/Kconfig | 2 + arch/powerpc/include/asm/vdso/gettimeofday.h | 81 arch/powerpc/include/asm/vdso/vsyscall.h | 27 +++ arch/powerpc/include/asm/vdso_datapage.h | 16 +- arch/powerpc/kernel/asm-offsets.c| 26 +-- arch/powerpc/kernel/time.c | 90 + arch/powerpc/kernel/vdso.c | 19 +- arch/powerpc/kernel/vdso32/Makefile | 19 +- arch/powerpc/kernel/vdso32/gettimeofday.S| 270 ++- arch/powerpc/kernel/vdso32/vgettimeofday.c | 32 10 files changed, 233 insertions(+), 349 deletions(-) create mode 100644 arch/powerpc/include/asm/vdso/gettimeofday.h create mode 100644 arch/powerpc/include/asm/vdso/vsyscall.h create mode 100644 arch/powerpc/kernel/vdso32/vgettimeofday.c diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig index 3e56c9c2f16e..a363c5186b82 100644 --- a/arch/powerpc/Kconfig +++ b/arch/powerpc/Kconfig @@ -168,6 +168,7 @@ config PPC select GENERIC_STRNCPY_FROM_USER select GENERIC_STRNLEN_USER select GENERIC_TIME_VSYSCALL + select GENERIC_GETTIMEOFDAY select HAVE_ARCH_AUDITSYSCALL select HAVE_ARCH_HUGE_VMAP if PPC_BOOK3S_64 && PPC_RADIX_MMU select HAVE_ARCH_JUMP_LABEL @@ -197,6 +198,7 @@ config PPC select HAVE_FUNCTION_GRAPH_TRACER select HAVE_FUNCTION_TRACER select HAVE_GCC_PLUGINS if GCC_VERSION >= 50200 # plugin support on gcc <= 5.1 is buggy on PPC + select HAVE_GENERIC_VDSO select HAVE_HW_BREAKPOINT if PERF_EVENTS && (PPC_BOOK3S || PPC_8xx) select HAVE_IDE select HAVE_IOREMAP_PROT diff --git a/arch/powerpc/include/asm/vdso/gettimeofday.h b/arch/powerpc/include/asm/vdso/gettimeofday.h new file mode 100644 index ..6de875cf4b75 --- /dev/null +++ b/arch/powerpc/include/asm/vdso/gettimeofday.h @@ -0,0 +1,81 @@ +/* SPDX-License-Identifier: GPL-2.0 */ +/* + * Copyright (C) 2018 ARM Limited + */ +#ifndef __ASM_VDSO_GETTIMEOFDAY_H +#define __ASM_VDSO_GETTIMEOFDAY_H + +#ifndef __ASSEMBLY__ + +#include +#include +#include + +#define __VDSO_USE_SYSCALL ULLONG_MAX + +#define VDSO_HAS_CLOCK_GETRES 1 + +#define VDSO_HAS_TIME 1 + +#define VDSO_HAS_32BIT_FALLBACK1 + +static __always_inline +int gettimeofday_fallback(struct __kernel_old_timeval *_tv, + struct timezone *_tz) +{ + return -1; +} + +static __always_inline +long clock_gettime_fallback(clockid_t _clkid, struct __kernel_timespec *_ts) +{ + return -1; +} + +static __always_inline +int clock_getres_fallback(clockid_t _clkid, struct __kernel_timespec *_ts) +{ + return -1; +} + +static __always_inline +int clock_getres32_fallback(clockid_t clock, struct old_timespec32 *res) +{ + return -1; +} + +static __always_inline +int clock_gettime32_fallback(clockid_t clock, struct old_timespec32 *res) +{ + return -1; +} + +static __always_inline u64 __arch_get_hw_counter(s32 clock_mode) +{ + /* +* clock_mode == 0 implies that vDSO are enabled otherwise +* fallback on syscall. +*/ + if (clock_mode) + return __VDSO_USE_SYSCALL; + + return get_tb(); +} + +static __always_inline +const struct vdso_data *__arch_get_vdso_data(void) +{ + void *ptr; + + asm volatile( + " bcl 20, 31, .+4;\n" + " mflr%0;\n" + " addi%0, %0, __kernel_datapage_offset - (.-4);\n" + : "=b"(ptr) : : "lr"); + + return ptr + *(unsigned long