Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
2015-01-09 12:22 GMT+00:00 Peter Maydell peter.mayd...@linaro.org: On 9 January 2015 at 11:25, Frediano Ziglio fredd...@gmail.com wrote: As this platform can do multiply/divide using 128 bit precision use these instructions to implement it. Signed-off-by: Frediano Ziglio frediano.zig...@huawei.com --- include/qemu-common.h | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/include/qemu-common.h b/include/qemu-common.h index f3033ae..880659d 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -370,11 +370,23 @@ static inline uint8_t from_bcd(uint8_t val) } /* compute with 96 bit intermediate result: (a*b)/c */ -#ifdef CONFIG_INT128 +#if defined(CONFIG_INT128) !defined(__x86_64__) static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { return (__int128)a * b / c; } +#elif defined(__x86_64__) +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ This assumption isn't necessarily true, and this implementation will dump core on overflow. For instance: Inputs: a = 8000 b = 8000 c = 1 fn muldiv64 result 0 fn muldiv64_with_int128 result 0 fn muldiv64_with_uint128 result 0 Floating point exception (core dumped) -- PMM Yes, I know, it was meant to be a precondition, not a math rule. Doing some grep I'm not really sure this is valid in all cases however I would ask if the truncation is handled correctly. Surely in some cases the call to this function is not needed like in muldiv64(1, tks, usb_bit_time); or in muldiv64((int16_t) s-rtc.comp, 1000, 0x8000); Frediano
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
On 9 January 2015 at 16:08, Frediano Ziglio fredd...@gmail.com wrote: I agree (after some digging) we are not sure we won't get that overflow. Agree to drop the second patch. However I would retain the first. Compiler can use it to optimize much easier. For instance if compiler understand that the multiplication fits into a 64 bit can decide to avoid the 128 bit operation easily, not so easy with all these shift, multiply, division and union structure. Yes, the uint128_t patch is a good idea. -- PMM
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
On 9 January 2015 at 15:41, Frediano Ziglio fredd...@gmail.com wrote: 2015-01-09 12:22 GMT+00:00 Peter Maydell peter.mayd...@linaro.org: +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ This assumption isn't necessarily true, and this implementation will dump core on overflow. Yes, I know, it was meant to be a precondition, not a math rule. Doing some grep I'm not really sure this is valid in all cases however I would ask if the truncation is handled correctly. Surely in some cases the call to this function is not needed like in muldiv64(1, tks, usb_bit_time); or in muldiv64((int16_t) s-rtc.comp, 1000, 0x8000); Sure, there are some cases where we know the calculation won't overflow, but there are also cases where we can't be so sure, and if the parameters can be controlled by the guest then we absolutely can't allow the guest to cause QEMU to SIGFPE. Incidentally, in the examples you quote gcc is generally able (because the function is inline) to do a better job because of some of the constants involved: for instance with muldiv64((int16_t) s-rtc.comp, 1000, 0x8000); it generates a completely inline insn sequence with one mul and no div insns. So I think at this point I come back to an earlier question: do you have profiling results showing this function to be a performance problem on a real workload? If not, it just doesn't seem to me to be worth the risk and the maintenance burden to add a native x86 assembly version. -- PMM
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
2015-01-09 15:52 GMT+00:00 Peter Maydell peter.mayd...@linaro.org: On 9 January 2015 at 15:41, Frediano Ziglio fredd...@gmail.com wrote: 2015-01-09 12:22 GMT+00:00 Peter Maydell peter.mayd...@linaro.org: +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ This assumption isn't necessarily true, and this implementation will dump core on overflow. Yes, I know, it was meant to be a precondition, not a math rule. Doing some grep I'm not really sure this is valid in all cases however I would ask if the truncation is handled correctly. Surely in some cases the call to this function is not needed like in muldiv64(1, tks, usb_bit_time); or in muldiv64((int16_t) s-rtc.comp, 1000, 0x8000); Sure, there are some cases where we know the calculation won't overflow, but there are also cases where we can't be so sure, and if the parameters can be controlled by the guest then we absolutely can't allow the guest to cause QEMU to SIGFPE. Incidentally, in the examples you quote gcc is generally able (because the function is inline) to do a better job because of some of the constants involved: for instance with muldiv64((int16_t) s-rtc.comp, 1000, 0x8000); it generates a completely inline insn sequence with one mul and no div insns. So I think at this point I come back to an earlier question: do you have profiling results showing this function to be a performance problem on a real workload? If not, it just doesn't seem to me to be worth the risk and the maintenance burden to add a native x86 assembly version. -- PMM I agree (after some digging) we are not sure we won't get that overflow. Agree to drop the second patch. However I would retain the first. Compiler can use it to optimize much easier. For instance if compiler understand that the multiplication fits into a 64 bit can decide to avoid the 128 bit operation easily, not so easy with all these shift, multiply, division and union structure. I didn't do any profiling. Frediano
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
2015-01-09 11:43 GMT+00:00 Paolo Bonzini pbonz...@redhat.com: On 09/01/2015 12:25, Frediano Ziglio wrote: /* compute with 96 bit intermediate result: (a*b)/c */ -#ifdef CONFIG_INT128 +#if defined(CONFIG_INT128) !defined(__x86_64__) static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { return (__int128)a * b / c; } +#elif defined(__x86_64__) +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) +{ +uint64_t res; + +asm (mulq %2\n\tdivq %3 + : =a(res) + : a(a), qm((uint64_t) b), qm((uint64_t)c) + : rdx, cc); +return res; +} Reorder it the other way, and you can simplify the first #if. Yes, I was however thinking about people reading code. The int128 version is much easier to read so I put it first. I applied patch 1 locally, and will send a pull request once the tree is thawed. Paolo Frediano
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
On 9 January 2015 at 11:25, Frediano Ziglio fredd...@gmail.com wrote: As this platform can do multiply/divide using 128 bit precision use these instructions to implement it. Signed-off-by: Frediano Ziglio frediano.zig...@huawei.com --- include/qemu-common.h | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/include/qemu-common.h b/include/qemu-common.h index f3033ae..880659d 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -370,11 +370,23 @@ static inline uint8_t from_bcd(uint8_t val) } /* compute with 96 bit intermediate result: (a*b)/c */ -#ifdef CONFIG_INT128 +#if defined(CONFIG_INT128) !defined(__x86_64__) static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { return (__int128)a * b / c; } +#elif defined(__x86_64__) +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ This assumption isn't necessarily true, and this implementation will dump core on overflow. For instance: Inputs: a = 8000 b = 8000 c = 1 fn muldiv64 result 0 fn muldiv64_with_int128 result 0 fn muldiv64_with_uint128 result 0 Floating point exception (core dumped) -- PMM
[Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
As this platform can do multiply/divide using 128 bit precision use these instructions to implement it. Signed-off-by: Frediano Ziglio frediano.zig...@huawei.com --- include/qemu-common.h | 14 +- 1 file changed, 13 insertions(+), 1 deletion(-) diff --git a/include/qemu-common.h b/include/qemu-common.h index f3033ae..880659d 100644 --- a/include/qemu-common.h +++ b/include/qemu-common.h @@ -370,11 +370,23 @@ static inline uint8_t from_bcd(uint8_t val) } /* compute with 96 bit intermediate result: (a*b)/c */ -#ifdef CONFIG_INT128 +#if defined(CONFIG_INT128) !defined(__x86_64__) static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { return (__int128)a * b / c; } +#elif defined(__x86_64__) +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) +{ +uint64_t res; + +asm (mulq %2\n\tdivq %3 + : =a(res) + : a(a), qm((uint64_t) b), qm((uint64_t)c) + : rdx, cc); +return res; +} #else static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { -- 1.9.1
Re: [Qemu-devel] [PATCH 2/2] qemu-common.h: optimise muldiv64 for x86_64 architecture
On 09/01/2015 12:25, Frediano Ziglio wrote: /* compute with 96 bit intermediate result: (a*b)/c */ -#ifdef CONFIG_INT128 +#if defined(CONFIG_INT128) !defined(__x86_64__) static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) { return (__int128)a * b / c; } +#elif defined(__x86_64__) +/* Optimised x64 version. This assume that a*b/c fits in 64 bit */ +static inline uint64_t muldiv64(uint64_t a, uint32_t b, uint32_t c) +{ +uint64_t res; + +asm (mulq %2\n\tdivq %3 + : =a(res) + : a(a), qm((uint64_t) b), qm((uint64_t)c) + : rdx, cc); +return res; +} Reorder it the other way, and you can simplify the first #if. I applied patch 1 locally, and will send a pull request once the tree is thawed. Paolo