Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On 5 July 2016 at 13:24, Vijay Kilariwrote: > On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson wrote: >> Consider >> >> #define VECTYPEuint32x4_t >> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0) >> >> >> which compiles down to >> >> 1c: 6e211c00eor v0.16b, v0.16b, v1.16b >> 20: 6eb0a800umaxv s0, v0.4s >> 24: 1e26fmovw0, s0 >> 28: 6b1f001fcmp w0, wzr >> 2c: 1a9f17e0csetw0, eq >> 30: d65f03c0ret > > For me this code compiles as below and migration time is ~100ms more. Thanks for benchmarking this. I'll take your original patch into target-arm.next. -- PMM
Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On Sat, Jul 2, 2016 at 3:37 AM, Richard Hendersonwrote: > On 06/30/2016 06:45 AM, Peter Maydell wrote: >> >> On 29 June 2016 at 09:47, wrote: >>> >>> From: Vijay >>> >>> Use Neon instructions to perform zero checking of >>> buffer. This is helps in reducing total migration time. >> >> >>> diff --git a/util/cutils.c b/util/cutils.c >>> index 5830a68..4779403 100644 >>> --- a/util/cutils.c >>> +++ b/util/cutils.c >>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) >>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>> 0x) >>> #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) >>> +#elif __aarch64__ >>> +#include "arm_neon.h" >>> +#define VECTYPEuint64x2_t >>> +#define ALL_EQ(v1, v2) \ >>> +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ >>> + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) >>> +#define VEC_OR(v1, v2) ((v1) | (v2)) >> >> >> Should be '#elif defined(__aarch64__)'. I have made this >> tweak and put this patch in target-arm.next. > > > Consider > > #define VECTYPEuint32x4_t > #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0) > > > which compiles down to > > 1c: 6e211c00eor v0.16b, v0.16b, v1.16b > 20: 6eb0a800umaxv s0, v0.4s > 24: 1e26fmovw0, s0 > 28: 6b1f001fcmp w0, wzr > 2c: 1a9f17e0csetw0, eq > 30: d65f03c0ret For me this code compiles as below and migration time is ~100ms more. See below 3 trails of migration time 7039cc: 6eb0a800umaxv s0, v0.4s 7039d0: 0e043c02mov w2, v0.s[0] 7039d4: 35c2cbnzw2, 7039ec 7039d8: 91002084add x4, x4, #0x8 7039dc: 91020063add x3, x3, #0x80 7039e0: eb01009fcmp x4, x1 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3070 milliseconds downtime: 55 milliseconds setup: 4 milliseconds transferred ram: 300637 kbytes throughput: 802.49 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062834 pages skipped: 0 pages normal: 70489 pages normal bytes: 281956 kbytes dirty sync count: 3 (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3067 milliseconds downtime: 47 milliseconds setup: 5 milliseconds transferred ram: 290277 kbytes throughput: 775.61 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2064185 pages skipped: 0 pages normal: 67901 pages normal bytes: 271604 kbytes dirty sync count: 3 (qemu) (qemu) info migrate capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 3067 milliseconds downtime: 34 milliseconds setup: 5 milliseconds transferred ram: 294614 kbytes throughput: 787.19 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2063365 pages skipped: 0 pages normal: 68985 pages normal bytes: 275940 kbytes dirty sync count: 3 > > vs > > 34: 4e083c20mov x0, v1.d[0] > 38: 4e083c01mov x1, v0.d[0] > 3c: eb3fcmp x1, x0 > 40: 5280mov w0, #0 > 44: 5440b.eq4c > 48: d65f03c0ret > 4c: 4e183c20mov x0, v1.d[1] > 50: 4e183c01mov x1, v0.d[1] > 54: eb3fcmp x1, x0 > 58: 1a9f17e0csetw0, eq > 5c: d65f03c0ret > My patch compiles to below code and takes ~100ms less time #define VECTYPEuint64x2_t #define ALL_EQ(v1, v2) \ ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) 7039d0: 4e083c02mov x2, v0.d[0] 7039d4: b5000102cbnzx2, 7039f4 7039d8: 4e183c02mov x2, v0.d[1] 7039dc: b5c2cbnzx2, 7039f4 7039e0: 91002084add x4, x4, #0x8 7039e4: 91020063add x3, x3, #0x80 7039e8: eb04003fcmp x1, x4 capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off events: off x-postcopy-ram: off Migration status: completed total time: 2973 milliseconds downtime: 67 milliseconds setup: 5 milliseconds transferred ram: 293659 kbytes throughput: 809.45 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062791 pages skipped: 0 pages normal: 68748 pages normal bytes: 274992 kbytes dirty sync count: 3 (qemu) capabilities: xbzrle: off rdma-pin-all: off auto-converge: off zero-blocks: off compress: off
Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On 1 July 2016 at 23:07, Richard Hendersonwrote: > On 06/30/2016 06:45 AM, Peter Maydell wrote: >> >> On 29 June 2016 at 09:47, wrote: >>> >>> From: Vijay >>> >>> Use Neon instructions to perform zero checking of >>> buffer. This is helps in reducing total migration time. >> >> >>> diff --git a/util/cutils.c b/util/cutils.c >>> index 5830a68..4779403 100644 >>> --- a/util/cutils.c >>> +++ b/util/cutils.c >>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) >>> #define SPLAT(p) _mm_set1_epi8(*(p)) >>> #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == >>> 0x) >>> #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) >>> +#elif __aarch64__ >>> +#include "arm_neon.h" >>> +#define VECTYPEuint64x2_t >>> +#define ALL_EQ(v1, v2) \ >>> +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ >>> + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) >>> +#define VEC_OR(v1, v2) ((v1) | (v2)) >> >> >> Should be '#elif defined(__aarch64__)'. I have made this >> tweak and put this patch in target-arm.next. > > > Consider > > #define VECTYPEuint32x4_t > #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0) Sounds good. Vijay, could you benchmark that variant, please? thanks -- PMM
Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On 06/30/2016 06:45 AM, Peter Maydell wrote: On 29 June 2016 at 09:47,wrote: From: Vijay Use Neon instructions to perform zero checking of buffer. This is helps in reducing total migration time. diff --git a/util/cutils.c b/util/cutils.c index 5830a68..4779403 100644 --- a/util/cutils.c +++ b/util/cutils.c @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) #define SPLAT(p) _mm_set1_epi8(*(p)) #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0x) #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) +#elif __aarch64__ +#include "arm_neon.h" +#define VECTYPEuint64x2_t +#define ALL_EQ(v1, v2) \ +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) +#define VEC_OR(v1, v2) ((v1) | (v2)) Should be '#elif defined(__aarch64__)'. I have made this tweak and put this patch in target-arm.next. Consider #define VECTYPEuint32x4_t #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0) which compiles down to 1c: 6e211c00eor v0.16b, v0.16b, v1.16b 20: 6eb0a800umaxv s0, v0.4s 24: 1e26fmovw0, s0 28: 6b1f001fcmp w0, wzr 2c: 1a9f17e0csetw0, eq 30: d65f03c0ret vs 34: 4e083c20mov x0, v1.d[0] 38: 4e083c01mov x1, v0.d[0] 3c: eb3fcmp x1, x0 40: 5280mov w0, #0 44: 5440b.eq4c 48: d65f03c0ret 4c: 4e183c20mov x0, v1.d[1] 50: 4e183c01mov x1, v0.d[1] 54: eb3fcmp x1, x0 58: 1a9f17e0csetw0, eq 5c: d65f03c0ret r~
Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On 29 June 2016 at 09:47,wrote: > From: Vijay > > Use Neon instructions to perform zero checking of > buffer. This is helps in reducing total migration time. > diff --git a/util/cutils.c b/util/cutils.c > index 5830a68..4779403 100644 > --- a/util/cutils.c > +++ b/util/cutils.c > @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) > #define SPLAT(p) _mm_set1_epi8(*(p)) > #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0x) > #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) > +#elif __aarch64__ > +#include "arm_neon.h" > +#define VECTYPEuint64x2_t > +#define ALL_EQ(v1, v2) \ > +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ > + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) > +#define VEC_OR(v1, v2) ((v1) | (v2)) Should be '#elif defined(__aarch64__)'. I have made this tweak and put this patch in target-arm.next. thanks -- PMM
[Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
From: VijayUse Neon instructions to perform zero checking of buffer. This is helps in reducing total migration time. Use case: Idle VM live migration with 4 VCPUS and 8GB ram running CentOS 7. Without Neon, the Total migration time is 3.5 Sec Migration status: completed total time: 3560 milliseconds downtime: 33 milliseconds setup: 5 milliseconds transferred ram: 297907 kbytes throughput: 685.76 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2062760 pages skipped: 0 pages normal: 69808 pages normal bytes: 279232 kbytes dirty sync count: 3 With Neon, the total migration time is 2.9 Sec Migration status: completed total time: 2960 milliseconds downtime: 65 milliseconds setup: 4 milliseconds transferred ram: 299869 kbytes throughput: 830.19 mbps remaining ram: 0 kbytes total ram: 8519872 kbytes duplicate: 2064313 pages skipped: 0 pages normal: 70294 pages normal bytes: 281176 kbytes dirty sync count: 3 Signed-off-by: Vijaya Kumar K Signed-off-by: Suresh --- util/cutils.c |7 +++ 1 file changed, 7 insertions(+) diff --git a/util/cutils.c b/util/cutils.c index 5830a68..4779403 100644 --- a/util/cutils.c +++ b/util/cutils.c @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) #define SPLAT(p) _mm_set1_epi8(*(p)) #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0x) #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) +#elif __aarch64__ +#include "arm_neon.h" +#define VECTYPEuint64x2_t +#define ALL_EQ(v1, v2) \ +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) +#define VEC_OR(v1, v2) ((v1) | (v2)) #else #define VECTYPEunsigned long #define SPLAT(p) (*(p) * (~0UL / 255)) -- 1.7.9.5
Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking
On 29/06/2016 10:47, vija...@cavium.com wrote: > From: Vijay> > Use Neon instructions to perform zero checking of > buffer. This is helps in reducing total migration time. > > Use case: Idle VM live migration with 4 VCPUS and 8GB ram > running CentOS 7. > > Without Neon, the Total migration time is 3.5 Sec > > Migration status: completed > total time: 3560 milliseconds > downtime: 33 milliseconds > setup: 5 milliseconds > transferred ram: 297907 kbytes > throughput: 685.76 mbps > remaining ram: 0 kbytes > total ram: 8519872 kbytes > duplicate: 2062760 pages > skipped: 0 pages > normal: 69808 pages > normal bytes: 279232 kbytes > dirty sync count: 3 > > With Neon, the total migration time is 2.9 Sec > > Migration status: completed > total time: 2960 milliseconds > downtime: 65 milliseconds > setup: 4 milliseconds > transferred ram: 299869 kbytes > throughput: 830.19 mbps > remaining ram: 0 kbytes > total ram: 8519872 kbytes > duplicate: 2064313 pages > skipped: 0 pages > normal: 70294 pages > normal bytes: 281176 kbytes > dirty sync count: 3 > > Signed-off-by: Vijaya Kumar K > Signed-off-by: Suresh > --- > util/cutils.c |7 +++ > 1 file changed, 7 insertions(+) > > diff --git a/util/cutils.c b/util/cutils.c > index 5830a68..4779403 100644 > --- a/util/cutils.c > +++ b/util/cutils.c > @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd) > #define SPLAT(p) _mm_set1_epi8(*(p)) > #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0x) > #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2)) > +#elif __aarch64__ > +#include "arm_neon.h" > +#define VECTYPEuint64x2_t > +#define ALL_EQ(v1, v2) \ > +((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \ > + (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1))) > +#define VEC_OR(v1, v2) ((v1) | (v2)) > #else > #define VECTYPEunsigned long > #define SPLAT(p) (*(p) * (~0UL / 255)) > Acked-by: Paolo Bonzini