Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-07-11 Thread Peter Maydell
On 5 July 2016 at 13:24, Vijay Kilari  wrote:
> On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson  wrote:
>> Consider
>>
>> #define VECTYPE        uint32x4_t
>> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>>
>>
>> which compiles down to
>>
>>   1c:   6e211c00    eor     v0.16b, v0.16b, v1.16b
>>   20:   6eb0a800    umaxv   s0, v0.4s
>>   24:   1e260000    fmov    w0, s0
>>   28:   6b1f001f    cmp     w0, wzr
>>   2c:   1a9f17e0    cset    w0, eq
>>   30:   d65f03c0    ret
>
> For me this code compiles as below and migration time is ~100ms more.

Thanks for benchmarking this. I'll take your original patch into
target-arm.next.

-- PMM
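
For reference, the "~100ms" gap can be checked against the total migration times reported in this thread. The following is only a small arithmetic sketch: the numbers are copied from the `info migrate` outputs in the messages below, while the names `mean_ms` and `saved_ms` are made up for illustration.

```c
#include <stddef.h>

/* Total migration times in milliseconds, as reported in this thread. */
static const long baseline_ms = 3560;                 /* without Neon */
static const long lanes_ms[]  = { 2960, 2973 };       /* vgetq_lane_u64 variant */
static const long umaxv_ms[]  = { 3070, 3067, 3067 }; /* vmaxvq_u32 variant */

/* Integer mean of n samples, rounded down (hypothetical helper). */
static long mean_ms(const long *runs, size_t n)
{
    long sum = 0;

    for (size_t i = 0; i < n; i++) {
        sum += runs[i];
    }
    return n ? sum / (long)n : 0;
}

/* Milliseconds saved going from 'before' to 'after' (hypothetical helper). */
static long saved_ms(long before, long after)
{
    return before - after;
}
```

With these figures, `mean_ms(umaxv_ms, 3) - mean_ms(lanes_ms, 2)` comes to 102 ms, and `saved_ms(baseline_ms, lanes_ms[0])` to 600 ms. Only one complete re-run of the original patch is visible in the thread, so the lane-compare mean rests on two samples.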



Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-07-05 Thread Vijay Kilari
On Sat, Jul 2, 2016 at 3:37 AM, Richard Henderson  wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,   wrote:
>>>
>>> From: Vijay 
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This helps reduce total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)
>
>
> which compiles down to
>
>   1c:   6e211c00    eor     v0.16b, v0.16b, v1.16b
>   20:   6eb0a800    umaxv   s0, v0.4s
>   24:   1e260000    fmov    w0, s0
>   28:   6b1f001f    cmp     w0, wzr
>   2c:   1a9f17e0    cset    w0, eq
>   30:   d65f03c0    ret

For me this code compiles as below and migration time is ~100ms more.

See below for 3 trials of migration time.

  7039cc:   6eb0a800    umaxv   s0, v0.4s
  7039d0:   0e043c02    mov     w2, v0.s[0]
  7039d4:   350000c2    cbnz    w2, 7039ec
  7039d8:   91002084    add     x4, x4, #0x8
  7039dc:   91020063    add     x3, x3, #0x80
  7039e0:   eb01009f    cmp     x4, x1

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3070 milliseconds
downtime: 55 milliseconds
setup: 4 milliseconds
transferred ram: 300637 kbytes
throughput: 802.49 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062834 pages
skipped: 0 pages
normal: 70489 pages
normal bytes: 281956 kbytes
dirty sync count: 3

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 47 milliseconds
setup: 5 milliseconds
transferred ram: 290277 kbytes
throughput: 775.61 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064185 pages
skipped: 0 pages
normal: 67901 pages
normal bytes: 271604 kbytes
dirty sync count: 3
(qemu)

(qemu) info migrate
capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 3067 milliseconds
downtime: 34 milliseconds
setup: 5 milliseconds
transferred ram: 294614 kbytes
throughput: 787.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2063365 pages
skipped: 0 pages
normal: 68985 pages
normal bytes: 275940 kbytes
dirty sync count: 3

>
> vs
>
>   34:   4e083c20    mov     x0, v1.d[0]
>   38:   4e083c01    mov     x1, v0.d[0]
>   3c:   eb00003f    cmp     x1, x0
>   40:   52800000    mov     w0, #0
>   44:   54000040    b.eq    4c
>   48:   d65f03c0    ret
>   4c:   4e183c20    mov     x0, v1.d[1]
>   50:   4e183c01    mov     x1, v0.d[1]
>   54:   eb00003f    cmp     x1, x0
>   58:   1a9f17e0    cset    w0, eq
>   5c:   d65f03c0    ret
>

My patch compiles to the code below and takes ~100ms less time.

#define VECTYPE        uint64x2_t
#define ALL_EQ(v1, v2) \
    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))

  7039d0:   4e083c02    mov     x2, v0.d[0]
  7039d4:   b5000102    cbnz    x2, 7039f4
  7039d8:   4e183c02    mov     x2, v0.d[1]
  7039dc:   b50000c2    cbnz    x2, 7039f4
  7039e0:   91002084    add     x4, x4, #0x8
  7039e4:   91020063    add     x3, x3, #0x80
  7039e8:   eb04003f    cmp     x1, x4

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off events: off x-postcopy-ram: off
Migration status: completed
total time: 2973 milliseconds
downtime: 67 milliseconds
setup: 5 milliseconds
transferred ram: 293659 kbytes
throughput: 809.45 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062791 pages
skipped: 0 pages
normal: 68748 pages
normal bytes: 274992 kbytes
dirty sync count: 3
(qemu)

capabilities: xbzrle: off rdma-pin-all: off auto-converge: off
zero-blocks: off compress: off 

Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-07-02 Thread Peter Maydell
On 1 July 2016 at 23:07, Richard Henderson  wrote:
> On 06/30/2016 06:45 AM, Peter Maydell wrote:
>>
>> On 29 June 2016 at 09:47,   wrote:
>>>
>>> From: Vijay 
>>>
>>> Use Neon instructions to perform zero checking of
>>> buffer. This helps reduce total migration time.
>>
>>
>>> diff --git a/util/cutils.c b/util/cutils.c
>>> index 5830a68..4779403 100644
>>> --- a/util/cutils.c
>>> +++ b/util/cutils.c
>>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>>> +#elif __aarch64__
>>> +#include "arm_neon.h"
>>> +#define VECTYPE        uint64x2_t
>>> +#define ALL_EQ(v1, v2) \
>>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>>
>>
>> Should be '#elif defined(__aarch64__)'. I have made this
>> tweak and put this patch in target-arm.next.
>
>
> Consider
>
> #define VECTYPE        uint32x4_t
> #define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)

Sounds good. Vijay, could you benchmark that variant, please?

thanks
-- PMM



Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-07-01 Thread Richard Henderson

On 06/30/2016 06:45 AM, Peter Maydell wrote:

> On 29 June 2016 at 09:47,   wrote:
>> From: Vijay 
>>
>> Use Neon instructions to perform zero checking of
>> buffer. This helps reduce total migration time.
>
>> diff --git a/util/cutils.c b/util/cutils.c
>> index 5830a68..4779403 100644
>> --- a/util/cutils.c
>> +++ b/util/cutils.c
>> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
>> +#elif __aarch64__
>> +#include "arm_neon.h"
>> +#define VECTYPE        uint64x2_t
>> +#define ALL_EQ(v1, v2) \
>> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
>> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
>> +#define VEC_OR(v1, v2) ((v1) | (v2))
>
> Should be '#elif defined(__aarch64__)'. I have made this
> tweak and put this patch in target-arm.next.


Consider

#define VECTYPE        uint32x4_t
#define ALL_EQ(v1, v2) (vmaxvq_u32((v1) ^ (v2)) == 0)


which compiles down to

  1c:   6e211c00    eor     v0.16b, v0.16b, v1.16b
  20:   6eb0a800    umaxv   s0, v0.4s
  24:   1e260000    fmov    w0, s0
  28:   6b1f001f    cmp     w0, wzr
  2c:   1a9f17e0    cset    w0, eq
  30:   d65f03c0    ret

vs

  34:   4e083c20    mov     x0, v1.d[0]
  38:   4e083c01    mov     x1, v0.d[0]
  3c:   eb00003f    cmp     x1, x0
  40:   52800000    mov     w0, #0
  44:   54000040    b.eq    4c
  48:   d65f03c0    ret
  4c:   4e183c20    mov     x0, v1.d[1]
  50:   4e183c01    mov     x1, v0.d[1]
  54:   eb00003f    cmp     x1, x0
  58:   1a9f17e0    cset    w0, eq
  5c:   d65f03c0    ret


r~
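
The two ALL_EQ candidates can be put side by side as ordinary functions. This is a rough sketch, not the code from the patch: the names `blocks_equal_lanes` and `blocks_equal_umaxv` are invented here, the inputs are loaded from 16-byte buffers for convenience, and the non-AArch64 branch is just a `memcmp` stand-in so the example builds on other hosts.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>

/* Variant from the patch: compare the two 64-bit lanes one after the other. */
static bool blocks_equal_lanes(const uint8_t *a, const uint8_t *b)
{
    uint64x2_t v1 = vreinterpretq_u64_u8(vld1q_u8(a));
    uint64x2_t v2 = vreinterpretq_u64_u8(vld1q_u8(b));

    return vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0) &&
           vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1);
}

/* Suggested variant: XOR the inputs, then take the maximum across lanes. */
static bool blocks_equal_umaxv(const uint8_t *a, const uint8_t *b)
{
    uint32x4_t v1 = vreinterpretq_u32_u8(vld1q_u8(a));
    uint32x4_t v2 = vreinterpretq_u32_u8(vld1q_u8(b));

    /* Any differing bit leaves a non-zero lane, so the maximum is non-zero. */
    return vmaxvq_u32(veorq_u32(v1, v2)) == 0;
}
#else
/* Portable stand-ins (assumption: only used to keep the sketch buildable). */
static bool blocks_equal_lanes(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, 16) == 0;
}

static bool blocks_equal_umaxv(const uint8_t *a, const uint8_t *b)
{
    return memcmp(a, b, 16) == 0;
}
#endif
```

Both functions return the same answer for any pair of 16-byte blocks; which one is faster depends on the hardware and on how the compiler schedules the surrounding loop, so it is worth measuring rather than assuming.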



Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-06-30 Thread Peter Maydell
On 29 June 2016 at 09:47,   wrote:
> From: Vijay 
>
> Use Neon instructions to perform zero checking of
> buffer. This helps reduce total migration time.

> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE        uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))

Should be '#elif defined(__aarch64__)'. I have made this
tweak and put this patch in target-arm.next.

thanks
-- PMM
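
The two spellings differ when `__aarch64__` is not defined at all: in `#elif __aarch64__` the unknown identifier is silently treated as 0 (and draws a warning under GCC's `-Wundef`), whereas `defined(__aarch64__)` tests for the macro's existence explicitly. A minimal sketch of the guarded form; `zero_check_backend` and the exact branch conditions are illustrative assumptions, not the code in util/cutils.c:

```c
/* Hypothetical helper: reports which branch the preprocessor selected. */
static const char *zero_check_backend(void)
{
#if defined(__SSE2__)
    return "sse2";
#elif defined(__aarch64__)  /* explicit existence test, clean under -Wundef */
    return "neon";
#else
    return "generic";
#endif
}
```

On an x86-64 build this would typically return "sse2", and on an AArch64 build "neon"; every other host falls through to "generic" without the preprocessor ever evaluating an undefined name.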



[Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-06-29 Thread vijayak
From: Vijay 

Use Neon instructions to perform zero checking of
buffer. This helps reduce total migration time.

Use case: Idle VM live migration with 4 VCPUS and 8GB ram
running CentOS 7.

Without Neon, the total migration time is 3.56 seconds:

Migration status: completed
total time: 3560 milliseconds
downtime: 33 milliseconds
setup: 5 milliseconds
transferred ram: 297907 kbytes
throughput: 685.76 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2062760 pages
skipped: 0 pages
normal: 69808 pages
normal bytes: 279232 kbytes
dirty sync count: 3

With Neon, the total migration time is 2.96 seconds:

Migration status: completed
total time: 2960 milliseconds
downtime: 65 milliseconds
setup: 4 milliseconds
transferred ram: 299869 kbytes
throughput: 830.19 mbps
remaining ram: 0 kbytes
total ram: 8519872 kbytes
duplicate: 2064313 pages
skipped: 0 pages
normal: 70294 pages
normal bytes: 281176 kbytes
dirty sync count: 3

Signed-off-by: Vijaya Kumar K 
Signed-off-by: Suresh 
---
 util/cutils.c |    7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/util/cutils.c b/util/cutils.c
index 5830a68..4779403 100644
--- a/util/cutils.c
+++ b/util/cutils.c
@@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
 #define SPLAT(p)       _mm_set1_epi8(*(p))
 #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
 #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
+#elif __aarch64__
+#include "arm_neon.h"
+#define VECTYPE        uint64x2_t
+#define ALL_EQ(v1, v2) \
+    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
+     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
+#define VEC_OR(v1, v2) ((v1) | (v2))
 #else
 #define VECTYPE        unsigned long
 #define SPLAT(p)       (*(p) * (~0UL / 255))
-- 
1.7.9.5
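
To show how the VECTYPE / SPLAT / ALL_EQ / VEC_OR macros fit together, here is a simplified zero-check loop built on the generic fallback. It is only a sketch under stated assumptions: VECTYPE and SPLAT follow the `#else` branch visible in the diff context, the scalar ALL_EQ and VEC_OR definitions are assumed, and `buffer_is_zero_sketch` is a made-up function, not the routine in util/cutils.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Generic fallback: VECTYPE and SPLAT as in the diff context above;
 * ALL_EQ and VEC_OR are assumed to be the plain scalar forms. */
#define VECTYPE        unsigned long
#define SPLAT(p)       (*(p) * (~0UL / 255))
#define ALL_EQ(v1, v2) ((v1) == (v2))
#define VEC_OR(v1, v2) ((v1) | (v2))

/* Hypothetical, simplified scan: true if all len bytes of buf are zero. */
static bool buffer_is_zero_sketch(const void *buf, size_t len)
{
    const unsigned char *bytes = buf;
    const unsigned char zero_byte = 0;
    const VECTYPE zero = SPLAT(&zero_byte);  /* zero byte copied to every lane */
    VECTYPE acc = zero;
    size_t i;

    /* OR whole vectors together; any set bit survives in the accumulator. */
    for (i = 0; i + sizeof(VECTYPE) <= len; i += sizeof(VECTYPE)) {
        VECTYPE v;

        memcpy(&v, bytes + i, sizeof(v));  /* avoids alignment assumptions */
        acc = VEC_OR(acc, v);
    }

    /* Check the bytes that did not fill a whole vector. */
    for (; i < len; i++) {
        if (bytes[i] != 0) {
            return false;
        }
    }

    return ALL_EQ(acc, zero);
}
```

The sketch ORs one vector per iteration and compares once at the end, which keeps it short but gives up any early exit; swapping in the Neon definitions only changes the four macros, not the loop.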




Re: [Qemu-devel] [PATCH v3 1/1] target-arm: Use Neon for zero checking

2016-06-29 Thread Paolo Bonzini


On 29/06/2016 10:47, vija...@cavium.com wrote:
> From: Vijay 
> 
> Use Neon instructions to perform zero checking of
> buffer. This helps reduce total migration time.
> 
> Use case: Idle VM live migration with 4 VCPUS and 8GB ram
> running CentOS 7.
> 
> Without Neon, the total migration time is 3.56 seconds:
> 
> Migration status: completed
> total time: 3560 milliseconds
> downtime: 33 milliseconds
> setup: 5 milliseconds
> transferred ram: 297907 kbytes
> throughput: 685.76 mbps
> remaining ram: 0 kbytes
> total ram: 8519872 kbytes
> duplicate: 2062760 pages
> skipped: 0 pages
> normal: 69808 pages
> normal bytes: 279232 kbytes
> dirty sync count: 3
> 
> With Neon, the total migration time is 2.96 seconds:
> 
> Migration status: completed
> total time: 2960 milliseconds
> downtime: 65 milliseconds
> setup: 4 milliseconds
> transferred ram: 299869 kbytes
> throughput: 830.19 mbps
> remaining ram: 0 kbytes
> total ram: 8519872 kbytes
> duplicate: 2064313 pages
> skipped: 0 pages
> normal: 70294 pages
> normal bytes: 281176 kbytes
> dirty sync count: 3
> 
> Signed-off-by: Vijaya Kumar K 
> Signed-off-by: Suresh 
> ---
>  util/cutils.c |    7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/util/cutils.c b/util/cutils.c
> index 5830a68..4779403 100644
> --- a/util/cutils.c
> +++ b/util/cutils.c
> @@ -184,6 +184,13 @@ int qemu_fdatasync(int fd)
>  #define SPLAT(p)       _mm_set1_epi8(*(p))
>  #define ALL_EQ(v1, v2) (_mm_movemask_epi8(_mm_cmpeq_epi8(v1, v2)) == 0xFFFF)
>  #define VEC_OR(v1, v2) (_mm_or_si128(v1, v2))
> +#elif __aarch64__
> +#include "arm_neon.h"
> +#define VECTYPE        uint64x2_t
> +#define ALL_EQ(v1, v2) \
> +    ((vgetq_lane_u64(v1, 0) == vgetq_lane_u64(v2, 0)) && \
> +     (vgetq_lane_u64(v1, 1) == vgetq_lane_u64(v2, 1)))
> +#define VEC_OR(v1, v2) ((v1) | (v2))
>  #else
>  #define VECTYPE        unsigned long
>  #define SPLAT(p)       (*(p) * (~0UL / 255))
> 

Acked-by: Paolo Bonzini