Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-07-29 Thread Jussi Kivilinna
On 29.07.2014 15:35, Ard Biesheuvel wrote:
> On 30 June 2014 18:39, Jussi Kivilinna  wrote:
>> This patch adds an ARM NEON assembly implementation of the SHA-512 and
>> SHA-384 algorithms.
>>
>> tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:
>>
>> block-size  bytes/update  old-vs-new
>> 16          16            2.99x
>> 64          16            2.67x
>> 64          64            3.00x
>> 256         16            2.64x
>> 256         64            3.06x
>> 256         256           3.33x
>> 1024        16            2.53x
>> 1024        256           3.39x
>> 1024        1024          3.52x
>> 2048        16            2.50x
>> 2048        256           3.41x
>> 2048        1024          3.54x
>> 2048        2048          3.57x
>> 4096        16            2.49x
>> 4096        256           3.42x
>> 4096        1024          3.56x
>> 4096        4096          3.59x
>> 8192        16            2.48x
>> 8192        256           3.42x
>> 8192        1024          3.56x
>> 8192        4096          3.60x
>> 8192        8192          3.60x
>>
>> Acked-by: Ard Biesheuvel 
>> Tested-by: Ard Biesheuvel 
>> Signed-off-by: Jussi Kivilinna 
>>
>> ---
>>
>> Changes in v2:
>>  - Use ENTRY/ENDPROC
>>  - Don't provide Thumb2 version
>>
>> v3:
>>  - Changelog moved below '---'
> 
> Hi Jussi,
> 
> What is the status of these patches?
> Have you sent them to Russell's patch tracker?
>

I sent them to the patch tracker a moment ago. Thanks for the reminder.

-Jussi

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-07-29 Thread Ard Biesheuvel
On 30 June 2014 18:39, Jussi Kivilinna  wrote:
> This patch adds an ARM NEON assembly implementation of the SHA-512 and
> SHA-384 algorithms.
>
> tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:
>
> block-size  bytes/update  old-vs-new
> 16          16            2.99x
> 64          16            2.67x
> 64          64            3.00x
> 256         16            2.64x
> 256         64            3.06x
> 256         256           3.33x
> 1024        16            2.53x
> 1024        256           3.39x
> 1024        1024          3.52x
> 2048        16            2.50x
> 2048        256           3.41x
> 2048        1024          3.54x
> 2048        2048          3.57x
> 4096        16            2.49x
> 4096        256           3.42x
> 4096        1024          3.56x
> 4096        4096          3.59x
> 8192        16            2.48x
> 8192        256           3.42x
> 8192        1024          3.56x
> 8192        4096          3.60x
> 8192        8192          3.60x
>
> Acked-by: Ard Biesheuvel 
> Tested-by: Ard Biesheuvel 
> Signed-off-by: Jussi Kivilinna 
>
> ---
>
> Changes in v2:
>  - Use ENTRY/ENDPROC
>  - Don't provide Thumb2 version
>
> v3:
>  - Changelog moved below '---'

Hi Jussi,

What is the status of these patches?
Have you sent them to Russell's patch tracker?

-- 
Ard.



Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Jussi Kivilinna
On 30.06.2014 21:13, Ard Biesheuvel wrote:
> On 30 June 2014 18:39, Jussi Kivilinna  wrote:
>> This patch adds an ARM NEON assembly implementation of the SHA-512 and
>> SHA-384 algorithms.
>>
>> tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:
>>
>> block-size  bytes/update  old-vs-new
>> 16          16            2.99x
>> 64          16            2.67x
>> 64          64            3.00x
>> 256         16            2.64x
>> 256         64            3.06x
>> 256         256           3.33x
>> 1024        16            2.53x
>> 1024        256           3.39x
>> 1024        1024          3.52x
>> 2048        16            2.50x
>> 2048        256           3.41x
>> 2048        1024          3.54x
>> 2048        2048          3.57x
>> 4096        16            2.49x
>> 4096        256           3.42x
>> 4096        1024          3.56x
>> 4096        4096          3.59x
>> 8192        16            2.48x
>> 8192        256           3.42x
>> 8192        1024          3.56x
>> 8192        4096          3.60x
>> 8192        8192          3.60x
>>
>> Acked-by: Ard Biesheuvel 
>> Tested-by: Ard Biesheuvel 
>> Signed-off-by: Jussi Kivilinna 
>>
> 
> Likewise for this one: if nobody has any more comments, this should go
> into the patch system.
> 
> One remaining question though: is this code (and the SHA1 code) known
> to be broken for big endian or just untested?
> 

Untested and probably broken, so I've disabled it when CPU_BIG_ENDIAN=y.
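SHA-512 reads each message block as sixteen big-endian 64-bit words (FIPS 180-4). The minimal C model below illustrates why a routine written for little-endian CPUs can silently break on CPU_BIG_ENDIAN=y. It assumes the assembly loads message words natively and then byte-swaps them unconditionally (for example with vrev64.8); that is an assumption, since the load code is not quoted in this thread. The helper names load_be64, load_le64, bswap64 and load_native_then_swap are hypothetical and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* What SHA-512 requires: byte 0 is the most significant byte.
 * Correct on any host byte order. */
static uint64_t load_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Model of a native 64-bit load on a little-endian CPU:
 * byte 0 ends up as the least significant byte. */
static uint64_t load_le64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Unconditional byte reversal, like a vrev64.8-style swap (assumed). */
static uint64_t bswap64(uint64_t x)
{
	uint64_t r = 0;

	for (int i = 0; i < 8; i++) {
		r = (r << 8) | (x & 0xff);
		x >>= 8;
	}
	return r;
}

/* "Native load, then always swap" yields the SHA-512 word only when the
 * native load is little-endian. On a big-endian CPU the native load
 * already equals load_be64(), so the extra swap corrupts the word. */
static uint64_t load_native_then_swap(const uint8_t *p, int host_is_big_endian)
{
	uint64_t native = host_is_big_endian ? load_be64(p) : load_le64(p);

	return bswap64(native);
}
```

For the bytes 01 02 ... 08, load_be64() returns 0x0102030405060708; load_native_then_swap() reproduces it when host_is_big_endian is 0, but returns 0x0807060504030201 when it is 1.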

-Jussi

> Thanks,
> Ard.
> 

Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Ard Biesheuvel
On 30 June 2014 18:39, Jussi Kivilinna  wrote:
> This patch adds an ARM NEON assembly implementation of the SHA-512 and
> SHA-384 algorithms.
>
> tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:
>
> block-size  bytes/update  old-vs-new
> 16          16            2.99x
> 64          16            2.67x
> 64          64            3.00x
> 256         16            2.64x
> 256         64            3.06x
> 256         256           3.33x
> 1024        16            2.53x
> 1024        256           3.39x
> 1024        1024          3.52x
> 2048        16            2.50x
> 2048        256           3.41x
> 2048        1024          3.54x
> 2048        2048          3.57x
> 4096        16            2.49x
> 4096        256           3.42x
> 4096        1024          3.56x
> 4096        4096          3.59x
> 8192        16            2.48x
> 8192        256           3.42x
> 8192        1024          3.56x
> 8192        4096          3.60x
> 8192        8192          3.60x
>
> Acked-by: Ard Biesheuvel 
> Tested-by: Ard Biesheuvel 
> Signed-off-by: Jussi Kivilinna 
>

Likewise for this one: if nobody has any more comments, this should go
into the patch system.

One remaining question though: is this code (and the SHA1 code) known
to be broken for big endian or just untested?

Thanks,
Ard.


[PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Jussi Kivilinna
This patch adds an ARM NEON assembly implementation of the SHA-512 and
SHA-384 algorithms.

tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

block-size  bytes/update  old-vs-new
16          16            2.99x
64          16            2.67x
64          64            3.00x
256         16            2.64x
256         64            3.06x
256         256           3.33x
1024        16            2.53x
1024        256           3.39x
1024        1024          3.52x
2048        16            2.50x
2048        256           3.41x
2048        1024          3.54x
2048        2048          3.57x
4096        16            2.49x
4096        256           3.42x
4096        1024          3.56x
4096        4096          3.59x
8192        16            2.48x
8192        256           3.42x
8192        1024          3.56x
8192        4096          3.60x
8192        8192          3.60x

Acked-by: Ard Biesheuvel 
Tested-by: Ard Biesheuvel 
Signed-off-by: Jussi Kivilinna 

---

Changes in v2:
 - Use ENTRY/ENDPROC
 - Don't provide Thumb2 version

v3:
 - Changelog moved below '---'
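The comment at the top of the rounds2_0_63 macro in sha512-armv7-neon.S gives the per-round formula t1 = h + Sum1 (e) + Ch (e, f, g) + k[t] + w[t]. The minimal C sketch below, which is not part of the patch, spells that formula out so the NEON sequence can be checked against it. The helper names (rotr64, sum1, ch, sha512_t1) are hypothetical; the rotation counts 14, 18 and 41 are the ones used by the shifts in the macro. ARMv7 NEON has no 64-bit vector rotate instruction, so each rotation is built from a vshr.u64/vshl.u64 pair combined with veor, and Ch is the bitwise select that vbsl computes with e as the mask.

```c
#include <assert.h>
#include <stdint.h>

/* ROTR(x, n) from two shifts, mirroring vshr.u64 #n / vshl.u64 #64 - n.
 * The two halves share no bits, so XOR (veor) equals OR here.
 * Valid for 0 < n < 64. */
static uint64_t rotr64(uint64_t x, unsigned int n)
{
	return (x >> n) ^ (x << (64 - n));
}

/* Sum1(e) = ROTR14(e) ^ ROTR18(e) ^ ROTR41(e) */
static uint64_t sum1(uint64_t e)
{
	return rotr64(e, 14) ^ rotr64(e, 18) ^ rotr64(e, 41);
}

/* Ch(e, f, g): take f where e has a 1 bit, g elsewhere.
 * vbsl with e in the destination register computes exactly this. */
static uint64_t ch(uint64_t e, uint64_t f, uint64_t g)
{
	return (e & f) ^ (~e & g);
}

/* t1 = h + Sum1(e) + Ch(e, f, g) + k[t] + w[t], all modulo 2^64
 * (unsigned overflow wraps, matching vadd.u64). */
static uint64_t sha512_t1(uint64_t e, uint64_t f, uint64_t g, uint64_t h,
			  uint64_t k, uint64_t w)
{
	return h + sum1(e) + ch(e, f, g) + k + w;
}
```

With e = 0, Sum1(e) is 0 and Ch(e, f, g) is g, so sha512_t1(0, 5, 7, 11, 13, 17) returns 11 + 7 + 13 + 17 = 48.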
---
 arch/arm/crypto/Makefile            |    2 
 arch/arm/crypto/sha512-armv7-neon.S |  455 +++
 arch/arm/crypto/sha512_neon_glue.c  |  305 +++
 crypto/Kconfig                      |   15 +
 4 files changed, 777 insertions(+)
 create mode 100644 arch/arm/crypto/sha512-armv7-neon.S
 create mode 100644 arch/arm/crypto/sha512_neon_glue.c

diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 374956d..b48fa34 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -6,11 +6,13 @@ obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
 obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
+obj-$(CONFIG_CRYPTO_SHA512_ARM_NEON) += sha512-arm-neon.o
 
 aes-arm-y  := aes-armv4.o aes_glue.o
 aes-arm-bs-y   := aesbs-core.o aesbs-glue.o
 sha1-arm-y := sha1-armv4-large.o sha1_glue.o
 sha1-arm-neon-y := sha1-armv7-neon.o sha1_neon_glue.o
+sha512-arm-neon-y := sha512-armv7-neon.o sha512_neon_glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/sha512-armv7-neon.S b/arch/arm/crypto/sha512-armv7-neon.S
new file mode 100644
index 000..fe99472
--- /dev/null
+++ b/arch/arm/crypto/sha512-armv7-neon.S
@@ -0,0 +1,455 @@
+/* sha512-armv7-neon.S  -  ARM/NEON assembly implementation of SHA-512 transform
+ *
+ * Copyright © 2013-2014 Jussi Kivilinna 
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include 
+
+
+.syntax unified
+.code   32
+.fpu neon
+
+.text
+
+/* structure of SHA512_CONTEXT */
+#define hd_a 0
+#define hd_b ((hd_a) + 8)
+#define hd_c ((hd_b) + 8)
+#define hd_d ((hd_c) + 8)
+#define hd_e ((hd_d) + 8)
+#define hd_f ((hd_e) + 8)
+#define hd_g ((hd_f) + 8)
+
+/* register macros */
+#define RK %r2
+
+#define RA d0
+#define RB d1
+#define RC d2
+#define RD d3
+#define RE d4
+#define RF d5
+#define RG d6
+#define RH d7
+
+#define RT0 d8
+#define RT1 d9
+#define RT2 d10
+#define RT3 d11
+#define RT4 d12
+#define RT5 d13
+#define RT6 d14
+#define RT7 d15
+
+#define RT01q q4
+#define RT23q q5
+#define RT45q q6
+#define RT67q q7
+
+#define RW0 d16
+#define RW1 d17
+#define RW2 d18
+#define RW3 d19
+#define RW4 d20
+#define RW5 d21
+#define RW6 d22
+#define RW7 d23
+#define RW8 d24
+#define RW9 d25
+#define RW10 d26
+#define RW11 d27
+#define RW12 d28
+#define RW13 d29
+#define RW14 d30
+#define RW15 d31
+
+#define RW01q q8
+#define RW23q q9
+#define RW45q q10
+#define RW67q q11
+#define RW89q q12
+#define RW1011q q13
+#define RW1213q q14
+#define RW1415q q15
+
+/***
+ * ARM assembly implementation of sha512 transform
+ ***/
+#define rounds2_0_63(ra, rb, rc, rd, re, rf, rg, rh, rw0, rw1, rw01q, rw2, \
+ rw23q, rw1415q, rw9, rw10, interleave_op, arg1) \
+   /* t1 = h + Sum1 (e) + Ch (e, f, g) + k[t] + w[t]; */ \
+   vshr.u64 RT2, re, #14; \
+   vshl.u64 RT3, re, #64 - 14; \
+   interleave_op(arg1); \
+   vshr.u64 RT4, re, #18; \
+   vshl.u64 RT5, re, #64 - 18; \
+   vld1.64 {RT0}, [RK]!; \
+   veor.64 RT23q, RT23q, RT45q; \
+   vshr.u64 RT4, re, #41; \
+   vshl.u64 RT5, re, #64 - 41; \
+   vadd.u64 RT0, RT0, rw0; \
+   veor.64 RT23q, RT23q, RT45q; \
+   vmov.64 RT7, re; \
+   veor.64 RT1, RT2, RT3; \
+   vbsl.64 RT7, rf, rg; \
+   \
+   vadd.u64 R