[PATCH] crypto: arm/chacha20 - always use vrev for 16-bit rotates

2018-07-24 Thread Eric Biggers
From: Eric Biggers 

The 4-way ChaCha20 NEON code implements 16-bit rotates with vrev32.16,
but the one-way code (used on remainder blocks) implements them with
vshl + vsri, which is slower.  Switch the one-way code to vrev32.16 too.
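
For reference, rotating a 32-bit word left by 16 is simply a swap of its
two halfwords, which is why a single vrev32.16 per element can replace the
vshl/vsri pair. A purely illustrative C equivalent for one word (not part
of this patch):

  #include <stdint.h>

  /* rotl32(x, 16): the two 16-bit halves trade places */
  static inline uint32_t rotl32_by_16(uint32_t x)
  {
          return (x << 16) | (x >> 16);
  }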

Signed-off-by: Eric Biggers 
---
 arch/arm/crypto/chacha20-neon-core.S | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/arch/arm/crypto/chacha20-neon-core.S b/arch/arm/crypto/chacha20-neon-core.S
index 3fecb2124c35..451a849ad518 100644
--- a/arch/arm/crypto/chacha20-neon-core.S
+++ b/arch/arm/crypto/chacha20-neon-core.S
@@ -51,9 +51,8 @@ ENTRY(chacha20_block_xor_neon)
 .Ldoubleround:
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
@@ -82,9 +81,8 @@ ENTRY(chacha20_block_xor_neon)
 
 	// x0 += x1, x3 = rotl32(x3 ^ x0, 16)
 	vadd.i32	q0, q0, q1
-	veor		q4, q3, q0
-	vshl.u32	q3, q4, #16
-	vsri.u32	q3, q4, #16
+	veor		q3, q3, q0
+	vrev32.16	q3, q3
 
 	// x2 += x3, x1 = rotl32(x1 ^ x2, 12)
 	vadd.i32	q2, q2, q3
-- 
2.18.0



Re: [PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-24 Thread Sebastian Andrzej Siewior
On 2018-07-24 19:12:20 [+0200], Ard Biesheuvel wrote:
> Vakul reports a considerable performance hit when running the accelerated
> arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
> been updated to take the TIF_NEED_RESCHED flag into account.

just in time. I will try to come up with some numbers on RT with the
original patch and with that one. I have it almost working…

Sebastian


[PATCH 1/4] crypto/arm64: ghash - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
As reported by Vakul, checking the TIF_NEED_RESCHED flag after every
iteration of the GHASH and AES-GCM core routines is having a considerable
performance impact on cores such as the Cortex-A53 with Crypto Extensions
implemented.

GHASH performance is down by 22% for large block sizes, and AES-GCM is
down by 16% for large block sizes and 128 bit keys. This appears to be
a result of the high performance of the crypto instructions themselves
(2.0 cycles per byte for GHASH, 3.0 cpb for AES-GCM), combined with the
relatively poor load/store performance of this simple core.

So let's reduce this performance impact by only doing the yield check
once every 32 blocks for GHASH (or 4 when using the version based on
8-bit polynomial multiplication), and once every 16 blocks for AES-GCM.
This way, we recover most of the performance while still limiting the
duration of scheduling blackouts due to disabling preemption to ~1000
cycles.
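
As a quick sanity check of the ~1000 cycle figure, using the throughput
numbers quoted above (figures taken from this message, not re-measured):

  GHASH:   2.0 cycles/byte * 16 bytes/block * 32 blocks ~= 1024 cycles
  AES-GCM: 3.0 cycles/byte * 16 bytes/block * 16 blocks ~=  768 cycles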

Cc: Vakul Garg 
Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/ghash-ce-core.S | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/arm64/crypto/ghash-ce-core.S b/arch/arm64/crypto/ghash-ce-core.S
index dcffb9e77589..9c14beaabeee 100644
--- a/arch/arm64/crypto/ghash-ce-core.S
+++ b/arch/arm64/crypto/ghash-ce-core.S
@@ -212,7 +212,7 @@
 	ushr	XL.2d, XL.2d, #1
.endm
 
-   .macro  __pmull_ghash, pn
+   .macro  __pmull_ghash, pn, yield_count
frame_push  5
 
mov x19, x0
@@ -259,6 +259,9 @@ CPU_LE( rev64   T1.16b, T1.16b  )
eor T2.16b, T2.16b, XH.16b
eor XL.16b, XL.16b, T2.16b
 
+   tst w19, #(\yield_count - 1)
+	b.ne	1b
+
cbz w19, 3f
 
if_will_cond_yield_neon
@@ -279,11 +282,11 @@ CPU_LE(   rev64   T1.16b, T1.16b  )
 * struct ghash_key const *k, const char *head)
 */
 ENTRY(pmull_ghash_update_p64)
-   __pmull_ghash   p64
+   __pmull_ghash   p64, 32
 ENDPROC(pmull_ghash_update_p64)
 
 ENTRY(pmull_ghash_update_p8)
-   __pmull_ghash   p8
+   __pmull_ghash   p8, 4
 ENDPROC(pmull_ghash_update_p8)
 
 	KS	.req	v8
@@ -428,6 +431,9 @@ CPU_LE( rev x28, x28)
st1 {INP.16b}, [x21], #16
.endif
 
+   tst w19, #0xf   // do yield check only
+	b.ne	1b	// once every 16 blocks
+
cbz w19, 3f
 
if_will_cond_yield_neon
-- 
2.11.0



[PATCH 0/4] crypto/arm64: reduce impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Vakul reports a considerable performance hit when running the accelerated
arm64 crypto routines with CONFIG_PREEMPT=y configured, now that they have
been updated to take the TIF_NEED_RESCHED flag into account.

The issue appears to be caused by the fact that the Cortex-A53, the core in
question, has a high-end implementation of the Crypto Extensions and a
shallow pipeline, which means that even sequential algorithms that would be
held back by pipeline stalls on high-end out-of-order cores run at maximum
speed here. This means SHA-1, SHA-2, GHASH and AES in GCM and CCM modes run
at a speed on the order of 2 to 4 cycles per byte, and are currently
implemented to check the TIF_NEED_RESCHED flag after each iteration, which
may process as little as 16 bytes (for GHASH).

Obviously, every cycle of overhead hurts in this context, and given that
the A53's load/store unit is not quite high end, any delays caused by
memory accesses that occur in the inner loop of the algorithms are going
to be quite significant, hence the performance regression.

So reduce the frequency at which the NEON yield checks are performed, so
that they occur roughly once every 1000 cycles, which is hopefully a
reasonable tradeoff between throughput and worst case scheduling latency.
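
For illustration only, the shape of the resulting per-block loop in plain C
(all names below are made up; the actual changes are in the per-algorithm
assembly patches that follow):

  #include <stdint.h>

  #define YIELD_INTERVAL 32   /* e.g. 32 for GHASH, 16 for AES-GCM */

  /* stand-in for the real per-block routine */
  static void process_one_block(uint8_t *dst, const uint8_t *src)
  {
          (void)dst; (void)src;
  }

  /* models a TIF_NEED_RESCHED check */
  static int need_resched(void)
  {
          return 0;
  }

  /* models briefly giving up the NEON unit so preemption can happen */
  static void yield_neon(void)
  {
  }

  static void process_blocks(uint8_t *dst, const uint8_t *src, int blocks)
  {
          while (blocks > 0) {
                  process_one_block(dst, src);
                  dst += 16;
                  src += 16;
                  blocks--;

                  /* cheap test, taken for most blocks: skip the yield check */
                  if (blocks & (YIELD_INTERVAL - 1))
                          continue;

                  if (blocks && need_resched())
                          yield_neon();
          }
  }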

Ard Biesheuvel (4):
  crypto/arm64: ghash - reduce performance impact of NEON yield checks
  crypto/arm64: aes-ccm - reduce performance impact of NEON yield checks
  crypto/arm64: sha1 - reduce performance impact of NEON yield checks
  crypto/arm64: sha2 - reduce performance impact of NEON yield checks

 arch/arm64/crypto/aes-ce-ccm-core.S |  3 +++
 arch/arm64/crypto/ghash-ce-core.S   | 12 +++++++++---
 arch/arm64/crypto/sha1-ce-core.S    |  3 +++
 arch/arm64/crypto/sha2-ce-core.S    |  3 +++
 4 files changed, 18 insertions(+), 3 deletions(-)

-- 
2.11.0



[PATCH 3/4] crypto/arm64: sha1 - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 4 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha1-ce-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/sha1-ce-core.S b/arch/arm64/crypto/sha1-ce-core.S
index 78eb35fb5056..f592c55218d0 100644
--- a/arch/arm64/crypto/sha1-ce-core.S
+++ b/arch/arm64/crypto/sha1-ce-core.S
@@ -129,6 +129,9 @@ CPU_LE( rev32   v11.16b, v11.16b)
add dgbv.2s, dgbv.2s, dg1v.2s
add dgav.4s, dgav.4s, dg0v.4s
 
+   tst w21, #0x3   // yield only every 4 blocks
+	b.ne	1b
+
cbz w21, 3f
 
if_will_cond_yield_neon
-- 
2.11.0



[PATCH 4/4] crypto/arm64: sha2 - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 4 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/sha2-ce-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/sha2-ce-core.S b/arch/arm64/crypto/sha2-ce-core.S
index cd8b36412469..201a33ff6830 100644
--- a/arch/arm64/crypto/sha2-ce-core.S
+++ b/arch/arm64/crypto/sha2-ce-core.S
@@ -136,6 +136,9 @@ CPU_LE( rev32   v19.16b, v19.16b)
add dgav.4s, dgav.4s, dg0v.4s
add dgbv.4s, dgbv.4s, dg1v.4s
 
+   tst w21, #0x3   // yield only every 4 blocks
+	b.ne	1b
+
/* handled all input blocks? */
cbz w21, 3f
 
-- 
2.11.0



[PATCH 2/4] crypto/arm64: aes-ccm - reduce performance impact of NEON yield checks

2018-07-24 Thread Ard Biesheuvel
Only perform the NEON yield check for every 8 blocks of input, to
prevent taking a considerable performance hit on cores with very
fast crypto instructions and comparatively slow memory accesses,
such as the Cortex-A53.

Signed-off-by: Ard Biesheuvel 
---
 arch/arm64/crypto/aes-ce-ccm-core.S | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/arm64/crypto/aes-ce-ccm-core.S b/arch/arm64/crypto/aes-ce-ccm-core.S
index 88f5aef7934c..627710cdc220 100644
--- a/arch/arm64/crypto/aes-ce-ccm-core.S
+++ b/arch/arm64/crypto/aes-ce-ccm-core.S
@@ -208,6 +208,9 @@ CPU_LE(	rev	x26, x26	)	/* keep swabbed ctr in reg */
 	st1	{v1.16b}, [x19], #16	/* write output block */
beq 5f
 
+	tst	w21, #(0x7 * 16)	/* yield every 8 blocks */
+	b.ne	0b
+
if_will_cond_yield_neon
st1 {v0.16b}, [x24] /* store mac */
do_cond_yield_neon
-- 
2.11.0


