Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Herbert Xu
On Wed, Nov 02, 2016 at 11:00:00PM +0100, Jason A. Donenfeld wrote:
>
> Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
> probably best to have the two paths in there, and not reduce it to
> one.

FWIW I'd rather live with a 6% slowdown than having two different
code paths in the generic code.  Anyone who cares about 6% would
be much better off writing an assembly version of the code.

Cheers,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Jason A. Donenfeld
On Wed, Nov 2, 2016 at 10:26 PM, Herbert Xu  wrote:
> What I'm interested in is whether the new code is sufficiently
> close in performance to the old code, particularonly on x86.
>
> I'd much rather only have a single set of code for all architectures.
> After all, this is meant to be a generic implementation.

Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
probably best to have the two paths in there, and not reduce it to
one.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Fast Code and HAVE_EFFICIENT_UNALIGNED_ACCESS (was: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access)

2016-11-02 Thread Jeffrey Walton
On Wed, Nov 2, 2016 at 5:25 PM, Jason A. Donenfeld  wrote:
> These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:
>
> s390 arm arm64 powerpc x86 x86_64
>
> So, these will use the original old code.
>
> The architectures that will thus use the new code are:
>
> alpha arc avr32 blackfin c6x cris frv h7300 hexagon ia64 m32r m68k
> metag microblaze mips mn10300 nios2 openrisc parisc score sh sparc
> tile um unicore32 xtensa

What I have found in practice from helping maintain a security library
and running benchmarks until my eyes bled

UNALIGNED_ACCESS is a kiss of death. It effectively prohibits -O3 and
above due to undefined behavior in C and problems with GCC
vectorization. In the bigger picture, it simply slows things down.

Once we moved away from UNALIGNED_ACCESS and started testing at -O3
and -O5, the benchmarks enjoyed non-trivial speedups on top of any
speedups we were trying to achieve with hand tuned assembly language
routines. Effectively, the best speedup was the sum of C-code and ASM;
they were not disjoint as they appear.

The one wrinkle for UNALIGNED_ACCESS is Bernstein's compressed tables
(https://cr.yp.to/antiforgery/cachetiming-20050414.pdf).
UNALIGNED_ACCESS meets some security goals. The techniques from
Bernstein's paper apply equally well to AES, Camellia and other
table-driven implementations. Painting with a broad brush (and as far
as I know), the kernel is not observing the recommendations. My
apologies if I parsed things incorrectly.

Jeff
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Herbert Xu
On Wed, Nov 02, 2016 at 10:25:00PM +0100, Jason A. Donenfeld wrote:
> These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:
> 
> s390 arm arm64 powerpc x86 x86_64
> 
> So, these will use the original old code.

What I'm interested in is whether the new code is sufficiently
close in performance to the old code, particularonly on x86.

I'd much rather only have a single set of code for all architectures.
After all, this is meant to be a generic implementation.

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Jason A. Donenfeld
These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:

s390 arm arm64 powerpc x86 x86_64

So, these will use the original old code.

The architectures that will thus use the new code are:

alpha arc avr32 blackfin c6x cris frv h7300 hexagon ia64 m32r m68k
metag microblaze mips mn10300 nios2 openrisc parisc score sh sparc
tile um unicore32 xtensa

Unfortunately, of these, the only machines I have access to are MIPS.
My SPARC access went cold a few years ago.

If you insist on a data-motivated approach approach, then I fear my
test of 1 out of 26 different architectures is woefully insufficient.
Does anybody else on the list have access to more hardware and is
interested in benchmarking?

If not, is there a reasonable way to decide on this by considering the
added complexity of code? Are we able to reason best and worst cases
of instruction latency vs unalignment stalls for most CPU designs?
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Herbert Xu
On Wed, Nov 02, 2016 at 10:06:39PM +0100, Jason A. Donenfeld wrote:
> On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu  
> wrote:
> > Can you give some numbers please? What about other architectures
> > that your patch impacts?
> 
> Per [1], the patch gives a 181% speed up on MIPS32r2.
> 
> [1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html

What about architectures? In particular, what if we just use your
new code for all architectures.  How much would we lose?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Jason A. Donenfeld
On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu  wrote:
> Can you give some numbers please? What about other architectures
> that your patch impacts?

Per [1], the patch gives a 181% speed up on MIPS32r2.

[1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 1/16] crypto: skcipher - Add skcipher walk interface

2016-11-02 Thread Eric Biggers
Hi Herbert, just a few preliminary comments.  I haven't made it through
everything yet.

On Wed, Nov 02, 2016 at 07:19:02AM +0800, Herbert Xu wrote:
> +static int skcipher_walk_first(struct skcipher_walk *walk)
> +{
> + if (WARN_ON_ONCE(in_irq()))
> + return -EDEADLK;
> +
> + walk->nbytes = walk->total;
> + if (unlikely(!walk->total))
> + return 0;
> +
> + walk->buffer = NULL;
> + if (unlikely(((unsigned long)walk->iv & walk->alignmask))) {
> + int err = skcipher_copy_iv(walk);
> + if (err)
> + return err;
> + }
> +
> + walk->page = NULL;
> +
> + return skcipher_walk_next(walk);
> +}

I think the case where skcipher_copy_iv() fails may be handled incorrectly.
Wouldn't it need to set walk.nbytes to 0 so as to not confuse callers which
expect that behavior?  Or maybe it should be calling skcipher_walk_done().

> +static int skcipher_walk_skcipher(struct skcipher_walk *walk,
> +   struct skcipher_request *req)
> +{
> + struct crypto_skcipher *tfm = crypto_skcipher_reqtfm(req);
> +
> + scatterwalk_start(>in, req->src);
> + scatterwalk_start(>out, req->dst);
> +
> + walk->in.sg = req->src;
> + walk->out.sg = req->dst;

Setting walk->in.sg and walk->out.sg is redundant with the scatterwalk_start()
calls.

> +int skcipher_walk_virt(struct skcipher_walk *walk,
> +struct skcipher_request *req, bool atomic)
> +{
> + int err;
> +
> + walk->flags &= ~SKCIPHER_WALK_PHYS;
> +
> + err = skcipher_walk_skcipher(walk, req);
> +
> + walk->flags &= atomic ? ~SKCIPHER_WALK_SLEEP : ~0;
> +
> + return err;
> +}

This gets called with uninitialized 'walk.flags'.  This was somewhat of a
theoretical problem with the old blkcipher_walk code but it looks like now it
will interact badly with the new SKCIPHER_WALK_SLEEP flag.  As far as I can see,
whether the flag will end up set or not can depend on the uninitialized value.
It would be nice if this problem could be avoided entirely be setting flags=0.

I'm also wondering about the choice to not look at 'atomic' until after the call
to skcipher_walk_skcipher().  Wouldn't this mean that the choice of 'atomic'
would not be respected in e.g. the kmalloc() in skcipher_copy_iv()?

> +int skcipher_walk_async(struct skcipher_walk *walk,
> + struct skcipher_request *req)
> +{
> + walk->flags |= SKCIPHER_WALK_PHYS;
> +
> + INIT_LIST_HEAD(>buffers);
> +
> + return skcipher_walk_skcipher(walk, req);
> +}
> +EXPORT_SYMBOL_GPL(skcipher_walk_async);

I don't see any users of the "async" walking being introduced; are some planned?

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Sandy Harris
On Wed, Nov 2, 2016 at 4:09 PM, Herbert Xu  wrote:

> On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
>> On MIPS chips commonly found in inexpensive routers, this makes a big
>> difference in performance.
>>
>> Signed-off-by: Jason A. Donenfeld 
>
> Can you give some numbers please? What about other architectures
> that your patch impacts?

In general it is not always clear that using whatever hardware crypto
is available is a good idea. Not all such hardware is fast, some CPUs
are, some CPUs have hardware for AES, and even if the hardware is
faster than the CPU, the context switch overheads may exceed the
advantage.

Ideally the patch development or acceptance process would be
testing this, but I think it might be difficult to reach that ideal.

The exception is a hardware RNG; that should always be used unless
it is clearly awful. It cannot do harm, speed is not much of an issue,
and it solves the hardest problem in the random(4) driver, making
sure of correct initialisation before any use.
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Herbert Xu
On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
> On MIPS chips commonly found in inexpensive routers, this makes a big
> difference in performance.
> 
> Signed-off-by: Jason A. Donenfeld 

Can you give some numbers please? What about other architectures
that your patch impacts?

Thanks,
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] poly1305: generic C can be faster on chips with slow unaligned access

2016-11-02 Thread Jason A. Donenfeld
On MIPS chips commonly found in inexpensive routers, this makes a big
difference in performance.

Signed-off-by: Jason A. Donenfeld 
---
 crypto/poly1305_generic.c | 29 -
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..186e33d 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -65,11 +65,24 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
/* r &= 0xffc0ffc0ffc0fff */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ff;
dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x303;
dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00f;
+#else
+   u32 t0, t1, t2, t3;
+   t0 = le32_to_cpuvp(key +  0);
+   t1 = le32_to_cpuvp(key +  4);
+   t2 = le32_to_cpuvp(key +  8);
+   t3 = le32_to_cpuvp(key + 12);
+   dctx->r[0] = t0 & 0x3ff; t0 >>= 26; t0 |= t1 << 6;
+   dctx->r[1] = t0 & 0x303; t1 >>= 20; t1 |= t2 << 12;
+   dctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18;
+   dctx->r[3] = t2 & 0x3f03fff; t3 >>= 8;
+   dctx->r[4] = t3 & 0x00f;
+#endif
 }
 
 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
@@ -109,6 +122,9 @@ static unsigned int poly1305_blocks(struct 
poly1305_desc_ctx *dctx,
u32 s1, s2, s3, s4;
u32 h0, h1, h2, h3, h4;
u64 d0, d1, d2, d3, d4;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+   u32 t0, t1, t2, t3;
+#endif
unsigned int datalen;
 
if (unlikely(!dctx->sset)) {
@@ -135,13 +151,24 @@ static unsigned int poly1305_blocks(struct 
poly1305_desc_ctx *dctx,
h4 = dctx->h[4];
 
while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-
/* h += m[i] */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ff;
h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ff;
h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ff;
h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ff;
h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+#else
+   t0 = le32_to_cpuvp(src +  0);
+   t1 = le32_to_cpuvp(src +  4);
+   t2 = le32_to_cpuvp(src +  8);
+   t3 = le32_to_cpuvp(src + 12);
+   h0 += t0 & 0x3ff;
+   h1 += sru64)t1 << 32) | t0), 26) & 0x3ff;
+   h2 += sru64)t2 << 32) | t1), 20) & 0x3ff;
+   h3 += sru64)t3 << 32) | t2), 14) & 0x3ff;
+   h4 += (t3 >> 8) | hibit;
+#endif
 
/* h *= r */
d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
-- 
2.10.2

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] crypto: ccp - Update the command queue on errors

2016-11-02 Thread Gary R Hook
Move the command queue tail pointer when an error is
detected. Always return the error.

Signed-off-by: Gary R Hook 
---
 drivers/crypto/ccp/ccp-dev-v5.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/drivers/crypto/ccp/ccp-dev-v5.c b/drivers/crypto/ccp/ccp-dev-v5.c
index ff7816a..0baa99e 100644
--- a/drivers/crypto/ccp/ccp-dev-v5.c
+++ b/drivers/crypto/ccp/ccp-dev-v5.c
@@ -250,17 +250,20 @@ static int ccp5_do_cmd(struct ccp5_desc *desc,
ret = wait_event_interruptible(cmd_q->int_queue,
   cmd_q->int_rcvd);
if (ret || cmd_q->cmd_error) {
+   /* Log the error and flush the queue by
+* moving the head pointer
+*/
if (cmd_q->cmd_error)
ccp_log_error(cmd_q->ccp,
  cmd_q->cmd_error);
-   /* A version 5 device doesn't use Job IDs... */
+   iowrite32(tail, cmd_q->reg_head_lo);
if (!ret)
ret = -EIO;
}
cmd_q->int_rcvd = 0;
}
 
-   return 0;
+   return ret;
 }
 
 static int ccp5_perform_aes(struct ccp_op *op)

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] Crypto: mv_cesa: Switch to using managed resources

2016-11-02 Thread Nadim Almas
Switch to resource-managed function devm_kzalloc instead
of kzalloc and remove unneeded kfree

Also, remove kfree in probe function and remove
function, mv_remove as it is now has nothing to do.
The Coccinelle semantic patch used to make this change is as follows:
//
@platform@
identifier p, probefn, removefn;
@@
struct platform_driver p = {
.probe = probefn,
.remove = removefn,
};

@prb@
identifier platform.probefn, pdev;
expression e, e1, e2;
@@
probefn(struct platform_device *pdev, ...) {
<+...
- e = kzalloc(e1, e2)
+ e = devm_kzalloc(>dev, e1, e2)
...
?-kfree(e);
...+>
}
@rem depends on prb@
identifier platform.removefn;
expression prb.e;
@@
removefn(...) {
<...
- kfree(e);
...>
}
//

Signed-off-by: Nadim Almas 
---
 drivers/crypto/mv_cesa.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/drivers/crypto/mv_cesa.c b/drivers/crypto/mv_cesa.c
index 104e9ce..451fa18 100644
--- a/drivers/crypto/mv_cesa.c
+++ b/drivers/crypto/mv_cesa.c
@@ -1073,7 +1073,7 @@ static int mv_probe(struct platform_device *pdev)
if (!res)
return -ENXIO;
 
-   cp = kzalloc(sizeof(*cp), GFP_KERNEL);
+   cp = devm_kzalloc(>dev, sizeof(*cp), GFP_KERNEL);
if (!cp)
return -ENOMEM;
 
@@ -1163,7 +1163,6 @@ static int mv_probe(struct platform_device *pdev)
 err_thread:
kthread_stop(cp->queue_th);
 err:
-   kfree(cp);
cpg = NULL;
return ret;
 }
@@ -1187,7 +1186,6 @@ static int mv_remove(struct platform_device *pdev)
clk_put(cp->clk);
}
 
-   kfree(cp);
cpg = NULL;
return 0;
 }
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html