RE: [PATCH 01/10] crypto: aead - allow to allocate AEAD requests on the stack

2018-05-02 Thread David Laight
From: Antoine Tenart
> Sent: 02 May 2018 10:57
> Adds the AEAD_REQUEST_ON_STACK primitive to allow allocating AEAD
> requests on the stack, as it can already be done with various other
> crypto algorithms within the kernel.
> 
> Signed-off-by: Antoine Tenart 
> ---
>  include/crypto/aead.h | 5 +
>  1 file changed, 5 insertions(+)
> 
> diff --git a/include/crypto/aead.h b/include/crypto/aead.h
> index 1e26f790b03f..b67064786546 100644
> --- a/include/crypto/aead.h
> +++ b/include/crypto/aead.h
> @@ -151,6 +151,11 @@ struct aead_alg {
>   struct crypto_alg base;
>  };
> 
> +#define AEAD_REQUEST_ON_STACK(name, tfm) \
> + char __##name##_desc[sizeof(struct aead_request) + \
> + crypto_aead_reqsize(tfm)] CRYPTO_MINALIGN_ATTR; \
> + struct aead_request *name = (void *)__##name##_desc
> +

This looks stunningly like a VLA - crypto_aead_reqsize() isn't a compile-time
constant, so neither is the array size.

David



RE: [PATCH v2 0/2] crypto: removing various VLAs

2018-04-11 Thread David Laight
From: Salvatore Mesoraca
> Sent: 09 April 2018 17:38
...
> > You can also do much better than allocating MAX_BLOCKSIZE + MAX_ALIGNMASK
> > bytes by requesting 'long' aligned on-stack memory.
> > The easiest way is to define a union like:
> >
> > union crypto_tmp {
> > u8 buf[CRYPTO_MAX_TMP_BUF];
> > long buf_align;
> > };
> >
> > Then in each function:
> >
> > union crypto_tmp tmp;
> > u8 *keystream = PTR_ALIGN(tmp.buf, alignmask + 1);
> >
> > I think CRYPTO_MAX_TMP_BUF needs to be MAX_BLOCKSIZE + MAX_ALIGNMASK - sizeof(long).
> 
> Yeah, that would be nice, it might save us 4-8 bytes on the stack.
> But I was thinking, wouldn't it be even better to do something like:
> 
> u8 buf[CRYPTO_MAX_TMP_BUF] __aligned(__alignof__(long));
> u8 *keystream = PTR_ALIGN(buf, alignmask + 1);
> 
> In this case __aligned should work, if I'm not missing some other
> subtle GCC caveat.

Thinking further, there is no point aligning the buffer to less than
the maximum alignment allowed - it just adds code.

So you end up with:
#define MAX_STACK_ALIGN __alignof__(long)  /* Largest type the compiler can align on stack */
#define CRYPTO_MAX_TMP_BUF (MAX_BLOCKSIZE + MAX_ALIGNMASK + 1 - MAX_STACK_ALIGN)
u8 buf[CRYPTO_MAX_TMP_BUF] __aligned(MAX_STACK_ALIGN);
u8 *keystream = PTR_ALIGN(buf, MAX_ALIGNMASK + 1);

The last two lines could be put into a #define of their own so that the
'call sites' don't need to know the gory details of how the buffer is defined.
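For illustration, such a wrapper might look like this (a sketch only;
DEFINE_CRYPTO_TMP_BUF is a made-up name, the constants are the ones defined above):

#define DEFINE_CRYPTO_TMP_BUF(name) \
	u8 __##name##_buf[CRYPTO_MAX_TMP_BUF] __aligned(MAX_STACK_ALIGN); \
	u8 *name = PTR_ALIGN(__##name##_buf, MAX_ALIGNMASK + 1)

so a call site just writes DEFINE_CRYPTO_TMP_BUF(keystream); and uses 'keystream'.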

In principle you could just have:
u8 keystream[MAX_BLOCKSIZE] __aligned(MAX_ALIGNMASK + 1);

But that will go wrong if the stack alignment has gone wrong somewhere,
and it generates a double stack frame if the requested alignment is larger
than the expected stack alignment.

IIRC there is a gcc command line option to enforce stack alignment on
some/all function entry prologues. The gory details are held in some
old brain cells somewhere.

David



RE: [PATCH v2 0/2] crypto: removing various VLAs

2018-04-09 Thread David Laight
From: Salvatore Mesoraca
> Sent: 09 April 2018 14:55
> 
> v2:
>   As suggested by Herbert Xu, the blocksize and alignmask checks
>   have been moved to crypto_check_alg.
>   So, now, all the other separate checks are not necessary.
>   Also, the defines have been moved to include/crypto/algapi.h.
> 
> v1:
>   As suggested by Laura Abbott[1], I'm resending my patch with
>   MAX_BLOCKSIZE and MAX_ALIGNMASK defined in an header, so they
>   can be used in other places.
>   I took this opportunity to deal with some other VLAs not
>   handled in the old patch.

If the constants are visible they need better names.
Maybe CRYPTO_MAX_xxx.

You can also do much better than allocating MAX_BLOCKSIZE + MAX_ALIGNMASK
bytes by requesting 'long' aligned on-stack memory.
The easiest way is to define a union like:

union crypto_tmp {
u8 buf[CRYPTO_MAX_TMP_BUF];
long buf_align;
};

Then in each function:

union crypto_tmp tmp;
u8 *keystream = PTR_ALIGN(tmp.buf, alignmask + 1);

I think CRYPTO_MAX_TMP_BUF needs to be MAX_BLOCKSIZE + MAX_ALIGNMASK - sizeof(long).
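As a stand-alone illustration of the trick (a user-space sketch; PTR_ALIGN is
open-coded here and the sizes are assumptions, not the real crypto limits):

#include <stdint.h>
#include <stdio.h>

#define MAX_BLOCKSIZE	16	/* assumed value, for the sketch only */
#define MAX_ALIGNMASK	15	/* assumed value, for the sketch only */
#define CRYPTO_MAX_TMP_BUF (MAX_BLOCKSIZE + MAX_ALIGNMASK)

/* open-coded equivalent of the kernel's PTR_ALIGN() */
#define PTR_ALIGN(p, a) \
	((void *)(((uintptr_t)(p) + (a) - 1) & ~((uintptr_t)(a) - 1)))

union crypto_tmp {
	uint8_t buf[CRYPTO_MAX_TMP_BUF];
	long buf_align;		/* forces at least 'long' alignment */
};

int main(void)
{
	union crypto_tmp tmp;
	unsigned long alignmask = 15;	/* what the algorithm would report */
	uint8_t *keystream = PTR_ALIGN(tmp.buf, alignmask + 1);

	printf("buf=%p keystream=%p\n", (void *)tmp.buf, (void *)keystream);
	return 0;
}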

David


RE: [PATCH] crypto: ctr: avoid VLA use

2018-03-15 Thread David Laight
From: Eric Biggers
> Sent: 14 March 2018 18:32
...
> Also, I recall there being a long discussion a while back about how
> __aligned(16) doesn't work on local variables because the kernel's stack 
> pointer
> isn't guaranteed to maintain the alignment assumed by the compiler (see commit
> b8fbe71f7535)...

ISTR that gcc arbitrarily decided that the x86 stack (for 32 bit) would be
kept aligned to more than 4 bytes (16??) - probably so that xmm registers
could be written to stack locations.
This was a massive ABI change that they didn't tell anyone about!
While gcc-compiled code maintained the alignment, a lot of asm code didn't.
I don't know about Linux, but NetBSD didn't even align user stacks.

There is a gcc option to not assume that the stack is 'appropriately aligned',
but ISTR that it generates rather more code than one might have wished.

If the compiler does align the stack, it does so by generating a double
stack frame - not pretty at all.

David



RE: [PATCH] crypto: x86/twofish-3way - Fix %rbp usage

2017-12-19 Thread David Laight
From: Juergen Gross
> Sent: 19 December 2017 08:05
..
> 
> Exchanging 2 registers can be done without memory access via:
> 
> xor reg1, reg2
> xor reg2, reg1
> xor reg1, reg2

That'll generate horrid data dependencies.
ISTR that there are some optimisations for the stack,
so even 'push reg1', 'mov reg2,reg1', 'pop reg2' might
be faster than the above.
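To make the dependency chain explicit (plain C rather than the patch's asm,
purely for illustration):

/* Each statement consumes the result of the previous one, so the three
 * XORs cannot overlap in the pipeline; a move through a spare register
 * (or the stack) has no such chain. */
static inline void swap_xor(unsigned long *a, unsigned long *b)
{
	*a ^= *b;
	*b ^= *a;
	*a ^= *b;
}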

David



RE: [PATCH 16/22] xen-blkfront: Make use of the new sg_map helper function

2017-04-18 Thread David Laight
From: Logan Gunthorpe
> Sent: 13 April 2017 23:05
> Straightforward conversion to the new helper, except due to
> the lack of error path, we have to warn if unmapable memory
> is ever present in the sgl.
> 
> Signed-off-by: Logan Gunthorpe 
> ---
>  drivers/block/xen-blkfront.c | 33 +++--
>  1 file changed, 27 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/block/xen-blkfront.c b/drivers/block/xen-blkfront.c
> index 5067a0a..7dcf41d 100644
> --- a/drivers/block/xen-blkfront.c
> +++ b/drivers/block/xen-blkfront.c
> @@ -807,8 +807,19 @@ static int blkif_queue_rw_req(struct request *req, 
> struct blkfront_ring_info *ri
>   BUG_ON(sg->offset + sg->length > PAGE_SIZE);
> 
>   if (setup.need_copy) {
> - setup.bvec_off = sg->offset;
> - setup.bvec_data = kmap_atomic(sg_page(sg));
> + setup.bvec_off = 0;
> + setup.bvec_data = sg_map(sg, SG_KMAP_ATOMIC);
> + if (IS_ERR(setup.bvec_data)) {
> + /*
> +  * This should really never happen unless
> +  * the code is changed to use memory that is
> +  * not mappable in the sg. Seeing there is a
> +  * questionable error path out of here,
> +  * we WARN.
> +  */
> + WARN(1, "Non-mappable memory used in sg!");
> + return 1;
> + }
...

Perhaps add a flag to mark failure as 'unexpected' and trace (and panic?)
inside sg_map().
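Something along these lines, purely as a sketch (SG_MAP_MUST_NOT_FAIL and its
value are made up here, not part of the proposed sg_map() API):

#define SG_MAP_MUST_NOT_FAIL	0x100	/* hypothetical flag */

	setup.bvec_data = sg_map(sg, SG_KMAP_ATOMIC | SG_MAP_MUST_NOT_FAIL);
	/* sg_map() would then WARN (or panic) internally when the sgl
	 * contains unmappable memory, and callers like this one would not
	 * need their own questionable error path. */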

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-16 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 22:30
> Hi David,
> 
> > While not part of this change, the unrolled loops look as though
> > they just destroy the cpu cache.
> > I'd like to be convinced that anything does CRC over long enough buffers
> > to make it a gain at all.
> >
> > With modern (not that modern now) superscalar cpus you can often
> > get the loop instructions 'for free'.
> > Sometimes pipelining the loop is needed to get full throughput.
> > Unlike the IP checksum, you don't even have to 'loop carry' the
> > cpu carry flag.
> 
> Internal testing on a NVMe device with T10DIF enabled on 4k blocks
> shows a 20x - 30x improvement. Without these patches, crc_t10dif_generic
> uses over 60% of CPU time - with these patches CRC drops to single
> digits.
> 
> I should probably have lead with that, sorry.

I'm not doubting that using the cpu instruction for crcs gives a
massive performance boost.

Just that the heavily unrolled loop is unlikely to help overall.
Some 'cold cache' tests on shorter buffers might be illuminating.
 
> FWIW, the original patch showed a 3.7x gain on btrfs as well -
> 6dd7a82cc54e ("crypto: powerpc - Add POWER8 optimised crc32c")
> 
> When Anton wrote the original code he had access to IBM's internal
> tooling for looking at how instructions flow through the various stages
> of the CPU, so I trust it's pretty much optimal from that point of view.

Doesn't mean he used it :-)

David




RE: [PATCH 1/4] crypto: powerpc - Factor out the core CRC vpmsum algorithm

2017-03-15 Thread David Laight
From: Daniel Axtens
> Sent: 15 March 2017 12:38
> The core nuts and bolts of the crc32c vpmsum algorithm will
> also work for a number of other CRC algorithms with different
> polynomials. Factor out the function into a new asm file.
> 
> To handle multiple users of the function, a user simply
> provides constants, defines the name of their CRC function,
> and then #includes the core algorithm file.
...

While not part of this change, the unrolled loops look as though
they just destroy the cpu cache.
I'd like to be convinced that anything does CRC over long enough buffers
to make it a gain at all.

With modern (not that modern now) superscalar cpus you can often
get the loop instructions 'for free'.
Sometimes pipelining the loop is needed to get full throughput.
Unlike the IP checksum, you don't even have to 'loop carry' the
cpu carry flag.

David



RE: [PATCH v2 8/8] crypto/testmgr: Allocate only the required output size for hash tests

2017-01-11 Thread David Laight
From: Andy Lutomirski
> Sent: 10 January 2017 23:25
> There are some hashes (e.g. sha224) that have some internal trickery
> to make sure that only the correct number of output bytes are
> generated.  If something goes wrong, they could potentially overrun
> the output buffer.
> 
> Make the test more robust by allocating only enough space for the
> correct output size so that memory debugging will catch the error if
> the output is overrun.

Might be better to test this by allocating an overlong buffer
and then explicitly checking that the output hasn't overrun
the allowed space.

If nothing else the error message will be clearer.
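A sketch of that style of check (the poison value, slack size and variable
names are illustrative, not actual testmgr code):

#define POISON_BYTE	0xa5
#define SLACK		64

	int i;
	u8 *result = kmalloc(digest_len + SLACK, GFP_KERNEL);

	memset(result + digest_len, POISON_BYTE, SLACK);
	/* ... run the hash, which should only write result[0..digest_len-1] ... */
	for (i = 0; i < SLACK; i++)
		if (result[digest_len + i] != POISON_BYTE)
			pr_err("%s: overran the digest buffer by at least %d bytes\n",
			       algo, i + 1);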

David



RE: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-19 Thread David Laight
From: George Spelvin
> Sent: 17 December 2016 15:21
...
> uint32_t
> hsiphash24(char const *in, size_t len, uint32_t const key[2])
> {
>   uint32_t c = key[0];
>   uint32_t d = key[1];
>   uint32_t a = 0x6c796765 ^ 0x736f6d65;
>   uint32_t b = d ^ 0x74656462 ^ 0x646f7261;

I've not looked closely, but is that (in some sense) duplicating
the key length?
So you could set a = key[2] and b = key[3] and still have a working
hash - albeit not exactly the one specified.

I'll add another comment here...
Is it worth using the 32bit hash for IP addresses on 64bit systems that
can't do misaligned accesses?

David



RE: [PATCH v5 1/4] siphash: add cryptographically secure PRF

2016-12-16 Thread David Laight
From: George Spelvin
> Sent: 15 December 2016 23:29
> > If a halved version of SipHash can bring significant performance boost
> > (with 32b words instead of 64b words) with an acceptable security level
> > (64-bit enough?) then we may design such a version.
> 
> I was thinking if the key could be pushed to 80 bits, that would be nice,
> but honestly 64 bits is fine.  This is DoS protection, and while it's
> possible to brute-force a 64 bit secret, there are more effective (DDoS)
> attacks possible for the same cost.

A 32bit hash would also remove all the issues about the alignment
of IP addresses (etc) on 64bit systems.

> (I'd suggest a name of "HalfSipHash" to convey the reduced security
> effectively.)
> 
> > Regarding output size, are 64 bits sufficient?
> 
> As a replacement for jhash, 32 bits are sufficient.  It's for
> indexing an in-memory hash table on a 32-bit machine.

It is also worth remembering that if the intent is to generate
a hash table index (not a unique fingerprint) you will always
get collisions on the final value.
Randomness could still give overlong hash chains - which might
still need rehashing with a different key.

David




RE: [PATCH v5 2/4] siphash: add Nu{32,64} helpers

2016-12-16 Thread David Laight
From: Jason A. Donenfeld
> Sent: 15 December 2016 20:30
> These restore parity with the jhash interface by providing high
> performance helpers for common input sizes.
...
> +#define PREAMBLE(len) \
> + u64 v0 = 0x736f6d6570736575ULL; \
> + u64 v1 = 0x646f72616e646f6dULL; \
> + u64 v2 = 0x6c7967656e657261ULL; \
> + u64 v3 = 0x7465646279746573ULL; \
> + u64 b = ((u64)len) << 56; \
> + v3 ^= key[1]; \
> + v2 ^= key[0]; \
> + v1 ^= key[1]; \
> + v0 ^= key[0];

Isn't that equivalent to:
v0 = key[0];
v1 = key[1];
v2 = key[0] ^ (0x736f6d6570736575ULL ^ 0x646f72616e646f6dULL);
v3 = key[1] ^ (0x646f72616e646f6dULL ^ 0x7465646279746573ULL);

Those constants also look like ASCII strings.
What cryptographic analysis has been done on the values?

David



RE: [PATCH v5 3/4] secure_seq: use SipHash in place of MD5

2016-12-16 Thread David Laight
From: Jason A. Donenfeld
> Sent: 15 December 2016 20:30
> This gives a clear speed and security improvement. Siphash is both
> faster and is more solid crypto than the aging MD5.
> 
> Rather than manually filling MD5 buffers, for IPv6, we simply create
> a layout by a simple anonymous struct, for which gcc generates
> rather efficient code. For IPv4, we pass the values directly to the
> short input convenience functions.
...
> diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
> index 88a8e429fc3e..c80583bf3213 100644
...
> + const struct {
> + struct in6_addr saddr;
> + struct in6_addr daddr;
> + __be16 sport;
> + __be16 dport;
> + u32 padding;
> + } __aligned(SIPHASH_ALIGNMENT) combined = {
> + .saddr = *(struct in6_addr *)saddr,
> + .daddr = *(struct in6_addr *)daddr,
> + .sport = sport,
> + .dport = dport
> + };

I think you should explicitly initialise the 'padding'.
It can do no harm and makes it obvious that it is necessary.

You are still putting over-aligned data on stack.
You only need to align it to the alignment of u64 (not the size of u64).
If an on-stack item has a stronger alignment requirement than the stack
then gcc has to generate two stack frames for the function.

If you assign to each field (instead of using initialisers) then you
can get the alignment by making the first member an anonymous union
of in6_addr and u64.
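i.e. something like this (a sketch of that layout, not the actual patch):

	struct {
		union {
			struct in6_addr saddr;
			u64 align;	/* gives the whole struct u64 alignment */
		};
		struct in6_addr daddr;
		__be16 sport;
		__be16 dport;
		u32 padding;
	} combined;

	combined.saddr = *(struct in6_addr *)saddr;
	combined.daddr = *(struct in6_addr *)daddr;
	combined.sport = sport;
	combined.dport = dport;
	combined.padding = 0;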

Oh - and wait a bit longer between revisions.

David



RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread David Laight
From: Hannes Frederic Sowa
> Sent: 15 December 2016 14:57
> On 15.12.2016 14:56, David Laight wrote:
> > From: Hannes Frederic Sowa
> >> Sent: 15 December 2016 12:50
> >> On 15.12.2016 13:28, David Laight wrote:
> >>> From: Hannes Frederic Sowa
> >>>> Sent: 15 December 2016 12:23
> >>> ...
> >>>> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
> >>>> bytes on 32 bit. Do you question that?
> >>>
> >>> Yes.
> >>>
> >>> The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 
> >>> (etc).
> >>
> >> Hmm, u64 on 32 bit is unsigned long long and not unsigned long. Thus I
> >> am actually not sure if the ABI would say anything about that (sorry
> >> also for my wrong statement above).
> >>
> >> Alignment requirement of unsigned long long on gcc with -m32 actually
> >> seem to be 8.
> >
> > It depends on the architecture.
> > For x86 it is definitely 4.
> 
> May I ask for a reference?

Ask anyone who has had to do compatibility layers to support 32bit
binaries on 64bit systems.

> I couldn't see unsigned long long being
> mentioned in the ia32 abi spec that I found. I agree that those accesses
> might be synthetically assembled by gcc and for me the alignment of 4
> would have seemed natural. But my gcc at least in 32 bit mode disagrees
> with that.

Try (retyped):

echo 'struct { long a; long long b; } s; int bar(void) { return sizeof s; }' >foo.c
gcc [-m32] -O2 -S foo.c; cat foo.s

And look at what is generated.
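Or, as a stand-alone program (the expected numbers are an assumption about the
traditional i386 ABI, where long long is only 4-byte aligned inside structs):

#include <stdio.h>
#include <stddef.h>

struct s {
	long a;
	long long b;
};

int main(void)
{
	/* With gcc -m32 on x86 the expectation is offsetof(b) == 4 and
	 * sizeof == 12; an 8-byte member alignment would give 8 and 16. */
	printf("offsetof(b)=%zu sizeof=%zu\n",
	       offsetof(struct s, b), sizeof(struct s));
	return 0;
}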

> Right now ipv6 addresses have an alignment of 4. So we couldn't even
> naturally pass them to siphash but would need to copy them around, which
> I feel like a source of bugs.

That is more of a problem on systems that don't support misaligned accesses.
Reading the 64bit values with two explicit 32bit reads would work.
I think you can get gcc to do that by adding an aligned(4) attribute to the
structure member.

David



RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread David Laight
From: Hannes Frederic Sowa
> Sent: 15 December 2016 12:50
> On 15.12.2016 13:28, David Laight wrote:
> > From: Hannes Frederic Sowa
> >> Sent: 15 December 2016 12:23
> > ...
> >> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
> >> bytes on 32 bit. Do you question that?
> >
> > Yes.
> >
> > The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 (etc).
> 
> Hmm, u64 on 32 bit is unsigned long long and not unsigned long. Thus I
> am actually not sure if the ABI would say anything about that (sorry
> also for my wrong statement above).
> 
> Alignment requirement of unsigned long long on gcc with -m32 actually
> seem to be 8.

It depends on the architecture.
For x86 it is definitely 4.
It might be 8 for sparc, ppc and/or alpha.

David


RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread David Laight
From: Hannes Frederic Sowa
> Sent: 15 December 2016 12:23
...
> Hmm? Even the Intel ABI expects alignment of unsigned long long to be 8
> bytes on 32 bit. Do you question that?

Yes.

The linux ABI for x86 (32 bit) only requires 32bit alignment for u64 (etc).

David



RE: [PATCH v2 1/4] siphash: add cryptographically secure hashtable function

2016-12-15 Thread David Laight
From: Hannes Frederic Sowa
> Sent: 14 December 2016 22:03
> On 14.12.2016 13:46, Jason A. Donenfeld wrote:
> > Hi David,
> >
> > On Wed, Dec 14, 2016 at 10:56 AM, David Laight <david.lai...@aculab.com> 
> > wrote:
> >> ...
> >>> +u64 siphash24(const u8 *data, size_t len, const u8 
> >>> key[SIPHASH24_KEY_LEN])
> >> ...
> >>> + u64 k0 = get_unaligned_le64(key);
> >>> + u64 k1 = get_unaligned_le64(key + sizeof(u64));
> >> ...
> >>> + m = get_unaligned_le64(data);
> >>
> >> All these unaligned accesses are going to get expensive on architectures
> >> like sparc64.
> >
> > Yes, the unaligned accesses aren't pretty. Since in pretty much all
> > use cases thus far, the data can easily be made aligned, perhaps it
> > makes sense to create siphash24() and siphash24_unaligned(). Any
> > thoughts on doing something like that?
> 
> I fear that the alignment requirement will be a source of bugs on 32 bit
> machines, where you cannot even simply take a well aligned struct on a
> stack and put it into the normal siphash(aligned) function without
> adding alignment annotations everywhere. Even blocks returned from
> kmalloc on 32 bit are not aligned to 64 bit.

Are you doing anything that will require 64bit alignment on 32bit systems?
It is unlikely that the kernel can use any simd registers that have wider
alignment requirements.

You also really don't want to request on-stack items have large alignments.
While gcc can generate code to do it, it isn't pretty.

David




RE: [PATCH v3 1/3] siphash: add cryptographically secure hashtable function

2016-12-15 Thread David Laight
From: Linus Torvalds
> Sent: 15 December 2016 00:11
> On Wed, Dec 14, 2016 at 3:34 PM, Jason A. Donenfeld  wrote:
> >
> > Or does your reasonable dislike of "word" still allow for the use of
> > dword and qword, so that the current function names of:
> 
> dword really is confusing to people.
>
> If you have a MIPS background, it means 64 bits. While to people with
> Windows programming backgrounds it means 32 bits.

Guess what a DWORD_PTR is on 64bit windows ...
(it is an integer type).

David



RE: [PATCH v3 3/3] random: use siphash24 instead of md5 for get_random_int/long

2016-12-15 Thread David Laight
From: Jason A. Donenfeld
> Sent: 14 December 2016 18:46
...
> + ret = *chaining = siphash24((u8 *)&combined, offsetof(typeof(combined), end),

If you make the first argument 'const void *' you won't need the cast
on every call.

I'd also suggest making the key u64[2].
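i.e. roughly (a sketch of the suggested prototype, not the posted code):

u64 siphash24(const void *data, size_t len, const u64 key[2]);

so the (u8 *) casts at the call sites go away.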

David



RE: [PATCH] keys/encrypted: Fix two crypto-on-the-stack bugs

2016-12-13 Thread David Laight
From: Andy Lutomirski
> Sent: 12 December 2016 20:53
> The driver put a constant buffer of all zeros on the stack and
> pointed a scatterlist entry at it in two places.  This doesn't work
> with virtual stacks.  Use a static 16-byte buffer of zeros instead.
...

I didn't think you could dma from static data either.

David



RE: [PATCH] crypto: vmx - Ignore generated files

2016-07-20 Thread David Laight
From:  Paulo Flabiano Smorigo
> Sent: 19 July 2016 14:36
> Ignore assembly files generated by the perl script.
...
> diff --git a/drivers/crypto/vmx/.gitignore b/drivers/crypto/vmx/.gitignore
> new file mode 100644
> index 000..af4a7ce
> --- /dev/null
> +++ b/drivers/crypto/vmx/.gitignore
> @@ -0,0 +1,2 @@
> +aesp8-ppc.S
> +ghashp8-ppc.S

Shouldn't the generated files be written to the object tree?

I would hope the linux kernel builds from a readonly source tree.

David


RE: [PATCH 1/2] crypto: vmx - Adding asm subroutines for XTS

2016-07-12 Thread David Laight
From: Paulo Flabiano Smorigo
> Sent: 11 July 2016 20:08
> 
> This patch add XTS subroutines using VMX-crypto driver.
> 
> It gives a boost of 20 times using XTS.
> 
> This code has been adopted from the OpenSSL project in collaboration
> with the original author (Andy Polyakov ).

Yep, typical openssl code. 1000+ lines of uncommented impenetrable assembler.
There is 0 chance of anyone ever checking this does what it should.

David


RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 01 December 2015 18:37
...
> I was using esp-null merely to not have the crypto itself perturb
> the numbers (i.e., just focus on the s/w overhead for now), but here
> are the numbers for the stock linux kernel stack
>                  Gbps   peak cpu util
> esp-null          1.8   71%
> aes-gcm-c-256     1.6   79%
> aes-ccm-a-128     0.7   96%
> 
> That trend made me think that if we can get esp-null to be as close
> as possible to GSO/GRO, the rest will follow closely behind.

That's not how I read those figures.
They imply to me that there is a massive cost for the actual encryption
(particularly for aes-ccm-a-128) - so whatever you do to the esp-null
case won't help.

One way to get a view of the cost of the encryption (and copies)
is to do the operation twice.

David



RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 02 December 2015 12:12
> On (12/02/15 11:56), David Laight wrote:
> > >                  Gbps   peak cpu util
> > > esp-null          1.8   71%
> > > aes-gcm-c-256     1.6   79%
> > > aes-ccm-a-128     0.7   96%
> > >
> > > That trend made me think that if we can get esp-null to be as close
> > > as possible to GSO/GRO, the rest will follow closely behind.
> >
> > That's not how I read those figures.
> > They imply to me that there is a massive cost for the actual encryption
> > (particularly for aes-ccm-a-128) - so whatever you do to the esp-null
> > case won't help.
> 
> I'm not a crypto expert, but my understanding is that the CCM mode
> is the "older" encryption algorithm, and GCM is the way of the future.
> Plus, I think the GCM mode has some type of h/w support (hence the
> lower cpu util)
> 
> I'm sure that crypto has a cost, not disputing that, but my point
> was that 1.8 -> 1.6 -> 0.7 is a curve with a much gentler slope than
> the 9 Gbps (clear traffic, GSO, GRO)
> -> 4 Gbps (clear, no gro, gso)
> -> 1.8 (esp-null)
> That steeper slope smells of s/w perf that we need to resolve first,
> before getting into the work of faster crypto?

That isn't the way cpu cost works.
You are getting 0.7 Gbps with aes-ccm-a-128; scale the esp-null figure back to
that rate and it would use 0.7/1.8 * 71 = ~27% of the cpu.
So 69% of the cpu in the a-128 case is probably caused by the
encryption itself.
Even if the rest of the code cost nothing you'd not increase
above 1Gbps.

The sums for aes-gcm-c-256 are slightly better, about 15%.
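Spelling the arithmetic out with the figures quoted above:

	esp-null cpu scaled to 0.7 Gbps:  71% * 0.7/1.8 ~= 27%
	so aes-ccm-a-128 crypto cost:     96% - 27%      = 69%
	esp-null cpu scaled to 1.6 Gbps:  71% * 1.6/1.8 ~= 63%
	so aes-gcm-c-256 crypto cost:     79% - 63%      = 16%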

Ok, things aren't quite that simple since you are probably changing
the way data flows through the system as well.

Also, what/how are you measuring cpu use?
I'm not sure anything on Linux gives you a truly accurate value
when processes are running for very short periods.

On an SMP system you also get big effects when work is switched
between cpus. I've got some tests that run a lot faster if I
put all but one of the cpus into a busy-loop in userspace
(eg: while :; do :; done)!

David




RE: [PATCH v3 03/17] crypto: talitos - talitos_ptr renamed ptr for more lisibility

2015-04-17 Thread David Laight
From: Christophe Leroy
 Linux CodyingStyle recommends to use short variables for local
 variables. ptr is just good enough for those 3 lines functions.
 It helps keep single lines shorter than 80 characters.
...
 -static void to_talitos_ptr(struct talitos_ptr *talitos_ptr, dma_addr_t dma_addr)
 +static void to_talitos_ptr(struct talitos_ptr *ptr, dma_addr_t dma_addr)
  {
 - talitos_ptr->ptr = cpu_to_be32(lower_32_bits(dma_addr));
 - talitos_ptr->eptr = upper_32_bits(dma_addr);
 + ptr->ptr = cpu_to_be32(lower_32_bits(dma_addr));
 + ptr->eptr = upper_32_bits(dma_addr);
  }
...

Maybe, but 'ptr' isn't a good choice.

David


RE: [PATCH v1 1/3] SHA1 for PPC/SPE - assembler

2015-02-25 Thread David Laight
From: Markus Stockhausen
 [PATCH v1 1/3] SHA1 for PPC/SPE - assembler
 
 This is the assembler code for SHA1 implementation with
 the SIMD SPE instruction set. With the enhanced instruction
 set we can operate on 2 32 bit words in parallel. That helps
 reducing the time to calculate W16-W79. For increasing
 performance even more the assembler function can compute
 hashes for more than one 64 byte input block.
 
 The state of the used SPE registers is preserved via the
 stack so we can run from interrupt context

Does the ppc use the same kind of delayed state save for the SPE
registers that x86 uses (at least on the BSDs) for its FP (etc) regs?

That would mean that the registers might contain values for
a different process, and that the cpu could receive an IPI
requesting they be written to the processes normal save area
so that they can be reloaded onto a different cpu.

David



RE: [PATCH v1 2/7] AES for PPC/SPE - aes tables

2015-02-16 Thread David Laight
From:  Markus Stockhausen
 4K AES tables for big endian

I can't help feeling that you could give more information about how the
values are generated.

...
 + * These big endian AES encryption/decryption tables are designed to be 
 simply
 + * accessed by a combination of rlwimi/lwz instructions with a minimum
 + * of table registers (usually only one required). Thus they are aligned to
 + * 4K. The locality of rotated values is derived from the reduced offsets 
 that
 + * are available in the SPE load instructions. E.g. evldw, evlwwsplat, ...
 + *
 + */
 +.data
 +.align 12
 +.globl PPC_AES_4K_ENCTAB
 +PPC_AES_4K_ENCTAB:
 + .long 0xc66363a5,0xa5c66363,0x63a5c663,0x6363a5c6

These seem to be byte rotates (all down the table).
If so then use a CPP define to generate the rotated values.
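For instance (a sketch only; ROR8/AES_TE are made-up names, and this assumes
the .S file goes through the C preprocessor, as kernel assembler files do):

#define ROR8(x)   ((((x) >> 8) & 0x00ffffff) | (((x) & 0xff) << 24))
#define AES_TE(x) .long (x), ROR8(x), ROR8(ROR8(x)), ROR8(ROR8(ROR8(x)))

	AES_TE(0xc66363a5)	/* expands to the four rotated words above */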

I'd like to see a reference to where the values themselves come from.

 + .long 0xf87c7c84,0x84f87c7c,0x7c84f87c,0x7c7c84f8
...
 + .long 0x6dbbbbd6,0xd66dbbbb,0xbbd66dbb,0xbbbbd66d
 + .long 0x2c16163a,0x3a2c1616,0x163a2c16,0x16163a2c
 +.globl PPC_AES_4K_DECTAB
 +PPC_AES_4K_DECTAB:
 + .long 0x51f4a750,0x5051f4a7,0xa75051f4,0xf4a75051
...
 + .long 0xd0b85742,0x42d0b857,0x5742d0b8,0xb85742d0

Some explanation of this third dataset is also needed.

 + .byte 0x52, 0x09, 0x6a, 0xd5, 0x30, 0x36, 0xa5, 0x38
...
 + .byte 0xe1, 0x69, 0x14, 0x63, 0x55, 0x21, 0x0c, 0x7d

David



RE: [PATCH 2/3] sha512: reduce stack usage to safe number

2012-01-16 Thread David Laight
Doesn't this badly overflow W[]?
 
 +#define SHA512_0_15(i, a, b, c, d, e, f, g, h) \
 + t1 = h + e1(e) + Ch(e, f, g) + sha512_K[i] + W[i];  \
...
 + for (i = 0; i < 16; i += 8) {
...
 + SHA512_0_15(i + 7, b, c, d, e, f, g, h, a);
 + }

David




RE: sha512: make it work, undo percpu message schedule

2012-01-13 Thread David Laight
 
 Trying a dynamic memory allocation, and fallback on a single
 pre-allocated bloc of memory, shared by all cpus, protected by a
 spinlock
...
 -
 + static u64 msg_schedule[80];
 + static DEFINE_SPINLOCK(msg_schedule_lock);
   int i;
 - u64 *W = get_cpu_var(msg_schedule);
 + u64 *W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
  
 + 
 + if (!W) {
 + spin_lock_bh(&msg_schedule_lock);
 + W = msg_schedule;
 + }

If this code can be called from an ISR, is the kzalloc() call safe?

If the above is safe, wouldn't it be better to (see the sketch below):
1) try to use the static buffer
2) try to kzalloc() a buffer
3) spinwait for the static buffer to be free
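A sketch of that ordering (untested; the matching unlock/kfree on the way
out is omitted):

	static u64 msg_schedule[80];
	static DEFINE_SPINLOCK(msg_schedule_lock);
	u64 *W;

	if (spin_trylock_bh(&msg_schedule_lock)) {
		/* 1) the static buffer was free - use it */
		W = msg_schedule;
	} else {
		/* 2) otherwise try to allocate */
		W = kzalloc(sizeof(msg_schedule), GFP_ATOMIC | __GFP_NOWARN);
		if (!W) {
			/* 3) no memory either - wait for the static buffer */
			spin_lock_bh(&msg_schedule_lock);
			W = msg_schedule;
		}
	}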

David


