Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Mon, Nov 07, 2016 at 08:02:35PM +0100, Jason A. Donenfeld wrote:
> On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers wrote:
> >
> > I was not referring to any users in particular, only what users could
> > do. As an example, if you did crypto_shash_update() with 32, 15, then
> > 17 bytes, and the underlying algorithm is poly1305-generic, the last
> > block would end up misaligned. This doesn't appear possible with your
> > pseudocode because it only passes in multiples of the block size until
> > the very end. However I don't see it claimed anywhere that shash API
> > users have to do that.
>
> Actually it appears that crypto/poly1305_generic.c already buffers
> incoming blocks to a buffer that definitely looks aligned, to prevent
> this condition!

No, it does *not* buffer all incoming blocks, which is why the source
pointer can fall out of alignment. Yes, I actually tested this. In fact
this situation is even hit, in both possible places, in the self-tests.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-crypto" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Mon, Nov 7, 2016 at 8:25 PM, Eric Biggers wrote:
> No it does *not* buffer all incoming blocks, which is why the source
> pointer can fall out of alignment. Yes, I actually tested this. In fact
> this situation is even hit, in both possible places, in the self-tests.

Urgh! v3 coming right up...
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Mon, Nov 7, 2016 at 7:26 PM, Eric Biggers wrote:
>
> I was not referring to any users in particular, only what users could
> do. As an example, if you did crypto_shash_update() with 32, 15, then
> 17 bytes, and the underlying algorithm is poly1305-generic, the last
> block would end up misaligned. This doesn't appear possible with your
> pseudocode because it only passes in multiples of the block size until
> the very end. However I don't see it claimed anywhere that shash API
> users have to do that.

Actually it appears that crypto/poly1305_generic.c already buffers
incoming blocks to a buffer that definitely looks aligned, to prevent
this condition! I'll submit a v2 with only the inner unaligned
operations changed.
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Mon, Nov 07, 2016 at 07:08:22PM +0100, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

I was not referring to any users in particular, only what users could do.
As an example, if you did crypto_shash_update() with 32, 15, then 17
bytes, and the underlying algorithm is poly1305-generic, the last block
would end up misaligned. This doesn't appear possible with your
pseudocode because it only passes in multiples of the block size until
the very end. However I don't see it claimed anywhere that shash API
users have to do that.

Eric
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Mon, Nov 7, 2016 at 7:08 PM, Jason A. Donenfeld wrote:
> Hmm... The general data flow that strikes me as most pertinent is
> something like:
>
> struct sk_buff *skb = get_it_from_somewhere();
> skb = skb_share_check(skb, GFP_ATOMIC);
> num_frags = skb_cow_data(skb, ..., ...);
> struct scatterlist sg[num_frags];
> sg_init_table(sg, num_frags);
> skb_to_sgvec(skb, sg, ..., ...);
> blkcipher_walk_init(&walk, sg, sg, len);
> blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
> while (walk.nbytes >= BLOCK_SIZE) {
>     size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
>     poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
>     blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
> }
> if (walk.nbytes) {
>     poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
>     blkcipher_walk_done(&desc, &walk, 0);
> }
>
> Is your suggestion that in the final if block, walk.src.virt.addr
> might be unaligned? Like in the case of the last fragment being 67
> bytes long?

In fact, I'm not so sure this happens here. In the while loop, each new
walk.src.virt.addr will be aligned to BLOCK_SIZE or be aligned by virtue
of being at the start of a new page. In the subsequent if block,
walk.src.virt.addr will either be some_aligned_address + BLOCK_SIZE,
which will be aligned, or it will be the start of a new page, which will
be aligned. So what did you have in mind exactly? I don't think anybody
is running code like:

    for (size_t i = 0; i < len; i += 17)
        poly1305_update(&state, &data[i], 17);

(And if so, those consumers should be fixed.)
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
Hi Eric,

On Fri, Nov 4, 2016 at 6:37 PM, Eric Biggers wrote:
> I agree, and the current code is wrong; but do note that this proposal
> is correct for poly1305_setrkey() but not for poly1305_setskey() and
> poly1305_blocks(). In the latter two cases, 4-byte alignment of the
> source buffer is *not* guaranteed. Although crypto_poly1305_update()
> will be called with a 4-byte aligned buffer due to the alignmask set
> on poly1305_alg, the algorithm operates on 16-byte blocks and
> therefore has to buffer partial blocks. If some number of bytes that
> is not 0 mod 4 is buffered, then the buffer will fall out of alignment
> on the next update call. Hence, get_unaligned_le32() is actually
> needed on all the loads, since the buffer will, in general, be of
> unknown alignment.

Hmm... The general data flow that strikes me as most pertinent is
something like:

    struct sk_buff *skb = get_it_from_somewhere();
    skb = skb_share_check(skb, GFP_ATOMIC);
    num_frags = skb_cow_data(skb, ..., ...);
    struct scatterlist sg[num_frags];
    sg_init_table(sg, num_frags);
    skb_to_sgvec(skb, sg, ..., ...);
    blkcipher_walk_init(&walk, sg, sg, len);
    blkcipher_walk_virt_block(&desc, &walk, BLOCK_SIZE);
    while (walk.nbytes >= BLOCK_SIZE) {
        size_t chunk_len = rounddown(walk.nbytes, BLOCK_SIZE);
        poly1305_update(&poly1305_state, walk.src.virt.addr, chunk_len);
        blkcipher_walk_done(&desc, &walk, walk.nbytes % BLOCK_SIZE);
    }
    if (walk.nbytes) {
        poly1305_update(&poly1305_state, walk.src.virt.addr, walk.nbytes);
        blkcipher_walk_done(&desc, &walk, 0);
    }

Is your suggestion that in the final if block, walk.src.virt.addr might
be unaligned? Like in the case of the last fragment being 67 bytes long?

If so, what a hassle. I hope the performance overhead isn't too awful...
I'll resubmit taking into account your suggestions.

By the way -- offlist benchmarks sent to me concluded that using the
unaligned load helpers like David suggested is just as fast as the
handrolled bit magic in the v1.
Regards,
Jason
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Thu, Nov 03, 2016 at 11:20:08PM +0100, Jason A. Donenfeld wrote:
> Hi David,
>
> On Thu, Nov 3, 2016 at 6:08 PM, David Miller wrote:
> > In any event no piece of code should be doing 32-bit word reads from
> > addresses like "x + 3" without, at a very minimum, going through the
> > kernel unaligned access handlers.
>
> Excellent point. In other words,
>
> ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
> ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
> ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
> ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
> ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
>
> should change to:
>
> ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
> ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
> ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
> ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
> ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

I agree, and the current code is wrong; but do note that this proposal
is correct for poly1305_setrkey() but not for poly1305_setskey() and
poly1305_blocks(). In the latter two cases, 4-byte alignment of the
source buffer is *not* guaranteed. Although crypto_poly1305_update()
will be called with a 4-byte aligned buffer due to the alignmask set on
poly1305_alg, the algorithm operates on 16-byte blocks and therefore has
to buffer partial blocks. If some number of bytes that is not 0 mod 4 is
buffered, then the buffer will fall out of alignment on the next update
call. Hence, get_unaligned_le32() is actually needed on all the loads,
since the buffer will, in general, be of unknown alignment.

Note: some other shash algorithms have this problem too and do not
handle it correctly. It seems to be a common mistake.

Eric
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
Hi David,

On Thu, Nov 3, 2016 at 6:08 PM, David Miller wrote:
> In any event no piece of code should be doing 32-bit word reads from
> addresses like "x + 3" without, at a very minimum, going through the
> kernel unaligned access handlers.

Excellent point. In other words,

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

should change to:

    ctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
    ctx->r[1] = (get_unaligned_le32(key +  3) >> 2) & 0x3ffff03;
    ctx->r[2] = (get_unaligned_le32(key +  6) >> 4) & 0x3ffc0ff;
    ctx->r[3] = (get_unaligned_le32(key +  9) >> 6) & 0x3f03fff;
    ctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;

> We know explicitly that these offsets will not be 32-bit aligned, so
> it is required that we use the helpers, or alternatively do things to
> avoid these unaligned accesses such as using temporary storage when
> the HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.

So the question is: is the clever avoidance of unaligned accesses in the
original patch faster or slower than changing the unaligned accesses to
use the helper function?

I've put a little test harness together for playing with this:

    $ git clone git://git.zx2c4.com/polybench
    $ cd polybench
    $ make run

To test with one method, do as normal. To test with the other, remove
"#define USE_FIRST_METHOD" from the source code.

@René: do you think you could retest on your MIPS32r2 hardware and
report back which is faster? And if anybody else has other hardware and
would like to try, this could be nice.

Regards,
Jason
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
From: "Jason A. Donenfeld"
Date: Thu, 3 Nov 2016 08:24:57 +0100

> Hi Herbert,
>
> On Thu, Nov 3, 2016 at 1:49 AM, Herbert Xu wrote:
> > FWIW I'd rather live with a 6% slowdown than having two different
> > code paths in the generic code. Anyone who cares about 6% would
> > be much better off writing an assembly version of the code.
>
> Please think twice before deciding that the generic C "is allowed to
> be slow".

In any event no piece of code should be doing 32-bit word reads from
addresses like "x + 3" without, at a very minimum, going through the
kernel unaligned access handlers. Yet that is what the generic C
poly1305 code is doing, all over the place.

We know explicitly that these offsets will not be 32-bit aligned, so it
is required that we use the helpers, or alternatively do things to avoid
these unaligned accesses such as using temporary storage when the
HAVE_EFFICIENT_UNALIGNED_ACCESS kconfig value is not set.
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
Hi Herbert,

On Thu, Nov 3, 2016 at 1:49 AM, Herbert Xu wrote:
> FWIW I'd rather live with a 6% slowdown than having two different
> code paths in the generic code. Anyone who cares about 6% would
> be much better off writing an assembly version of the code.

Please think twice before deciding that the generic C "is allowed to be
slow". It turns out to be used far more often than might be obvious. For
example, crypto is commonly done at the netdev layer -- as is the case
with mac80211-based drivers. At this layer, the FPU on x86 isn't always
available, depending on the path used. Some combinations of drivers,
packet family, and workload can result in the generic C being used
instead of the vectorized assembly for a massive percentage of the time.
So, I think we do have a good motivation for wanting the generic C to be
as fast as possible.

In the particular case of poly1305, these are the only spots where
unaligned accesses take place, and they're rather small, and I think
it's pretty obvious what's happening in the two different cases of code
from a quick glance. This isn't the "two different paths" case in which
there's a significant future-facing maintenance burden.

Jason
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 02, 2016 at 11:00:00PM +0100, Jason A. Donenfeld wrote:
>
> Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
> probably best to have the two paths in there, and not reduce it to
> one.

FWIW I'd rather live with a 6% slowdown than having two different
code paths in the generic code. Anyone who cares about 6% would
be much better off writing an assembly version of the code.

Cheers,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 2, 2016 at 10:26 PM, Herbert Xu wrote:
> What I'm interested in is whether the new code is sufficiently
> close in performance to the old code, particularly on x86.
>
> I'd much rather only have a single set of code for all architectures.
> After all, this is meant to be a generic implementation.

Just tested. I get a 6% slowdown on my Skylake. No good. I think it's
probably best to have the two paths in there, and not reduce it to one.
Fast Code and HAVE_EFFICIENT_UNALIGNED_ACCESS (was: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access)
On Wed, Nov 2, 2016 at 5:25 PM, Jason A. Donenfeld wrote:
> These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:
>
> s390 arm arm64 powerpc x86 x86_64
>
> So, these will use the original old code.
>
> The architectures that will thus use the new code are:
>
> alpha arc avr32 blackfin c6x cris frv h8300 hexagon ia64 m32r m68k
> metag microblaze mips mn10300 nios2 openrisc parisc score sh sparc
> tile um unicore32 xtensa

What I have found in practice, from helping maintain a security library
and running benchmarks until my eyes bled, is that UNALIGNED_ACCESS is a
kiss of death. It effectively prohibits -O3 and above due to undefined
behavior in C and problems with GCC vectorization. In the bigger
picture, it simply slows things down.

Once we moved away from UNALIGNED_ACCESS and started testing at -O3 and
-O5, the benchmarks enjoyed non-trivial speedups on top of any speedups
we were trying to achieve with hand-tuned assembly language routines.
Effectively, the best speedup was the sum of the C code and the ASM;
they were not disjoint as they appear.

The one wrinkle for UNALIGNED_ACCESS is Bernstein's compressed tables
(https://cr.yp.to/antiforgery/cachetiming-20050414.pdf).
UNALIGNED_ACCESS meets some security goals. The techniques from
Bernstein's paper apply equally well to AES, Camellia and other
table-driven implementations. Painting with a broad brush (and as far as
I know), the kernel is not observing the recommendations.

My apologies if I parsed things incorrectly.

Jeff
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 02, 2016 at 10:25:00PM +0100, Jason A. Donenfeld wrote:
> These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:
>
> s390 arm arm64 powerpc x86 x86_64
>
> So, these will use the original old code.

What I'm interested in is whether the new code is sufficiently close in
performance to the old code, particularly on x86.

I'd much rather only have a single set of code for all architectures.
After all, this is meant to be a generic implementation.

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
These architectures select HAVE_EFFICIENT_UNALIGNED_ACCESS:

    s390 arm arm64 powerpc x86 x86_64

So, these will use the original old code. The architectures that will
thus use the new code are:

    alpha arc avr32 blackfin c6x cris frv h8300 hexagon ia64 m32r m68k
    metag microblaze mips mn10300 nios2 openrisc parisc score sh sparc
    tile um unicore32 xtensa

Unfortunately, of these, the only machines I have access to are MIPS.
My SPARC access went cold a few years ago. If you insist on a
data-motivated approach, then I fear my test of 1 out of 26 different
architectures is woefully insufficient. Does anybody else on the list
have access to more hardware and is interested in benchmarking?

If not, is there a reasonable way to decide on this by considering the
added complexity of the code? Are we able to reason about best and worst
cases of instruction latency vs. unalignment stalls for most CPU
designs?
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 02, 2016 at 10:06:39PM +0100, Jason A. Donenfeld wrote:
> On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu wrote:
> > Can you give some numbers please? What about other architectures
> > that your patch impacts?
>
> Per [1], the patch gives a 181% speed up on MIPS32r2.
>
> [1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html

What about other architectures? In particular, what if we just used your
new code for all architectures? How much would we lose?

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 2, 2016 at 9:09 PM, Herbert Xu wrote:
> Can you give some numbers please? What about other architectures
> that your patch impacts?

Per [1], the patch gives a 181% speed up on MIPS32r2.

[1] https://lists.zx2c4.com/pipermail/wireguard/2016-September/000398.html
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 2, 2016 at 4:09 PM, Herbert Xu wrote:
> On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
> > On MIPS chips commonly found in inexpensive routers, this makes a big
> > difference in performance.
> >
> > Signed-off-by: Jason A. Donenfeld
>
> Can you give some numbers please? What about other architectures
> that your patch impacts?

In general it is not always clear that using whatever hardware crypto is
available is a good idea. Not all such hardware is fast, some CPUs are,
some CPUs have hardware for AES, and even if the hardware is faster than
the CPU, the context switch overheads may exceed the advantage. Ideally
the patch development or acceptance process would be testing this, but I
think it might be difficult to reach that ideal.

The exception is a hardware RNG; that should always be used unless it is
clearly awful. It cannot do harm, speed is not much of an issue, and it
solves the hardest problem in the random(4) driver: making sure of
correct initialisation before any use.
Re: [PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On Wed, Nov 02, 2016 at 06:58:10PM +0100, Jason A. Donenfeld wrote:
> On MIPS chips commonly found in inexpensive routers, this makes a big
> difference in performance.
>
> Signed-off-by: Jason A. Donenfeld

Can you give some numbers please? What about other architectures
that your patch impacts?

Thanks,
--
Email: Herbert Xu
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
[PATCH] poly1305: generic C can be faster on chips with slow unaligned access
On MIPS chips commonly found in inexpensive routers, this makes a big
difference in performance.

Signed-off-by: Jason A. Donenfeld
---
 crypto/poly1305_generic.c | 29 ++++++++++++++++++++++++++++-
 1 file changed, 28 insertions(+), 1 deletion(-)

diff --git a/crypto/poly1305_generic.c b/crypto/poly1305_generic.c
index 2df9835d..186e33d 100644
--- a/crypto/poly1305_generic.c
+++ b/crypto/poly1305_generic.c
@@ -65,11 +65,24 @@ EXPORT_SYMBOL_GPL(crypto_poly1305_setkey);
 static void poly1305_setrkey(struct poly1305_desc_ctx *dctx, const u8 *key)
 {
 	/* r &= 0xffffffc0ffffffc0ffffffc0fffffff */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 	dctx->r[0] = (le32_to_cpuvp(key +  0) >> 0) & 0x3ffffff;
 	dctx->r[1] = (le32_to_cpuvp(key +  3) >> 2) & 0x3ffff03;
 	dctx->r[2] = (le32_to_cpuvp(key +  6) >> 4) & 0x3ffc0ff;
 	dctx->r[3] = (le32_to_cpuvp(key +  9) >> 6) & 0x3f03fff;
 	dctx->r[4] = (le32_to_cpuvp(key + 12) >> 8) & 0x00fffff;
+#else
+	u32 t0, t1, t2, t3;
+	t0 = le32_to_cpuvp(key +  0);
+	t1 = le32_to_cpuvp(key +  4);
+	t2 = le32_to_cpuvp(key +  8);
+	t3 = le32_to_cpuvp(key + 12);
+	dctx->r[0] = t0 & 0x3ffffff; t0 >>= 26; t0 |= t1 << 6;
+	dctx->r[1] = t0 & 0x3ffff03; t1 >>= 20; t1 |= t2 << 12;
+	dctx->r[2] = t1 & 0x3ffc0ff; t2 >>= 14; t2 |= t3 << 18;
+	dctx->r[3] = t2 & 0x3f03fff; t3 >>= 8;
+	dctx->r[4] = t3 & 0x00fffff;
+#endif
 }

 static void poly1305_setskey(struct poly1305_desc_ctx *dctx, const u8 *key)
@@ -109,6 +122,9 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	u32 s1, s2, s3, s4;
 	u32 h0, h1, h2, h3, h4;
 	u64 d0, d1, d2, d3, d4;
+#ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
+	u32 t0, t1, t2, t3;
+#endif
 	unsigned int datalen;

 	if (unlikely(!dctx->sset)) {
@@ -135,13 +151,24 @@ static unsigned int poly1305_blocks(struct poly1305_desc_ctx *dctx,
 	h4 = dctx->h[4];

 	while (likely(srclen >= POLY1305_BLOCK_SIZE)) {
-		/* h += m[i] */
+#ifdef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 		h0 += (le32_to_cpuvp(src +  0) >> 0) & 0x3ffffff;
 		h1 += (le32_to_cpuvp(src +  3) >> 2) & 0x3ffffff;
 		h2 += (le32_to_cpuvp(src +  6) >> 4) & 0x3ffffff;
 		h3 += (le32_to_cpuvp(src +  9) >> 6) & 0x3ffffff;
 		h4 += (le32_to_cpuvp(src + 12) >> 8) | hibit;
+#else
+		t0 = le32_to_cpuvp(src +  0);
+		t1 = le32_to_cpuvp(src +  4);
+		t2 = le32_to_cpuvp(src +  8);
+		t3 = le32_to_cpuvp(src + 12);
+		h0 += t0 & 0x3ffffff;
+		h1 += sr(((u64)t1 << 32) | t0, 26) & 0x3ffffff;
+		h2 += sr(((u64)t2 << 32) | t1, 20) & 0x3ffffff;
+		h3 += sr(((u64)t3 << 32) | t2, 14) & 0x3ffffff;
+		h4 += (t3 >> 8) | hibit;
+#endif

 		/* h *= r */
 		d0 = mlt(h0, r0) + mlt(h1, s4) + mlt(h2, s3) +
--
2.10.2