Re: [PATCH] crypto: virtio/akcipher - Fix stack overflow on memcpy
On Tue, Jan 30, 2024 at 7:28 PM zhenwei pi wrote:
> sizeof(struct virtio_crypto_akcipher_session_para) is less than
> sizeof(struct virtio_crypto_op_ctrl_req::u); copying more bytes from
> the stack variable leads to a stack overflow. Clang reports this issue
> with the commands:
>
>   make -j CC=clang-14 mrproper >/dev/null 2>&1
>   make -j O=/tmp/crypto-build CC=clang-14 allmodconfig >/dev/null 2>&1
>   make -j O=/tmp/crypto-build W=1 CC=clang-14 drivers/crypto/virtio/virtio_crypto_akcipher_algs.o
>
> Fixes: 59ca6c93387d ("virtio-crypto: implement RSA algorithm")
> Link: https://lore.kernel.org/all/0a194a79-e3a3-45e7-be98-83abd3e1c...@roeck-us.net/
> Signed-off-by: zhenwei pi

Acked-by: Jason Wang

Thanks
Re: [PATCH] crypto: arm/curve25519 - Move '.fpu' after '.arch'
Seems reasonable to me.

Acked-by: Jason A. Donenfeld
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
On Fri, Apr 9, 2021 at 6:47 AM Simo Sorce wrote:
> > depends on m || !CRYPTO_FIPS
> >
> > but I am a bit concerned that the rather intricate kconfig
> > dependencies between the generic and arch-optimized versions of those
> > drivers get complicated even further.
>
> Actually this is the opposite of the direction we are planning to go for
> future FIPS certifications.
>
> Due to requirements about crypto module naming and versioning in the
> new FIPS 140-3 standard, we are planning to always build all the CRYPTO
> options as builtin (and maybe even forbid loading additional crypto
> modules in FIPS mode). This is clearly just a vendor choice and has no
> bearing on what upstream ultimately will do, but I'm just throwing it
> out here as a data point.

I'm wondering: do you intend to apply similar patches to all the other uses of "non-FIPS-certified" crypto in the kernel? I've already brought up big_key.c, for example. Also, if you're intent on adding this check to WireGuard because it tunnels packets without using FIPS-certified crypto primitives, do you also plan on adding this check to other network tunnels that don't tunnel packets using FIPS-certified crypto primitives? For example, GRE, VXLAN, GENEVE?

I'd be inclined to take this patch more seriously if it were exhaustive and coherent for your use case. The targeted hit on WireGuard seems incoherent as a standalone patch, making it hard to even evaluate. So I think either you should send an exhaustive patch series that forbids all use of non-FIPS crypto anywhere in the kernel (another example: net/core/secure_seq.c) in addition to all tunneling modules that don't use FIPS-certified crypto, or figure out how to disable the lib/crypto primitives that you want to be disabled in "fips mode". With a coherent patchset for either of these, we can then evaluate it.

Jason
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
On Fri, Apr 9, 2021 at 2:08 AM Hangbin Liu wrote:
> After offline discussion with Herbert, here is what he said:
>
> """
> This is not a problem in RHEL8 because the Crypto API RNG replaces
> /dev/random in FIPS mode.
> """

So far as I can see, this isn't the case in the kernel sources I'm reading? Maybe you're doing some userspace hack with CUSE? But at least get_random_bytes doesn't behave this way...
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
On Fri, Apr 09, 2021 at 10:49:07AM +0800, Hangbin Liu wrote:
> On Thu, Apr 08, 2021 at 08:44:35PM -0600, Jason A. Donenfeld wrote:
> > Since it's just a normal module library, you can simply do this in the
> > module_init function, rather than deep within registration
> > abstractions.
>
> I did a try but looks it's not that simple. Not sure if it's because wireguard
> calls the library directly. Need to check more...

Something like the below should work...

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index a408f4bcfd62..47212f9421c1 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -297,6 +298,9 @@ static int __init chacha_simd_mod_init(void)
 {
 	int err = 0;

+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (IS_REACHABLE(CONFIG_CRYPTO_BLKCIPHER)) {
 		err = crypto_register_skciphers(arm_algs, ARRAY_SIZE(arm_algs));
 		if (err)
diff --git a/arch/arm/crypto/curve25519-glue.c b/arch/arm/crypto/curve25519-glue.c
index 31eb75b6002f..d03f810fdaf3 100644
--- a/arch/arm/crypto/curve25519-glue.c
+++ b/arch/arm/crypto/curve25519-glue.c
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -114,6 +115,9 @@ static struct kpp_alg curve25519_alg = {
 static int __init mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (elf_hwcap & HWCAP_NEON) {
 		static_branch_enable(&have_neon);
 		return IS_REACHABLE(CONFIG_CRYPTO_KPP) ?
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index 3023c1acfa19..30d6c6de7a27 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include

 void poly1305_init_arm(void *state, const u8 *key);
 void poly1305_blocks_arm(void *state, const u8 *src, u32 len, u32 hibit);
@@ -240,6 +241,9 @@ static struct shash_alg arm_poly1305_algs[] = {{
 static int __init arm_poly1305_mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
 	    (elf_hwcap & HWCAP_NEON))
 		static_branch_enable(&have_neon);
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 1d9824c4ae43..1696993326b5 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -26,6 +26,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
@@ -214,6 +215,9 @@ static struct skcipher_alg algs[] = {
 static int __init chacha_simd_mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (!cpu_have_named_feature(ASIMD))
 		return 0;
diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index f33ada70c4ed..ac257a52be4d 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include

 asmlinkage void poly1305_init_arm64(void *state, const u8 *key);
 asmlinkage void poly1305_blocks(void *state, const u8 *src, u32 len, u32 hibit);
@@ -208,6 +209,9 @@ static struct shash_alg neon_poly1305_alg = {
 static int __init neon_poly1305_mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (!cpu_have_named_feature(ASIMD))
 		return 0;
diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
index 90896029d0cd..31f8294f2a31 100644
--- a/arch/mips/crypto/chacha-glue.c
+++ b/arch/mips/crypto/chacha-glue.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include

 asmlinkage void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
 				  unsigned int bytes, int nrounds);
@@ -128,6 +129,9 @@ static struct skcipher_alg algs[] = {
 static int __init chacha_simd_mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	return IS_REACHABLE(CONFIG_CRYPTO_BLKCIPHER) ?
 		crypto_register_skciphers(algs, ARRAY_SIZE(algs)) : 0;
 }
diff --git a/arch/mips/crypto/poly1305-glue.c b/arch/mips/crypto/poly1305-glue.c
index fc881b46d911..f5edec10cef8 100644
--- a/arch/mips/crypto/poly1305-glue.c
+++ b/arch/mips/crypto/poly1305-glue.c
@@ -12,6 +12,7 @@
 #include
 #include
 #include
+#include

 asmlinkage void poly1305_init_mips(void *state, const u8 *key);
 asmlinkage void poly1305_blocks_mips(void *state, const u8 *src, u32 len, u32 hibit);
@@ -173,6 +174,9 @@ static struct shash_alg mips_poly1305_alg = {
 static int __init mips_poly1305_mod_init(void)
 {
+	if (fips_enabled)
+		return -EOPNOTSUPP;
+
 	return IS_REACHABLE(CONFIG_CRYPTO_HASH) ?
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
Hi Hangbin,

On Thu, Apr 8, 2021 at 8:41 PM Hangbin Liu wrote:
> I agree that the best way is to disable the crypto modules in FIPS mode.
> But the code in lib/crypto looks not the same as crypto/. For the modules
> in crypto/, there is an alg_test() to check whether the crypto is FIPS-allowed
> at registration time:
>
> - crypto_register_alg()
>   - crypto_wait_for_test()
>     - crypto_probing_notify(CRYPTO_MSG_ALG_REGISTER, larval->adult)
>       - cryptomgr_schedule_test()
>         - cryptomgr_test()
>           - alg_test()
>
> But in lib/crypto the code is more like a library. We can call it anytime
> and there is no registration. Maybe we should add a similar check in
> lib/crypto. But I'm not familiar with the crypto code... Not sure if anyone
> on linux-crypto@ would like to help do that.

Since it's just a normal module library, you can simply do this in the module_init function, rather than deep within registration abstractions.

> > diff --git a/lib/crypto/curve25519.c b/lib/crypto/curve25519.c
> > index 288a62cd29b2..b794f49c291a 100644
> > --- a/lib/crypto/curve25519.c
> > +++ b/lib/crypto/curve25519.c
> > @@ -12,11 +12,15 @@
> >  #include
> >  #include
> >  #include
> > +#include
> >
> >  bool curve25519_selftest(void);
> >
> >  static int __init mod_init(void)
> >  {
> > +	if (!fips_enabled)
> > +		return -EOPNOTSUPP;
>
> Question here, why is it !fips_enabled? Shouldn't we return an error when
> fips_enabled?

Er, just not thinking straight today. `if (fips_enabled)` is probably what you want indeed.

Jason
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
On Thu, Apr 8, 2021 at 7:55 AM Simo Sorce wrote:
> > I'm not sure this makes so much sense to do _in wireguard_. If you
> > feel like the FIPS-allergic part is actually blake, 25519, chacha, and
> > poly1305, then wouldn't it make most sense to disable _those_ modules
> > instead? And then the various things that rely on those (such as
> > wireguard, but maybe there are other things too, like
> > security/keys/big_key.c) would be naturally disabled transitively?
>
> The reason why it is better to disable the whole module is that it
> provides much better feedback to users. Letting init go through and then
> just failing operations once someone tries to use them is much harder to
> deal with in terms of figuring out what went wrong.
>
> Also wireguard seems to be poking directly into the algorithm
> implementations and not using the crypto API, so disabling individual
> algorithms via the regular fips_enabled mechanism at runtime doesn't
> work.

What I'm suggesting _would_ work in basically the exact same way as this patch. Namely, something like:

diff --git a/lib/crypto/curve25519.c b/lib/crypto/curve25519.c
index 288a62cd29b2..b794f49c291a 100644
--- a/lib/crypto/curve25519.c
+++ b/lib/crypto/curve25519.c
@@ -12,11 +12,15 @@
 #include
 #include
 #include
+#include

 bool curve25519_selftest(void);

 static int __init mod_init(void)
 {
+	if (!fips_enabled)
+		return -EOPNOTSUPP;
+
 	if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
 	    WARN_ON(!curve25519_selftest()))
 		return -ENODEV;

Making the various lib/crypto/* modules return EOPNOTSUPP will in turn mean that wireguard will refuse to load, due to !fips_enabled. It has the positive effect that all other things that use it will also be EOPNOTSUPP. For example, what are you doing about big_key.c? Right now, I assume nothing. But this way, you'd get all of the various effects for free.
Are you going to continuously audit all uses of non-FIPS crypto and add `if (!fips_enabled)` to every new use case, always, everywhere, from now into the boundless future? By adding `if (!fips_enabled)` to wireguard, that's what you're signing yourself up for. Instead, by restricting the lib/crypto/* modules to !fips_enabled, you can get all of those transitive effects without having to do anything additional. Alternatively, I agree with Eric - why not just consider this outside your boundary? Jason
Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode
Hi Hangbin,

On Wed, Apr 7, 2021 at 5:39 AM Hangbin Liu wrote:
> As the cryptos (BLAKE2S, Curve25519, CHACHA20POLY1305) in WireGuard are not
> FIPS certified, the WireGuard module should be disabled in FIPS mode.

I'm not sure this makes so much sense to do _in wireguard_. If you feel like the FIPS-allergic part is actually blake, 25519, chacha, and poly1305, then wouldn't it make most sense to disable _those_ modules instead? And then the various things that rely on those (such as wireguard, but maybe there are other things too, like security/keys/big_key.c) would be naturally disabled transitively?

[As an aside, I don't think any of this fips-flag-in-the-kernel makes much sense at all for anything, but that seems like a different discussion, maybe?]

Jason
Re: [PATCH v2] crypto: lib/chacha20poly1305 - define empty module exit function
On Sat, Jan 16, 2021 at 1:30 AM John Donnelly wrote:
> > On Jan 15, 2021, at 1:30 PM, Jason A. Donenfeld wrote:
> >
> > With no mod_exit function, users are unable to unload the module after
> > use. I'm not aware of any reason why module unloading should be
> > prohibited for this one, so this commit simply adds an empty exit
> > function.
> >
> > Reported-and-tested-by: John Donnelly
> > Acked-by: Ard Biesheuvel
> > Signed-off-by: Jason A. Donenfeld
>
> Thanks!
>
> Would someone be kind enough to remind me when this appears, and I will
> apply it to our product? We like to use published commits when possible.
>
> JD

It'll show up in one of these two repos in a week or two:

https://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git/
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git/
[PATCH v2] crypto: lib/chacha20poly1305 - define empty module exit function
With no mod_exit function, users are unable to unload the module after use. I'm not aware of any reason why module unloading should be prohibited for this one, so this commit simply adds an empty exit function.

Reported-and-tested-by: John Donnelly
Acked-by: Ard Biesheuvel
Signed-off-by: Jason A. Donenfeld
---
v1->v2:
 - Fix typo in commit message.

 lib/crypto/chacha20poly1305.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c
index 5850f3b87359..c2fcdb98cc02 100644
--- a/lib/crypto/chacha20poly1305.c
+++ b/lib/crypto/chacha20poly1305.c
@@ -362,7 +362,12 @@ static int __init mod_init(void)
 	return 0;
 }

+static void __exit mod_exit(void)
+{
+}
+
 module_init(mod_init);
+module_exit(mod_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_DESCRIPTION("ChaCha20Poly1305 AEAD construction");
 MODULE_AUTHOR("Jason A. Donenfeld ");
--
2.30.0
[PATCH] crypto: lib/chacha20poly1305 - define empty module exit function
With no mod_exit function, users are unable to load the module after use. I'm not aware of any reason why module unloading should be prohibited for this one, so this commit simply adds an empty exit function.

Reported-by: John Donnelly
Signed-off-by: Jason A. Donenfeld
---
 lib/crypto/chacha20poly1305.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c
index 5850f3b87359..c2fcdb98cc02 100644
--- a/lib/crypto/chacha20poly1305.c
+++ b/lib/crypto/chacha20poly1305.c
@@ -362,7 +362,12 @@ static int __init mod_init(void)
 	return 0;
 }

+static void __exit mod_exit(void)
+{
+}
+
 module_init(mod_init);
+module_exit(mod_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_DESCRIPTION("ChaCha20Poly1305 AEAD construction");
 MODULE_AUTHOR("Jason A. Donenfeld ");
--
2.30.0
Re: drivers/char/random.c needs a (new) maintainer
On Wed, Dec 23, 2020 at 5:03 PM Jason A. Donenfeld wrote:
> Hi Peter,
>
> On Wed, Dec 23, 2020 at 5:01 PM Petr Tesarik wrote:
> > I never suggested that this should serve as a supportive argument. I was
> > just trying to be honest about our motivations.
> >
> > I'm a bit sad that this discussion has quickly gone back to the choice of
> > algorithms and how they can be implemented.
>
> Why are you sad? You are interested in FIPS. FIPS indicates a certain
> set of algorithms. The ones most suitable to the task seem like they'd
> run into real practical problems in the kernel's RNG.
>
> That's not the _only_ reason I'm not keen on FIPS, but it does seem
> like a very basic one.
>
> Jason

And just to add to that: in working through Nicolai's patches (an ongoing process), I'm reminded of his admonishment in the 00 cover letter that at some point chacha20 will have to be replaced, due to FIPS. So it seems like that's very much on the table. I brought it up here as an example ("For example, " is how I began that sentence), but it is a concern.

If you want to make lots of changes for cryptographic or technical reasons, that seems like a decent way to engage. But if the motivation for each of these is the bean counting, then again, I'm pretty wary of churn for nothing. And if that bean counting will eventually lead us into bad corners, like the concerns I brought up about FPU usage in the kernel, then I'm even more hesitant.

However, I think there may be good arguments to be made that some of Nicolai's patches stand on their own, without the FIPS motivation. And that's the set of arguments that are compelling.

Jason
Re: drivers/char/random.c needs a (new) maintainer
Hi Peter,

On Wed, Dec 23, 2020 at 5:01 PM Petr Tesarik wrote:
> I never suggested that this should serve as a supportive argument. I was
> just trying to be honest about our motivations.
>
> I'm a bit sad that this discussion has quickly gone back to the choice of
> algorithms and how they can be implemented.

Why are you sad? You are interested in FIPS. FIPS indicates a certain set of algorithms. The ones most suitable to the task seem like they'd run into real practical problems in the kernel's RNG.

That's not the _only_ reason I'm not keen on FIPS, but it does seem like a very basic one.

Jason
Re: drivers/char/random.c needs a (new) maintainer
On Wed, Dec 23, 2020 at 4:26 PM Stephan Mueller wrote:
> On Wednesday, 2020-12-23 at 15:32 +0100, Jason A. Donenfeld wrote:
> > I would, however, be interested in a keccak-based construction. But
> > just using the keccak permutation does not automatically make it
> > "SHA-3", so we're back at the same issue again. FIPS is simply not
> > interesting for our requirements.
>
> Using non-assessed cryptography? Sounds dangerous to me, even though it
> may be based on some well-known construction.

"assessed" is not necessarily the same as FIPS. Don't conflate the two. I don't appreciate that kind of dishonest argumentation.

And new constructions that I'm interested in would be formally verified (like the other crypto work I've done) with review and buy-in from the cryptographic community, both engineering and academic. I have no interest in submitting "non-assessed" things developed in a vacuum, and I'm displeased with your attempting to make that characterization.

Similarly, any other new design proposed I would expect a similar amount of rigor. The current RNG is admittedly a bit of a mess, but at least it's a design that's evolved. Something that's "revolutionary", rather than evolutionary, needs considerably more argumentation.

So, please, don't strawman this into the "non-assessed" rhetoric.
Re: drivers/char/random.c needs a (new) maintainer
On Wed, Dec 23, 2020 at 3:17 PM Petr Tesarik wrote:
> Upfront, let me admit that SUSE has a vested interest in a FIPS-certifiable
> Linux kernel.

Sorry, but just because you have a "vested interest", or a financial interest, or because you want it does not suddenly make it a good idea. The idea is to have good crypto, not to merely check some boxes for the bean counters.

For example, it's very unlikely that future kernel RNGs will move to using AES, due to the performance overhead involved on non-table-based implementations, and the lack of availability of FPU/AES-NI in all the contexts we need. NT's fortuna machine can use AES, because NT allows the FPU in all contexts. We don't have that luxury (or associated performance penalty).

I would, however, be interested in a keccak-based construction. But just using the keccak permutation does not automatically make it "SHA-3", so we're back at the same issue again. FIPS is simply not interesting for our requirements.

Jason
Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code
Hey Eric, The solution you've proposed at the end of your email is actually kind of similar to what we do with curve25519. Check out include/crypto/curve25519.h. The critical difference between that and the blake proposal is that it's in the header for curve25519, so the indirection disappears. Could we do that with headers for blake? Jason
Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code
On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers wrote:
> From: Eric Biggers
>
> Move the boilerplate code for setkey(), init(), update(), and final()
> from blake2s_generic.ko into a new module blake2s_helpers.ko, and export
> it so that it can be used by other shash implementations of BLAKE2s.
>
> setkey() and init() are exported as-is, while update() and final() have
> a blake2s_compress_t function pointer argument added. This allows the
> implementation of the compression function to be overridden, which is
> the only part that optimized implementations really care about.
>
> The helper functions are defined in a separate module blake2s_helpers.ko
> (rather than just in blake2s_generic.ko) because we can't simply
> select CRYPTO_BLAKE2B from CRYPTO_BLAKE2S_X86. Doing this selection
> unconditionally would make the library API select the shash API, while
> doing it conditionally on CRYPTO_HASH would create a recursive kconfig
> dependency on CRYPTO_HASH. As a bonus, using a separate module also
> allows the generic implementation to be omitted when unneeded.
>
> These helper functions very closely match the ones I defined for
> BLAKE2b, except the BLAKE2b ones didn't go in a separate module yet
> because BLAKE2b isn't exposed through the library API yet.
>
> Finally, use these new helper functions in the x86 implementation of
> BLAKE2s. (This part should be a separate patch, but unfortunately the
> x86 implementation used the exact same function names like
> "crypto_blake2s_update()", so it had to be updated at the same time.)

There's a similar situation happening with chacha20poly1305 and with curve25519. Each of these uses a mildly different approach to how we split things up between library and crypto API code. The _helpers.ko is another one. Is there any opportunity here to take a more unified/consistent approach? Or is shash slightly different from the others and benefits from a third way?

Jason
Re: [PATCH v2 11/11] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM
On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers wrote:
> From: Eric Biggers
>
> When available, select the new implementation of BLAKE2s for 32-bit ARM.
> This is faster than the generic C implementation.
>
> Signed-off-by: Eric Biggers

Reviewed-by: Jason A. Donenfeld

When this series is ready, feel free to pull this commit in via Herbert's tree, rather than my usual wireguard-linux.git->net-next.git route. That will ensure the blake2s stuff lands all at once and we won't have to synchronize anything.
Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers wrote:
> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers wrote:
> > > By the way, if people are interested in having my ARM scalar
> > > implementation of BLAKE2s in the kernel too, I can send a patchset for
> > > that too. It just ended up being slower than BLAKE2b and SHA-1, so it
> > > wasn't as good for the use case mentioned above. If it were to be
> > > added as "blake2s-256-arm", we'd have:
> >
> > I'd certainly be interested in this. Any rough idea how it performs
> > for pretty small messages compared to the generic implementation?
> > 100-140 byte ranges? Is the speedup about the same as for longer
> > messages because this doesn't parallelize across multiple blocks?
>
> It does one block at a time, and there isn't much overhead, so yes the
> speedup on short messages should be about the same as on long messages.
>
> I did a couple quick userspace benchmarks and got (still on Cortex-A7):
>
> 100-byte messages:
>     BLAKE2s ARM:     28.9 cpb
>     BLAKE2s generic: 42.4 cpb
>
> 140-byte messages:
>     BLAKE2s ARM:     29.5 cpb
>     BLAKE2s generic: 44.0 cpb
>
> The results in the kernel may differ a bit, but probably not by much.

That's certainly a nice improvement though, and I'd very much welcome the faster implementation.

Jason
Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b
Hi Eric,

On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers wrote:
> By the way, if people are interested in having my ARM scalar
> implementation of BLAKE2s in the kernel too, I can send a patchset for
> that too. It just ended up being slower than BLAKE2b and SHA-1, so it
> wasn't as good for the use case mentioned above. If it were to be added
> as "blake2s-256-arm", we'd have:

I'd certainly be interested in this. Any rough idea how it performs for pretty small messages compared to the generic implementation? 100-140 byte ranges? Is the speedup about the same as for longer messages because this doesn't parallelize across multiple blocks?

Jason
Re: drivers/char/random.c needs a (new) maintainer
On Mon, Nov 30, 2020 at 5:56 PM Theodore Y. Ts'o wrote:
> patches this cycle. One thing that would help me is if folks
> (especially Jason, if you would) could start with a detailed review of
> Nicolai's patches.

Sure, I'll take a look.
Re: drivers/char/random.c needs a (new) maintainer
I am willing to maintain random.c and have intentions to have a formally verified RNG. I've mentioned this to Ted before. But I think Ted's reluctance to accept the recent patches sent to this list is mostly justified, and I have no desire to see us rush into replacing random.c with something suboptimal or FIPSy.
Re: [PATCH 000/141] Fix fall-through warnings for Clang
On Fri, Nov 20, 2020 at 12:21:39PM -0600, Gustavo A. R. Silva wrote:
> IB/hfi1: Fix fall-through warnings for Clang
> IB/mlx4: Fix fall-through warnings for Clang
> IB/qedr: Fix fall-through warnings for Clang
> RDMA/mlx5: Fix fall-through warnings for Clang

I picked these four to the rdma tree, thanks

Jason
Re: [PATCH cryptodev] crypto: lib/chacha20poly1305 - allow users to specify 96bit nonce
On Tue, Nov 17, 2020 at 9:32 AM Ard Biesheuvel wrote:
> If you are going back to the drawing board with in-kernel acceleration
> for OpenVPN

As far as I can tell, they're mostly after compatibility with their existing userspace stuff. Otherwise, if they were going back to the drawing board, they could just make openvpn userspace set up xfrm or wg tunnels to achieve basically the same design. And actually, the xfrm approach kind of makes a lot of sense for what they're doing; it was designed for that type of split-daemon tunneling design.
Re: [PATCH cryptodev] crypto: lib/chacha20poly1305 - allow users to specify 96bit nonce
Nack. This API is meant to take simple integers, so that programmers can use atomic64_t with it and have safe nonces. I'm also interested in preserving the API's ability to safely encrypt more than 4 gigs of data at once. Passing a buffer also encourages people to use randomized nonces, which isn't really safe. Finally, there are no in-tree users of 96-bit nonces for this interface.

If you're after a cornucopia of compatibility primitives, the ipsec stuff might be a better fit. Or, add a new simple function/API. But adding complexity for users of the existing one and confusing future users of it is a non-starter. It's supposed to be deliberately non-awful to use.
Re: [PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager
On Fri, Nov 13, 2020 at 8:58 PM Herbert Xu wrote:
> On Fri, Nov 13, 2020 at 03:34:36PM +0100, Jason A. Donenfeld wrote:
> > Thanks. FYI, I intended this for crypto-2.6.git rather than
> > cryptodev-2.6.git, since it fixes a build failure and is a trivial
> > fix.
>
> Well this has been broken since January so I don't see the urgency
> in it going in right away. It'll be backported eventually.

Okay, no problem.
Re: [PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager
Thanks. FYI, I intended this for crypto-2.6.git rather than cryptodev-2.6.git, since it fixes a build failure and is a trivial fix.
Re: [PATCH] crypto: sha - split sha.h into sha1.h and sha2.h
Thanks for doing this.

Acked-by: Jason A. Donenfeld
[PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager
The extra tests in the manager actually require the manager to be selected too. Otherwise the linker gives errors like:

ld: arch/x86/crypto/chacha_glue.o: in function `chacha_simd_stream_xor':
chacha_glue.c:(.text+0x422): undefined reference to `crypto_simd_disabled_for_test'

Fixes: 2343d1529aff ("crypto: Kconfig - allow tests to be disabled when manager is disabled")
Signed-off-by: Jason A. Donenfeld
---
 crypto/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 094ef56ab7b4..37de7d006858 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -145,7 +145,7 @@ config CRYPTO_MANAGER_DISABLE_TESTS
 config CRYPTO_MANAGER_EXTRA_TESTS
 	bool "Enable extra run-time crypto self tests"
-	depends on DEBUG_KERNEL && !CRYPTO_MANAGER_DISABLE_TESTS
+	depends on DEBUG_KERNEL && !CRYPTO_MANAGER_DISABLE_TESTS && CRYPTO_MANAGER
 	help
 	  Enable extra run-time self tests of registered crypto algorithms,
 	  including randomized fuzz tests.
--
2.29.1
Re: [PATCH] crypto: arm/chacha-neon - optimize for non-block size multiples
Cool patch! I look forward to getting out the old arm32 rig and benching this. One question:

On Sun, Nov 1, 2020 at 5:33 PM Ard Biesheuvel wrote:
> On out-of-order microarchitectures such as Cortex-A57, this results in
> a speedup for 1420 byte blocks of about 21%, without any significant
> performance impact on the power-of-2 block sizes. On lower-end cores
> such as Cortex-A53, the speedup for 1420 byte blocks is only about 2%,
> but also without impacting other input sizes.

A57 and A53 are 64-bit, but this is code for 32-bit ARM, right? So the comparison is more like A15 vs A5? Or are you running 32-bit kernels on armv8 hardware?
Re: [RFC] treewide: cleanup unreachable breaks
On Mon, Oct 19, 2020 at 12:42:15PM -0700, Nick Desaulniers wrote:
> On Sat, Oct 17, 2020 at 10:43 PM Greg KH wrote:
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > From: Tom Rix
> > >
> > > This is an upcoming change to clean up a new warning treewide.
> > > I am wondering if the change could be one mega patch (see below), a
> > > normal patch per file (about 100 patches), or somewhere halfway by
> > > collecting early acks.
> >
> > Please break it up into one-patch-per-subsystem, like normal, and get it
> > merged that way.
> >
> > Sending us a patch, without even a diffstat to review, isn't going to
> > get you very far...
>
> Tom,
> If you're able to automate this cleanup, I suggest checking in a
> script that can be run on a directory. Then for each subsystem you
> can say in your commit "I ran scripts/fix_whatever.py on this subdir."
> Then others can help you drive the tree-wide cleanup. Then we can
> enable -Wunreachable-code-break either by default, or W=2 right now
> might be a good idea.

I remember using clang-modernize in the past to fix issues very similar to this; if the clang machinery can generate the warning, can't something like clang-tidy directly generate the patch?

You can send me a patch for drivers/infiniband/* as well.

Thanks,
Jason
Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance
I haven't looked into the details of this patchset yet, but your description here indicates to me that this is motivated by FIPS certification desires, which...worries me. I would like to rewrite the RNG at some point, and I've started to work on a bunch of designs for this (and proving them correct, too), but going about this via FIPS certification or trying to implement some NIST specs is most certainly the wrong way to go about this, will lock us into subpar crypto for years, and is basically a waste of time.
Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote: > fallthrough to a separate case/default label break; isn't very readable. > > Convert pseudo-keyword fallthrough; statements to a simple break; when > the next label is case or default and the only statement in the next > label block is break; > > Found using: > > $ grep-2.5.4 -rP --include=*.[ch] -n > "fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" * > > Miscellanea: > > o Move or coalesce a couple label blocks above a default: block. > > Signed-off-by: Joe Perches > --- > > Compiled allyesconfig x86-64 only. > A few files for other arches were not compiled. IB part looks OK, I prefer it like this You could do the same for continue as well, I saw a few of those.. Thanks, Jason
Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c
rbx, %%r10;"" movq %%r10, 96(%0);" > + " mulxq 32(%3), %%r8, %%r9;" " xor %%r10d, %%r10d;" " > adcxq 88(%0), %%r8;" " movq %%r8, 88(%0);" > + " mulxq 40(%3), %%r10, %%r11;"" adox %%r9, %%r10;" " > adcx %%rbx, %%r10;"" movq %%r10, 96(%0);" > " mulxq 48(%3), %%rbx, %%r13;"" adox %%r11, %%rbx;"" > adcx %%r14, %%rbx;"" movq %%rbx, 104(%0);"" mov $0, %%r8;" > " mulxq 56(%3), %%r14, %%rdx;"" adox %%r13, %%r14;"" > adcx %%rax, %%r14;"" movq %%r14, 112(%0);"" mov $0, %%rax;" > " adox %%rdx, %%rax;"" > adcx %%r8, %%rax;" " movq %%rax, 120(%0);" > @@ -312,7 +312,7 @@ static inline void fmul2(u64 *out, const u64 *f1, const > u64 *f2, u64 *tmp) > /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ > " mov $38, %%rdx;" > " mulxq 32(%1), %%r8, %%r13;" > - " xor %3, %3;" > + " xor %k3, %k3;" > " adoxq 0(%1), %%r8;" > " mulxq 40(%1), %%r9, %%rbx;" > " adcx %%r13, %%r9;" > @@ -345,7 +345,7 @@ static inline void fmul2(u64 *out, const u64 *f1, const > u64 *f2, u64 *tmp) > /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ > " mov $38, %%rdx;" > " mulxq 96(%1), %%r8, %%r13;" > - " xor %3, %3;" > + " xor %k3, %k3;" > " adoxq 64(%1), %%r8;" > " mulxq 104(%1), %%r9, %%rbx;" > " adcx %%r13, %%r9;" > @@ -516,7 +516,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) > > /* Step 1: Compute all partial products */ > " movq 0(%1), %%rdx;" /* > f[0] */ > - " mulxq 8(%1), %%r8, %%r14;" " xor %%r15, %%r15;" /* > f[1]*f[0] */ > + " mulxq 8(%1), %%r8, %%r14;" " xor %%r15d, %%r15d;" /* > f[1]*f[0] */ > " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* > f[2]*f[0] */ > " mulxq 24(%1), %%rax, %%rcx;"" adcx %%rax, %%r10;"/* > f[3]*f[0] */ > " movq 24(%1), %%rdx;" /* > f[3] */ > @@ -526,7 +526,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) > " mulxq 16(%1), %%rax, %%rcx;"" mov $0, %%r14;"/* > f[2]*f[1] */ > > /* Step 2: Compute two parallel carry chains */ > - " xor %%r15, %%r15;" > + " xor %%r15d, %%r15d;" > " adox %%rax, %%r10;" > " adcx %%r8, %%r8;" > " 
adox %%rcx, %%r11;" > @@ -563,7 +563,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp) > /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ > " mov $38, %%rdx;" > " mulxq 32(%1), %%r8, %%r13;" > - " xor %%rcx, %%rcx;" > + " xor %%ecx, %%ecx;" > " adoxq 0(%1), %%r8;" > " mulxq 40(%1), %%r9, %%rbx;" > " adcx %%r13, %%r9;" > @@ -607,7 +607,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > asm volatile( > /* Step 1: Compute all partial products */ > " movq 0(%1), %%rdx;" /* > f[0] */ > - " mulxq 8(%1), %%r8, %%r14;" " xor %%r15, %%r15;" /* > f[1]*f[0] */ > + " mulxq 8(%1), %%r8, %%r14;" " xor %%r15d, %%r15d;" /* > f[1]*f[0] */ > " mulxq 16(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* > f[2]*f[0] */ > " mulxq 24(%1), %%rax, %%rcx;"" adcx %%rax, %%r10;"/* > f[3]*f[0] */ > " movq 24(%1), %%rdx;" /* > f[3] */ > @@ -617,7 +617,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > " mulxq 16(%1), %%rax, %%rcx;"" mov $0, %%r14;"/* > f[2]*f[1] */ > > /* Step 2: Compute two parallel carry chains */ > - " xor %%r15, %%r15;" > + " xor %%r15d, %%r15d;" > " adox %%rax, %%r10;" > " adcx %%r8, %%r8;" > " adox %%rcx, %%r11;" > @@ -647,7 +647,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > > /* Step 1: Compute all partial products */ > " movq 32(%1), %%rdx;" > /* f[0] */ > - " mulxq 40(%1), %%r8, %%r14;" " xor %%r15, %%r15;" > /* f[1]*f[0] */ > + " mulxq 40(%1), %%r8, %%r14;" " xor %%r15d, %%r15d;" /* > f[1]*f[0] */ > " mulxq 48(%1), %%r9, %%r10;" " adcx %%r14, %%r9;" /* > f[2]*f[0] */ > " mulxq 56(%1), %%rax, %%rcx;"" adcx %%rax, %%r10;"/* > f[3]*f[0] */ > " movq 56(%1), %%rdx;" /* > f[3] */ > @@ -657,7 +657,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > " mulxq 48(%1), %%rax, %%rcx;"" mov $0, %%r14;"/* > f[2]*f[1] */ > > /* Step 2: Compute two parallel carry chains */ > - " xor %%r15, %%r15;" > + " xor %%r15d, %%r15d;" > " adox %%rax, %%r10;" > " adcx %%r8, %%r8;" > " adox %%rcx, %%r11;" > @@ -692,7 
+692,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ > " mov $38, %%rdx;" > " mulxq 32(%1), %%r8, %%r13;" > - " xor %%rcx, %%rcx;" > + " xor %%ecx, %%ecx;" > " adoxq 0(%1), %%r8;" > " mulxq 40(%1), %%r9, %%rbx;" > " adcx %%r13, %%r9;" > @@ -725,7 +725,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp) > /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */ > " mov $38, %%rdx;" > " mulxq 96(%1), %%r8, %%r13;" > - " xor %%rcx, %%rcx;" > + " xor %%ecx, %%ecx;" > " adoxq 64(%1), %%r8;" > " mulxq 104(%1), %%r9, %%rbx;" > " adcx %%r13, %%r9;" > -- > 2.26.2 > Looks like this is going into HACL in the end, so: Acked-by: Jason A. Donenfeld for cryptodev-2.6.git, rather than crypto-2.6.git Thanks, Jason
Re: [PATCH] crypto/x86: Use XORL r32,32 in poly1305-x86_64-cryptogams.pl
Hi Uros, Herbert, On Thu, Aug 27, 2020 at 07:38:31PM +0200, Uros Bizjak wrote: > x86_64 zero extends 32bit operations, so for 64bit operands, > XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids > a REX prefix byte when legacy registers are used. > > Signed-off-by: Uros Bizjak > Cc: Herbert Xu > Cc: "David S. Miller" > --- > arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > index 137edcf038cb..7d568012cc15 100644 > --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > @@ -246,7 +246,7 @@ $code.=<<___ if (!$kernel); > ___ > &declare_function("poly1305_init_x86_64", 32, 3); > $code.=<<___; > - xor %rax,%rax > + xor %eax,%eax > mov %rax,0($ctx)# initialize hash value > mov %rax,8($ctx) > mov %rax,16($ctx) > @@ -2853,7 +2853,7 @@ $code.=<<___; > .typepoly1305_init_base2_44,\@function,3 > .align 32 > poly1305_init_base2_44: > - xor %rax,%rax > + xor %eax,%eax > mov %rax,0($ctx)# initialize hash value > mov %rax,8($ctx) > mov %rax,16($ctx) > @@ -3947,7 +3947,7 @@ xor128_decrypt_n_pad: > mov \$16,$len > sub %r10,$len > xor %eax,%eax > - xor %r11,%r11 > + xor %r11d,%r11d > .Loop_dec_byte: > mov ($inp,$otp),%r11b > mov ($otp),%al > @@ -4085,7 +4085,7 @@ avx_handler: > .long 0xa548f3fc # cld; rep movsq > > mov $disp,%rsi > - xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER > + xor %ecx,%ecx # arg1, UNW_FLAG_NHANDLER > mov 8(%rsi),%rdx# arg2, disp->ImageBase > mov 0(%rsi),%r8 # arg3, disp->ControlPc > mov 16(%rsi),%r9# arg4, disp->FunctionEntry > -- > 2.26.2 > Per the discussion elsewhere, Acked-by: Jason A. Donenfeld for cryptodev-2.6.git, rather than crypto-2.6.git Thanks, Jason
Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c
On Wed, Sep 2, 2020 at 1:42 PM Uros Bizjak wrote: > > On Wed, Sep 2, 2020 at 11:17 AM wrote: > > > > On Wed, Sep 02, 2020 at 07:50:36AM +0200, Uros Bizjak wrote: > > > On Tue, Sep 1, 2020 at 9:12 PM Jason A. Donenfeld wrote: > > > > > > > > On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld > > > > wrote: > > > > > operands are the same. Also, have you seen any measurable differences > > > > > when benching this? I can stick it into kbench9000 to see if you > > > > > haven't looked yet. > > > > > > > > On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable > > > > difference with this, at all, no matter how much I median or mean or > > > > reduce noise by disabling interrupts. > > > > > > > > One thing that sticks out is that all the replacements of r8-r15 by > > > > their %r8d-r15d counterparts still have the REX prefix, as is > > > > necessary to access those registers. The only ones worth changing, > > > > then, are the legacy registers, and on a whole, this amounts to only > > > > 48 bytes of difference. > > > > > > The patch implements one of x86 target specific optimizations, > > > performed by gcc. The optimization results in code size savings of one > > > byte, where REX prefix is omitted with legacy registers, but otherwise > > > should have no measurable runtime effect. Since gcc applies this > > > optimization universally to all integer registers, I took the same > > > approach and implemented the same change to legacy and REX registers. > > > As measured above, 48 bytes saved is a good result for such a trivial > > > optimization. > > > > Could we instead implement this optimization in GAS ? Then we can leave > > the code as-is. > > I don't think that e.g. "xorq %rax,%rax" should universally be > translated to "xorl %eax,%eax" in the assembler. Perhaps the author > expected exactly 3 bytes (to align the code or similar), and the > assembler would change the length to 2 bytes behind his back, breaking > the expectations. 
Are you sure that's something that GAS actually provides now? Seems like a lot of mnemonics have ambiguous/non-injective encodings, and this wouldn't make things any different. Most authors use .byte or .align when they care about the actual bytes, no?
Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c
On Wed, Sep 2, 2020 at 11:44 AM wrote: > > On Wed, Sep 02, 2020 at 07:50:36AM +0200, Uros Bizjak wrote: > > On Tue, Sep 1, 2020 at 9:12 PM Jason A. Donenfeld wrote: > > > > > > On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld wrote: > > > > operands are the same. Also, have you seen any measurable differences > > > > when benching this? I can stick it into kbench9000 to see if you > > > > haven't looked yet. > > > > > > On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable > > > difference with this, at all, no matter how much I median or mean or > > > reduce noise by disabling interrupts. > > > > > > One thing that sticks out is that all the replacements of r8-r15 by > > > their %r8d-r15d counterparts still have the REX prefix, as is > > > necessary to access those registers. The only ones worth changing, > > > then, are the legacy registers, and on a whole, this amounts to only > > > 48 bytes of difference. > > > > The patch implements one of x86 target specific optimizations, > > performed by gcc. The optimization results in code size savings of one > > byte, where REX prefix is omitted with legacy registers, but otherwise > > should have no measurable runtime effect. Since gcc applies this > > optimization universally to all integer registers, I took the same > > approach and implemented the same change to legacy and REX registers. > > As measured above, 48 bytes saved is a good result for such a trivial > > optimization. > > Could we instead implement this optimization in GAS ? Then we can leave > the code as-is. I rather like that idea. Though I wonder if some would balk at it for smelling a bit like the MIPS assembler with its optimization pass...
Re: [PATCH] crypto/x86: Use XORL r32,32 in poly1305-x86_64-cryptogams.pl
Hi Uros, Any benchmarks for this? Seems like it's all in initialization code, right? I'm CC'ing Andy into this. Jason On Thu, Aug 27, 2020 at 07:38:31PM +0200, Uros Bizjak wrote: > x86_64 zero extends 32bit operations, so for 64bit operands, > XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids > a REX prefix byte when legacy registers are used. > > Signed-off-by: Uros Bizjak > Cc: Herbert Xu > Cc: "David S. Miller" > --- > arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > index 137edcf038cb..7d568012cc15 100644 > --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl > @@ -246,7 +246,7 @@ $code.=<<___ if (!$kernel); > ___ > &declare_function("poly1305_init_x86_64", 32, 3); > $code.=<<___; > - xor %rax,%rax > + xor %eax,%eax > mov %rax,0($ctx)# initialize hash value > mov %rax,8($ctx) > mov %rax,16($ctx) > @@ -2853,7 +2853,7 @@ $code.=<<___; > .typepoly1305_init_base2_44,\@function,3 > .align 32 > poly1305_init_base2_44: > - xor %rax,%rax > + xor %eax,%eax > mov %rax,0($ctx)# initialize hash value > mov %rax,8($ctx) > mov %rax,16($ctx) > @@ -3947,7 +3947,7 @@ xor128_decrypt_n_pad: > mov \$16,$len > sub %r10,$len > xor %eax,%eax > - xor %r11,%r11 > + xor %r11d,%r11d > .Loop_dec_byte: > mov ($inp,$otp),%r11b > mov ($otp),%al > @@ -4085,7 +4085,7 @@ avx_handler: > .long 0xa548f3fc # cld; rep movsq > > mov $disp,%rsi > - xor %rcx,%rcx # arg1, UNW_FLAG_NHANDLER > + xor %ecx,%ecx # arg1, UNW_FLAG_NHANDLER > mov 8(%rsi),%rdx# arg2, disp->ImageBase > mov 0(%rsi),%r8 # arg3, disp->ControlPc > mov 16(%rsi),%r9# arg4, disp->FunctionEntry > -- > 2.26.2 > -- Jason A. Donenfeld Deep Space Explorer fr: +33 6 51 90 82 66 us: +1 513 476 1200 www.jasondonenfeld.com www.zx2c4.com zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc
Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c
On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld wrote: > operands are the same. Also, have you seen any measurable differences > when benching this? I can stick it into kbench9000 to see if you > haven't looked yet. On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable difference with this, at all, no matter how much I median or mean or reduce noise by disabling interrupts. One thing that sticks out is that all the replacements of r8-r15 by their %r8d-r15d counterparts still have the REX prefix, as is necessary to access those registers. The only ones worth changing, then, are the legacy registers, and on a whole, this amounts to only 48 bytes of difference.
Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c
Hi Uros, Thanks for this patch. This seems correct to me, since indeed those clear the top bits anyway, and having smaller code seems good. But first -- I'm wondering -- have you stuck this into Vale/Hacl to see if it still checks out there? I'm CC'ing Karthik/Aymeric/Chris/Jonathan who might be interested in taking a look at that. Seems like it might be easy to just special case this in Xor64 for the case where the operands are the same. Also, have you seen any measurable differences when benching this? I can stick it into kbench9000 to see if you haven't looked yet. Jason On Tue, Sep 1, 2020 at 5:46 PM Ard Biesheuvel wrote: > > (+ Jason) > > On Thu, 27 Aug 2020 at 20:31, Uros Bizjak wrote: > > > > x86_64 zero extends 32bit operations, so for 64bit operands, > > XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids > > a REX prefix byte when legacy registers are used. > > > > Signed-off-by: Uros Bizjak > > Cc: Herbert Xu > > Cc: "David S. Miller" > > --- > > arch/x86/crypto/curve25519-x86_64.c | 68 ++--- > > 1 file changed, 34 insertions(+), 34 deletions(-) > > > > diff --git a/arch/x86/crypto/curve25519-x86_64.c > > b/arch/x86/crypto/curve25519-x86_64.c > > index 8acbb6584a37..a9edb6f8a0ba 100644 > > --- a/arch/x86/crypto/curve25519-x86_64.c > > +++ b/arch/x86/crypto/curve25519-x86_64.c > > @@ -45,11 +45,11 @@ static inline u64 add_scalar(u64 *out, const u64 *f1, > > u64 f2) > > > > asm volatile( > > /* Clear registers to propagate the carry bit */ > > - " xor %%r8, %%r8;" > > - " xor %%r9, %%r9;" > > - " xor %%r10, %%r10;" > > - " xor %%r11, %%r11;" > > - " xor %1, %1;" > > + " xor %%r8d, %%r8d;" > > + " xor %%r9d, %%r9d;" > > + " xor %%r10d, %%r10d;" > > + " xor %%r11d, %%r11d;" > > + " xor %k1, %k1;" > > > > /* Begin addition chain */ > > " addq 0(%3), %0;" > > @@ -93,7 +93,7 @@ static inline void fadd(u64 *out, const u64 *f1, const > > u64 *f2) > > " cmovc %0, %%rax;" > > > > /* Step 2: Add carry*38 to the original sum */ > > - " xor %%rcx, %%rcx;" > 
> + " xor %%ecx, %%ecx;" > > " add %%rax, %%r8;" > > " adcx %%rcx, %%r9;" > > " movq %%r9, 8(%1);" > > @@ -165,28 +165,28 @@ static inline void fmul(u64 *out, const u64 *f1, > > const u64 *f2, u64 *tmp) > > > > /* Compute src1[0] * src2 */ > > " movq 0(%1), %%rdx;" > > - " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" > > " movq %%r8, 0(%0);" > > + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10d, %%r10d;" > > " movq %%r8, 0(%0);" > > " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" > > " movq %%r10, 8(%0);" > > " mulxq 16(%3), %%rbx, %%r13;"" adox %%r11, %%rbx;" > > " mulxq 24(%3), %%r14, %%rdx;"" adox %%r13, %%r14;" > > " mov $0, %%rax;" > >" adox %%rdx, %%rax;" > > /* Compute src1[1] * src2 */ > > " movq 8(%1), %%rdx;" > > - " mulxq 0(%3), %%r8, %%r9;" " xor %%r10, %%r10;" > > " adcxq 8(%0), %%r8;"" movq %%r8, 8(%0);" > > + " mulxq 0(%3), %%r8, %%r9;" " xor %%r10d, %%r10d;" > > " adcxq 8(%0), %%r8;"" movq %%r8, 8(%0);" > > " mulxq 8(%3), %%r10, %%r11;" " adox %%r9, %%r10;" > > " adcx %%rbx, %%r10;"" movq %%r10, 16(%0);" > > " mulxq 16(%3), %%rbx, %%r13;"" adox %%r11, %%rbx;" > > " adcx %%r14, %%rbx;"" mov $0, %%r8;" > > " mulxq 24(%3), %%r14, %%rdx;"" adox %%r13, %%r14;" > > " adcx %%rax, %%r1
Re: [PATCH] crypto: arm/poly1305 - Add prototype for poly1305_blocks_neon
On Tue, Aug 25, 2020 at 3:23 AM Herbert Xu wrote: > > This patch adds a prototype for poly1305_blocks_neon to silence > a compiler warning: > > CC [M] arch/arm/crypto/poly1305-glue.o > ../arch/arm/crypto/poly1305-glue.c:25:13: warning: no previous prototype for > `poly1305_blocks_neon' [-Wmissing-prototypes] > void __weak poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 > hibit) > ^~~~ > > Signed-off-by: Herbert Xu > > diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c > index 13cfef4ae22e..3023c1acfa19 100644 > --- a/arch/arm/crypto/poly1305-glue.c > +++ b/arch/arm/crypto/poly1305-glue.c > @@ -20,6 +20,7 @@ > > void poly1305_init_arm(void *state, const u8 *key); > void poly1305_blocks_arm(void *state, const u8 *src, u32 len, u32 hibit); > +void poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 hibit); LGTM. Acked-by: Jason A. Donenfeld
Re: [PATCH] crypto: arm/curve25519 - include <linux/scatterlist.h>
On Mon, Aug 24, 2020 at 4:13 PM Fabio Estevam wrote: > > Building ARM allmodconfig leads to the following warnings: > > arch/arm/crypto/curve25519-glue.c:73:12: error: implicit declaration of > function 'sg_copy_to_buffer' [-Werror=implicit-function-declaration] > arch/arm/crypto/curve25519-glue.c:74:9: error: implicit declaration of > function 'sg_nents_for_len' [-Werror=implicit-function-declaration] > arch/arm/crypto/curve25519-glue.c:88:11: error: implicit declaration of > function 'sg_copy_from_buffer' [-Werror=implicit-function-declaration] > > Include <linux/scatterlist.h> to fix such warnings. This patch seems correct to me -- sg_copy_to_buffer and sg_nents_for_len are declared there. I wonder what header dependency chain caused us to miss this before. Either way, Acked-by: Jason A. Donenfeld
Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t
On Tue, Jul 28, 2020 at 10:07 AM David Laight wrote: > > From: Christoph Hellwig > > Sent: 27 July 2020 17:24 > > > > On Mon, Jul 27, 2020 at 06:16:32PM +0200, Jason A. Donenfeld wrote: > > > Maybe sockptr_advance should have some safety checks and sometimes > > > return -EFAULT? Or you should always use the implementation where > > > being a kernel address is an explicit bit of sockptr_t, rather than > > > being implicit? > > > > I already have a patch to use access_ok to check the whole range in > > init_user_sockptr. > > That doesn't make (much) difference to the code paths that ignore > the user-supplied length. > OTOH doing the user/kernel check on the base address (not an > incremented one) means that the correct copy function is always > selected. Right, I had the same reaction in reading this, but actually, his code gets rid of the sockptr_advance stuff entirely and never mutates, so even though my point about attacking those pointers was missed, the code does the better thing now -- checking the base address and never mutating the pointer. So I think we're good. > > Perhaps the functions should all be passed a 'const sockptr_t'. > The typedef could be made 'const' - requiring non-const items > explicitly use the union/struct itself. I was thinking the same, but just by making the pointers inside the struct const. However, making the whole struct const via the typedef is a much better idea. That'd probably require changing the signature of init_user_sockptr a bit, which would be fine, but indeed I think this would be a very positive change. Jason
Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t
> From cce2d2e1b43ecee5f4af7cf116808b74b330080f Mon Sep 17 00:00:00 2001 > From: Christoph Hellwig > Date: Mon, 27 Jul 2020 17:42:27 +0200 > Subject: net: remove sockptr_advance > > sockptr_advance never properly worked. Replace it with _offset variants > of copy_from_sockptr and copy_to_sockptr. > > Fixes: ba423fdaa589 ("net: add a new sockptr_t type") > Reported-by: Jason A. Donenfeld > Reported-by: Ido Schimmel > Signed-off-by: Christoph Hellwig > --- > drivers/crypto/chelsio/chtls/chtls_main.c | 12 +- > include/linux/sockptr.h | 27 +++ > net/dccp/proto.c | 5 ++--- > net/ipv4/netfilter/arp_tables.c | 8 +++ > net/ipv4/netfilter/ip_tables.c| 8 +++ > net/ipv4/tcp.c| 5 +++-- > net/ipv6/ip6_flowlabel.c | 11 - > net/ipv6/netfilter/ip6_tables.c | 8 +++ > net/netfilter/x_tables.c | 7 +++--- > net/tls/tls_main.c| 6 ++--- > 10 files changed, 49 insertions(+), 48 deletions(-) > > diff --git a/drivers/crypto/chelsio/chtls/chtls_main.c > b/drivers/crypto/chelsio/chtls/chtls_main.c > index c3058dcdb33c5c..66d247efd5615b 100644 > --- a/drivers/crypto/chelsio/chtls/chtls_main.c > +++ b/drivers/crypto/chelsio/chtls/chtls_main.c > @@ -525,9 +525,9 @@ static int do_chtls_setsockopt(struct sock *sk, int > optname, > /* Obtain version and type from previous copy */ > crypto_info[0] = tmp_crypto_info; > /* Now copy the following data */ > - sockptr_advance(optval, sizeof(*crypto_info)); > - rc = copy_from_sockptr((char *)crypto_info + > sizeof(*crypto_info), > - optval, > + rc = copy_from_sockptr_offset((char *)crypto_info + > + sizeof(*crypto_info), > + optval, sizeof(*crypto_info), > sizeof(struct tls12_crypto_info_aes_gcm_128) > - sizeof(*crypto_info)); > > @@ -542,9 +542,9 @@ static int do_chtls_setsockopt(struct sock *sk, int > optname, > } > case TLS_CIPHER_AES_GCM_256: { > crypto_info[0] = tmp_crypto_info; > - sockptr_advance(optval, sizeof(*crypto_info)); > - rc = copy_from_sockptr((char *)crypto_info + > sizeof(*crypto_info), > - optval, > + rc = 
copy_from_sockptr_offset((char *)crypto_info + > + sizeof(*crypto_info), > + optval, sizeof(*crypto_info), > sizeof(struct tls12_crypto_info_aes_gcm_256) > - sizeof(*crypto_info)); > > diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h > index b13ea1422f93a5..9e6c81d474cba8 100644 > --- a/include/linux/sockptr.h > +++ b/include/linux/sockptr.h > @@ -69,19 +69,26 @@ static inline bool sockptr_is_null(sockptr_t sockptr) > return !sockptr.user; > } > > -static inline int copy_from_sockptr(void *dst, sockptr_t src, size_t size) > +static inline int copy_from_sockptr_offset(void *dst, sockptr_t src, > + size_t offset, size_t size) > { > if (!sockptr_is_kernel(src)) > - return copy_from_user(dst, src.user, size); > - memcpy(dst, src.kernel, size); > + return copy_from_user(dst, src.user + offset, size); > + memcpy(dst, src.kernel + offset, size); > return 0; > } > > -static inline int copy_to_sockptr(sockptr_t dst, const void *src, size_t > size) > +static inline int copy_from_sockptr(void *dst, sockptr_t src, size_t size) > +{ > + return copy_from_sockptr_offset(dst, src, 0, size); > +} > + > +static inline int copy_to_sockptr_offset(sockptr_t dst, size_t offset, > + const void *src, size_t size) > { > if (!sockptr_is_kernel(dst)) > - return copy_to_user(dst.user, src, size); > - memcpy(dst.kernel, src, size); > + return copy_to_user(dst.user + offset, src, size); > + memcpy(dst.kernel + offset, src, size); > return 0; > } > > @@ -112,14 +119,6 @@ static inline void *memdup_sockptr_nul(sockptr_t src, > size_t len) > return p; > } > > -static inline void sockptr_advance(sockptr_t sockptr, size_t len) > -{ > - if (sockptr_is_kernel(sockptr)) > - sockptr.kernel += len; > - else > - sockptr.user += len; > -} > - > static i
Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t
On Mon, Jul 27, 2020 at 5:06 PM Christoph Hellwig wrote: > > On Mon, Jul 27, 2020 at 05:03:10PM +0200, Jason A. Donenfeld wrote: > > Hi Christoph, > > > > On Thu, Jul 23, 2020 at 08:08:54AM +0200, Christoph Hellwig wrote: > > > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c > > > index da933f99b5d517..42befbf12846c0 100644 > > > --- a/net/ipv4/ip_sockglue.c > > > +++ b/net/ipv4/ip_sockglue.c > > > @@ -1422,7 +1422,8 @@ int ip_setsockopt(struct sock *sk, int level, > > > optname != IP_IPSEC_POLICY && > > > optname != IP_XFRM_POLICY && > > > !ip_mroute_opt(optname)) > > > - err = nf_setsockopt(sk, PF_INET, optname, optval, optlen); > > > + err = nf_setsockopt(sk, PF_INET, optname, > > > USER_SOCKPTR(optval), > > > + optlen); > > > #endif > > > return err; > > > } > > > diff --git a/net/ipv4/netfilter/ip_tables.c > > > b/net/ipv4/netfilter/ip_tables.c > > > index 4697d09c98dc3e..f2a9680303d8c0 100644 > > > --- a/net/ipv4/netfilter/ip_tables.c > > > +++ b/net/ipv4/netfilter/ip_tables.c > > > @@ -1102,7 +1102,7 @@ __do_replace(struct net *net, const char *name, > > > unsigned int valid_hooks, > > > } > > > > > > static int > > > -do_replace(struct net *net, const void __user *user, unsigned int len) > > > +do_replace(struct net *net, sockptr_t arg, unsigned int len) > > > { > > > int ret; > > > struct ipt_replace tmp; > > > @@ -1110,7 +1110,7 @@ do_replace(struct net *net, const void __user > > > *user, unsigned int len) > > > void *loc_cpu_entry; > > > struct ipt_entry *iter; > > > > > > - if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) > > > + if (copy_from_sockptr(&tmp, arg, sizeof(tmp)) != 0) > > > return -EFAULT; > > > > > > /* overflow check */ > > > @@ -1126,8 +1126,8 @@ do_replace(struct net *net, const void __user > > > *user, unsigned int len) > > > return -ENOMEM; > > > > > > loc_cpu_entry = newinfo->entries; > > > - if (copy_from_user(loc_cpu_entry, user + sizeof(tmp), > > > - tmp.size) != 0) { > > > + sockptr_advance(arg, sizeof(tmp)); > > > 
+ if (copy_from_sockptr(loc_cpu_entry, arg, tmp.size) != 0) { > > > ret = -EFAULT; > > > goto free_newinfo; > > > } > > > > Something along this path seems to have broken with this patch. An > > invocation of `iptables -A INPUT -m length --length 1360 -j DROP` now > > fails, with > > > > nf_setsockopt->do_replace->translate_table->check_entry_size_and_hooks: > > (unsigned char *)e + e->next_offset > limit ==> TRUE > > > > resulting in the whole call chain returning -EINVAL. It bisects back to > > this commit. This is on net-next. > > This is another use o sockptr_advance that Ido already found a problem > in. I'm looking into this at the moment.. I haven't seen Ido's patch, but it seems clear the issue is that you want to call `sockptr_advance(&arg, sizeof(tmp))`, and adjust sockptr_advance to take a pointer. Slight concern about the whole concept: Things are defined as typedef union { void*kernel; void __user *user; } sockptr_t; static inline bool sockptr_is_kernel(sockptr_t sockptr) { return (unsigned long)sockptr.kernel >= TASK_SIZE; } So what happens if we have some code like: sockptr_t sp; init_user_sockptr(&sp, user_controlled_struct.extra_user_ptr); sockptr_advance(&sp, user_controlled_struct.some_big_offset); copy_to_sockptr(&sp, user_controlled_struct.a_few_bytes, sizeof(user_controlled_struct.a_few_bytes)); With the user controlling some_big_offset, he can convert the user sockptr into a kernel sockptr, causing the subsequent copy_to_sockptr to be a vanilla memcpy, after which a security disaster ensues. Maybe sockptr_advance should have some safety checks and sometimes return -EFAULT? Or you should always use the implementation where being a kernel address is an explicit bit of sockptr_t, rather than being implicit?
Re: [PATCH] crypto: x86/curve25519 - Remove unused carry variables
On Mon, Jul 27, 2020 at 5:08 PM Karthik Bhargavan wrote: > > Removing unused variables is harmless. (GCC would do this automatically.) > So this change seems fine. Thanks for confirming. Hopefully we can get that change upstream in HACL* too, so that the code comes out the same. Acked-by: Jason A. Donenfeld
Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t
Hi Christoph, On Thu, Jul 23, 2020 at 08:08:54AM +0200, Christoph Hellwig wrote: > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c > index da933f99b5d517..42befbf12846c0 100644 > --- a/net/ipv4/ip_sockglue.c > +++ b/net/ipv4/ip_sockglue.c > @@ -1422,7 +1422,8 @@ int ip_setsockopt(struct sock *sk, int level, > optname != IP_IPSEC_POLICY && > optname != IP_XFRM_POLICY && > !ip_mroute_opt(optname)) > - err = nf_setsockopt(sk, PF_INET, optname, optval, optlen); > + err = nf_setsockopt(sk, PF_INET, optname, USER_SOCKPTR(optval), > + optlen); > #endif > return err; > } > diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c > index 4697d09c98dc3e..f2a9680303d8c0 100644 > --- a/net/ipv4/netfilter/ip_tables.c > +++ b/net/ipv4/netfilter/ip_tables.c > @@ -1102,7 +1102,7 @@ __do_replace(struct net *net, const char *name, > unsigned int valid_hooks, > } > > static int > -do_replace(struct net *net, const void __user *user, unsigned int len) > +do_replace(struct net *net, sockptr_t arg, unsigned int len) > { > int ret; > struct ipt_replace tmp; > @@ -1110,7 +1110,7 @@ do_replace(struct net *net, const void __user *user, > unsigned int len) > void *loc_cpu_entry; > struct ipt_entry *iter; > > - if (copy_from_user(&tmp, user, sizeof(tmp)) != 0) > + if (copy_from_sockptr(&tmp, arg, sizeof(tmp)) != 0) > return -EFAULT; > > /* overflow check */ > @@ -1126,8 +1126,8 @@ do_replace(struct net *net, const void __user *user, > unsigned int len) > return -ENOMEM; > > loc_cpu_entry = newinfo->entries; > - if (copy_from_user(loc_cpu_entry, user + sizeof(tmp), > -tmp.size) != 0) { > + sockptr_advance(arg, sizeof(tmp)); > + if (copy_from_sockptr(loc_cpu_entry, arg, tmp.size) != 0) { > ret = -EFAULT; > goto free_newinfo; > } Something along this path seems to have broken with this patch. 
An invocation of `iptables -A INPUT -m length --length 1360 -j DROP` now fails, with nf_setsockopt->do_replace->translate_table->check_entry_size_and_hooks: (unsigned char *)e + e->next_offset > limit ==> TRUE resulting in the whole call chain returning -EINVAL. It bisects back to this commit. This is on net-next. Jason
Re: [PATCH] crypto: x86/curve25519 - Remove unused carry variables
Hi Herbert, On Thu, Jul 23, 2020 at 9:51 AM Herbert Xu wrote: > > The carry variables are assigned but never used, which upsets > the compiler. This patch removes them. > > Signed-off-by: Herbert Xu > > diff --git a/arch/x86/crypto/curve25519-x86_64.c > b/arch/x86/crypto/curve25519-x86_64.c > index 8a17621f7d3a..8acbb6584a37 100644 > --- a/arch/x86/crypto/curve25519-x86_64.c > +++ b/arch/x86/crypto/curve25519-x86_64.c > @@ -948,10 +948,8 @@ static void store_felem(u64 *b, u64 *f) > { > u64 f30 = f[3U]; > u64 top_bit0 = f30 >> (u32)63U; > - u64 carry0; > u64 f31; > u64 top_bit; > - u64 carry; > u64 f0; > u64 f1; > u64 f2; > @@ -970,11 +968,11 @@ static void store_felem(u64 *b, u64 *f) > u64 o2; > u64 o3; > f[3U] = f30 & (u64)0x7fffU; > - carry0 = add_scalar(f, f, (u64)19U * top_bit0); > + add_scalar(f, f, (u64)19U * top_bit0); > f31 = f[3U]; > top_bit = f31 >> (u32)63U; > f[3U] = f31 & (u64)0x7fffU; > - carry = add_scalar(f, f, (u64)19U * top_bit); > + add_scalar(f, f, (u64)19U * top_bit); > f0 = f[0U]; > f1 = f[1U]; > f2 = f[2U]; > -- That seems obvious and reasonable, and so I'm inclined to ack this, but I first wanted to give Karthik (CC'd) a chance to chime in here, as it's his HACL* project that's responsible, and he might have some curious insight. Jason
Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()
On Tue, Jun 16, 2020 at 12:54 PM Joe Perches wrote: > > On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote: > > v4: > > - Break out the memzero_explicit() change as suggested by Dan Carpenter > > so that it can be backported to stable. > > - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for > > now as there can be a bit more discussion on what is best. It will be > > introduced as a separate patch later on after this one is merged. > > To this larger audience and last week without reply: > https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/ > > Are there _any_ fastpath uses of kfree or vfree? The networking stack has various places where there will be a quick kmalloc followed by a kfree for an incoming or outgoing packet. One place that comes to mind would be esp_alloc_tmp, which does a quick allocation of some temporary kmalloc memory, processes some packet things inside of that, and then frees it, sometimes in the same function, and sometimes later in an async callback. I don't know how "fastpath" you consider this, but usually packet processing is something people want to do with minimal overhead, considering how fast NICs are these days.
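For context on what the renamed function does: kfree_sensitive() (formerly kzfree()) zeroizes the buffer before freeing it, so key material cannot linger on the heap. A rough userspace sketch of the idea follows; the real kernel version uses memzero_explicit() and derives the size with ksize(), and the compiler barrier below is just one common way to keep the memset from being optimized away.

```c
#include <stdlib.h>
#include <string.h>

/* Zero memory in a way the compiler may not elide (sketch of memzero_explicit). */
static void memzero_explicit_sketch(void *s, size_t n)
{
	memset(s, 0, n);
	__asm__ __volatile__("" : : "r"(s) : "memory");
}

/* Sketch of kfree_sensitive(): scrub the buffer, then free it. */
static void kfree_sensitive_sketch(void *p, size_t n)
{
	if (!p)
		return;
	memzero_explicit_sketch(p, n);
	free(p);
}
```

The extra zeroing pass is exactly the cost being weighed in the fastpath discussion above.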
Re: [PATCH 1/2] crypto: virtio: fix src/dst scatterlist calculation
On 2020/5/25 8:56 AM, Longpeng(Mike) wrote:

The system will crash when we insmod crypto/tcrypt.ko with mode=38. Usually the next entry of one sg will be @sg@ + 1, but if this sg element is part of a chained scatterlist, it could jump to the start of a new scatterlist array. Let's fix it by using sg_next() in the calculation of the src/dst scatterlists. BTW, I also add a check of sg_nents_for_len()'s return value, since sg_nents_for_len() can fail.

Cc: Gonglei
Cc: Herbert Xu
Cc: "Michael S. Tsirkin"
Cc: Jason Wang
Cc: "David S. Miller"
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org
Reported-by: LABBE Corentin
Signed-off-by: Gonglei
Signed-off-by: Longpeng(Mike)
---
 drivers/crypto/virtio/virtio_crypto_algs.c | 14 ++
 1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c b/drivers/crypto/virtio/virtio_crypto_algs.c
index 372babb44112..2fa1129f96d6 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -359,8 +359,14 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
 	unsigned int num_out = 0, num_in = 0;
 	int sg_total;
 	uint8_t *iv;
+	struct scatterlist *sg;

 	src_nents = sg_nents_for_len(req->src, req->cryptlen);
+	if (src_nents < 0) {
+		pr_err("Invalid number of src SG.\n");
+		return src_nents;
+	}
+
 	dst_nents = sg_nents(req->dst);

 	pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: %d)\n",
@@ -446,12 +452,12 @@ __virtio_crypto_skcipher_do_req(struct virtio_crypto_sym_request *vc_sym_req,
 	vc_sym_req->iv = iv;

 	/* Source data */
-	for (i = 0; i < src_nents; i++)
-		sgs[num_out++] = &req->src[i];
+	for (sg = req->src, i = 0; sg && i < src_nents; sg = sg_next(sg), i++)

Any reason sg is checked here? I believe it should be checked in sg_nents_for_len().

+		sgs[num_out++] = sg;

 	/* Destination data */
-	for (i = 0; i < dst_nents; i++)
-		sgs[num_out + num_in++] = &req->dst[i];
+	for (sg = req->dst, i = 0; sg && i < dst_nents; sg = sg_next(sg), i++)
+		sgs[num_out + num_in++] = sg;

I believe sg should be checked in sg_nents().

Thanks

 	/* Status */
 	sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));
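The distinction the patch fixes can be illustrated with a toy model (plain userspace C, not the kernel's struct scatterlist): in a chained scatterlist, the last slot of one array is a link to the next array, so naive pointer arithmetic like &arr[i] walks into the link entry or off the end, while an sg_next()-style helper follows the chain.

```c
#include <stddef.h>

#define TOY_CHAIN 1

/* Toy scatterlist entry: either a payload entry or a link to another array. */
struct toy_sg {
	int flags;                    /* TOY_CHAIN marks a link entry */
	union {
		int len;              /* payload length for a normal entry */
		struct toy_sg *chain; /* next array for a link entry */
	};
};

/* sg_next()-style traversal: step forward, hopping over chain links. */
static struct toy_sg *toy_sg_next(struct toy_sg *sg)
{
	sg++;
	if (sg->flags & TOY_CHAIN)
		sg = sg->chain;
	return sg;
}
```

With two chained arrays, toy_sg_next() visits every payload entry, whereas indexing into the first array alone would land on the link entry and never reach the second array.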
Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10
On Tue, May 5, 2020 at 5:22 PM Jason A. Donenfeld wrote: > > On Tue, May 5, 2020 at 5:19 PM Kees Cook wrote: > > > > On Tue, May 05, 2020 at 04:37:38PM -0600, Jason A. Donenfeld wrote: > > > On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor > > > wrote: > > > > I believe these issues are one in the same. I did a reverse bisect with > > > > Arnd's test case and converged on George's first patch: > > > > > > > > https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a > > > > > > > > I think that in lieu of this patch, we should have that patch and its > > > > follow-up fix merged into 10.0.1. > > > > > > If this is fixed in 10.0.1, do we even need to patch the kernel at > > > all? Or can we just leave it be, considering most organizations using > > > clang know what they're getting into? I'd personally prefer the > > > latter, so that we don't clutter things. > > > > I agree: I'd rather this was fixed in 10.0.1 (but if we do want a > > kernel-side work-around for 10.0.0, I would suggest doing the version > > check in the Kconfig for FORTIFY_SOURCE instead of in the Makefile, > > as that's where these things are supposed to live these days). > > Indeed this belongs in the Makefile. I can send a patch adjusting er, Kconfig, is what I meant to type.
Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10
On Tue, May 5, 2020 at 5:19 PM Kees Cook wrote: > > On Tue, May 05, 2020 at 04:37:38PM -0600, Jason A. Donenfeld wrote: > > On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor > > wrote: > > > I believe these issues are one in the same. I did a reverse bisect with > > > Arnd's test case and converged on George's first patch: > > > > > > https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a > > > > > > I think that in lieu of this patch, we should have that patch and its > > > follow-up fix merged into 10.0.1. > > > > If this is fixed in 10.0.1, do we even need to patch the kernel at > > all? Or can we just leave it be, considering most organizations using > > clang know what they're getting into? I'd personally prefer the > > latter, so that we don't clutter things. > > I agree: I'd rather this was fixed in 10.0.1 (but if we do want a > kernel-side work-around for 10.0.0, I would suggest doing the version > check in the Kconfig for FORTIFY_SOURCE instead of in the Makefile, > as that's where these things are supposed to live these days). Indeed this belongs in the Makefile. I can send a patch adjusting that, if you want, but I think I'd rather do nothing and have a fix be rolled out in 10.0.1. Clang users should know what to expect in that regard. > (Though as was mentioned, it's likely that FORTIFY_SOURCE isn't working > _at all_ under Clang, so I may still send a patch to depend on !clang > just to avoid surprises until it's fixed, but I haven't had time to > chase down a solution yet.) That might be the most coherent thing to do, at least so that people don't get some false sense of mitigation. Jason
Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10
On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor wrote: > I believe these issues are one in the same. I did a reverse bisect with > Arnd's test case and converged on George's first patch: > > https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a > > I think that in lieu of this patch, we should have that patch and its > follow-up fix merged into 10.0.1. If this is fixed in 10.0.1, do we even need to patch the kernel at all? Or can we just leave it be, considering most organizations using clang know what they're getting into? I'd personally prefer the latter, so that we don't clutter things.
[PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10
clang-10 has a broken optimization stage that doesn't enable the compiler to prove at compile time that certain memcpys are within bounds, and thus the outline memcpy is always called, resulting in horrific performance, and in some cases, excessive stack frame growth. Here's a simple reproducer:

typedef unsigned long size_t;
void *c(void *dest, const void *src, size_t n) __asm__("memcpy");
extern inline __attribute__((gnu_inline)) void *memcpy(void *dest, const void *src, size_t n)
{
	return c(dest, src, n);
}
void blah(char *a)
{
	unsigned long long b[10], c[10];
	int i;

	memcpy(b, a, sizeof(b));
	for (i = 0; i < 10; ++i)
		c[i] = b[i] ^ b[9 - i];
	for (i = 0; i < 10; ++i)
		b[i] = c[i] ^ a[i];
	memcpy(a, b, sizeof(b));
}

Compile this with clang-9 and clang-10 and observe:

zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-10 -Wframe-larger-than=0 -O3 -c b.c -o c10.o
b.c:5:6: warning: stack frame size of 104 bytes in function 'blah' [-Wframe-larger-than=]
void blah(char *a)
     ^
1 warning generated.
zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-9 -Wframe-larger-than=0 -O3 -c b.c -o c9.o

Looking at the disassembly of c10.o and c9.o, one can see that c9.o is properly optimized in the obvious way one would expect, while c10.o has blown up and includes extern calls to memcpy. This is present on the most trivial bits of code. Thus, for clang-10, we just set __NO_FORTIFY globally, so that this issue won't be incurred.

Cc: Arnd Bergmann
Cc: LKML
Cc: clang-built-linux
Cc: Kees Cook
Cc: George Burgess
Cc: Nick Desaulniers
Link: https://bugs.llvm.org/show_bug.cgi?id=45802
Signed-off-by: Jason A. Donenfeld
---
 Makefile | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/Makefile b/Makefile
index 49b2709ff44e..f022f077591d 100644
--- a/Makefile
+++ b/Makefile
@@ -768,6 +768,13 @@ KBUILD_CFLAGS += -Wno-gnu
 # source of a reference will be _MergedGlobals and not on of the whitelisted names.
 # See modpost pattern 2
 KBUILD_CFLAGS += -mno-global-merge
+
+# clang-10 has a broken optimization stage that causes memcpy to always be
+# outline, resulting in excessive stack frame growth and poor performance.
+ifeq ($(shell test $(CONFIG_CLANG_VERSION) -ge 10 && test $(CONFIG_CLANG_VERSION) -lt 11; echo $$?),0)
+KBUILD_CFLAGS += -D__NO_FORTIFY
+endif
+
 else

 # These warnings generated too much noise in a regular build.
--
2.26.2
Re: [PATCH] crypto: curve25519-hacl64 - Disable fortify-source for clang-10
On Tue, May 5, 2020 at 3:48 PM Nick Desaulniers wrote: > > + Kees, George, who have started looking into this, too. I have a smaller reproducer and analysis I'll send out very shortly.
Re: [PATCH] crypto: curve25519-hacl64 - Disable fortify-source for clang-10
As discussed on IRC, this issue here isn't specific to this file, but rather fortify source has some serious issues on clang-10, everywhere in the kernel, and we should probably disable it globally for this clang version. I'll follow up with some more info in a different patch.
Re: [PATCH] net: wireguard: avoid unused variable warning
On Tue, May 5, 2020 at 8:13 AM Arnd Bergmann wrote: > > clang points out a harmless use of uninitialized variables that > get passed into a local function but are ignored there: > > In file included from drivers/net/wireguard/ratelimiter.c:223: > drivers/net/wireguard/selftest/ratelimiter.c:173:34: error: variable 'skb6' > is uninitialized when used here [-Werror,-Wuninitialized] > ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count); >^~~~ > drivers/net/wireguard/selftest/ratelimiter.c:123:29: note: initialize the > variable 'skb6' to silence this warning > struct sk_buff *skb4, *skb6; >^ > = NULL > drivers/net/wireguard/selftest/ratelimiter.c:173:40: error: variable 'hdr6' > is uninitialized when used here [-Werror,-Wuninitialized] > ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count); > ^~~~ > drivers/net/wireguard/selftest/ratelimiter.c:125:22: note: initialize the > variable 'hdr6' to silence this warning > struct ipv6hdr *hdr6; > ^ Seems like the code is a bit easier to read and is more uniform looking by just initializing those two variables to NULL, like the warning suggests. If you don't mind, I'll queue something up in my tree to this effect. By the way, I'm having a bit of a hard time reproducing the warning with either clang-10 or clang-9. Just for my own curiosity, would you mind sending the .config that results in this? Jason
Re: [PATCH 0/7] sha1 library cleanup
Thanks for this series. I like the general idea. I think it might make sense, though, to separate things out into sha1.h and sha256.h. That will be nice preparation work for when we eventually move obsolete primitives into some subdirectory.
Re: [PATCH v2] crypto: lib/sha256 - return void
On Fri, May 01, 2020 at 09:42:29AM -0700, Eric Biggers wrote: > From: Eric Biggers > > The SHA-256 / SHA-224 library functions can't fail, so remove the > useless return value. Looks good to me, and it also matches the signatures of the blake2s library functions, which return void. Reviewed-by: Jason A. Donenfeld
Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks
Hi Herbert, This v3 patchset has a Reviewed-by from Ard for 1/2 and from Eric for 2/2, from last week. Could you submit this to Linus for rc4? Thanks, Jason
Re: [PATCH v4 25/35] crypto: BLAKE2s - x86_64 SIMD implementation
On Wed, Oct 23, 2019 at 6:55 AM Eric Biggers wrote: > There are no comments in this 685-line assembly language file. > Is this the original version, or is it a generated/stripped version? It looks like Ard forgot to import the latest one from Zinc, which is significantly shorter and has other improvements too: https://git.zx2c4.com/WireGuard/tree/src/crypto/zinc/blake2s/blake2s-x86_64.S
Re: [PATCH v3 00/29] crypto: crypto API library interfaces for WireGuard
Hi Ard, Just to keep track of it in public, here are the things that we're deferring from my original series for after this one is (if it is) merged: - Zinc's generic C implementation of poly1305, which is faster and has separate implementations for u64 and u128. Should be uncontroversial, but it's a code replacement, so not appropriate for this series here. - x86_64 ChaCha20 from Zinc, which will spark interesting discussions with Martin and might prove to be a bit controversial. - x86_64 Poly1305 from Zinc, which I think Martin will be on board with, but is also a code replacement. - The big_keys patch, showing the simplification the function call interface offers. - WireGuard - things are quasi-ready, so when the time comes, I look forward to submitting that to Dave's net tree as a single boring [PATCH 1/1]. Did I leave anything out? Jason
Re: [PATCH v3 24/29] crypto: lib/curve25519 - work around Clang stack spilling issue
Hi Ard, On Mon, Oct 7, 2019 at 6:46 PM Ard Biesheuvel wrote: > Arnd reports that the 32-bit generic library code for Curve25119 ends > up using an excessive amount of stack space when built with Clang: > > lib/crypto/curve25519-fiat32.c:756:6: error: stack frame size > of 1384 bytes in function 'curve25519_generic' > [-Werror,-Wframe-larger-than=] > > Let's give some hints to the compiler regarding which routines should > not be inlined, to prevent it from running out of registers and spilling > to the stack. The resulting code performs identically under both GCC > and Clang, and makes the warning go away. Are you *sure* about that? Couldn't we fix clang instead? I'd rather fixes go there instead of gimping this. The reason is that I noticed before that this code, performance-wise, was very inlining sensitive. Can you benchmark this on ARM32-noneon and on MIPS32? If there's a performance difference there, then maybe you can defer this part of the series until after the rest lands, and then we'll discuss at length various strategies? Alternatively, if you benchmark those and it also makes no difference, then it indeed makes no difference. Jason
Re: [PATCH v3 21/29] crypto: BLAKE2s - generic C library implementation and selftest
On Thu, Oct 10, 2019 at 11:02:32PM -0700, Eric Biggers wrote: > FYI, I had left a few review comments on Jason's last version of this patch > (https://lkml.kernel.org/linux-crypto/20190326173759.GA607@zzz.localdomain/), > some of which Jason addressed in the Wireguard repository > (https://git.zx2c4.com/WireGuard) but they didn't make it into this patch. > I'd suggest taking a look at the version there. Indeed I hadn't updated the Zinc patchset since then, but you can see the changes since ~March here: https://git.zx2c4.com/WireGuard/log/src/crypto There are actually quite a few interesting Blake changes.
Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function
On Fri, Oct 4, 2019 at 5:36 PM Ard Biesheuvel wrote: > > Just checking for Cortex-A7 being the boot CPU is probably > > sufficient, that takes care of the common case of all the > > A7-only embedded chips that people definitely are going to care > > about for a long time. > > > > But do you agree that disabling kernel mode NEON altogether for these > systems is probably more sensible than testing for CPU part IDs in an > arbitrary crypto driver? No. That NEON code is _still_ faster than the generic C code. But it is not as fast as the scalar code. There might be another primitive that has a fast NEON implementation but does not have a fast scalar implementation. The choice there would be between fast NEON and slow generic. In that case, we want fast NEON. Also, different algorithms lend themselves to different implementation strategies. Leave this up to the chacha code, as Zinc does it, since this is the place that has the most information to decide what it should be running.
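The "leave this up to the chacha code" argument can be sketched as follows (hypothetical names; this is a userspace stand-in for the real glue code, which keys off read_cpuid_part() and elf_hwcap): the algorithm's own glue code picks an implementation once, at init time, using everything it knows about the CPU, rather than a global NEON on/off switch.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical feature flags, normally read from the hardware at boot. */
static bool cpu_has_neon;
static bool cpu_is_little_core; /* e.g. Cortex-A5/A7: scalar beats NEON */

static void chacha_scalar(char *out) { strcpy(out, "scalar"); }
static void chacha_neon(char *out)   { strcpy(out, "neon"); }

/* Per-algorithm dispatch pointer, defaulting to the portable code. */
static void (*chacha_impl)(char *out) = chacha_scalar;

/* Run once at module init: the glue code, not the caller, decides. */
static void chacha_select_impl(void)
{
	if (cpu_has_neon && !cpu_is_little_core)
		chacha_impl = chacha_neon;
	else
		chacha_impl = chacha_scalar;
}
```

A different primitive with no fast scalar path would make the opposite choice in its own select function, which is exactly why the decision doesn't belong in a global NEON policy.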
Re: [PATCH v2 05/20] crypto: mips/chacha - import accelerated 32r2 code from Zinc
On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel wrote: > The round count is passed via the fifth function parameter, so it is > already on the stack. Reloading it for every block doesn't sound like > a huge deal to me. Please benchmark it to demonstrate that, if it really isn't a big deal. I recall finding that memory accesses on common mips32r2 commodity router hardware were extremely inefficient. The whole thing is designed to minimize memory accesses, which are the primary bottleneck on that platform. Seems like this thing might be best deferred until after this all lands. IOW, let's get this in with the 20-round original now, and later you can submit a change for the 12-round variant, and René and I can spend time dusting off our test rigs and seeing which strategy works best. I very nearly tossed out a bunch of old router hardware last night when cleaning up. Glad I saved it!
Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard
On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski wrote: > I think it might be better to allow two different modules to export the same > symbol but only allow one of them to be loaded. Or use static calls. Static calls perform well and are well understood. This would be my preference.
Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function
On Fri, Oct 4, 2019 at 4:23 PM Ard Biesheuvel wrote: > How is it relevant whether the boot CPU is A5 or A7? These are bL > little cores that only implement NEON for feature parity with their > big counterparts, but CPU intensive tasks are scheduled on big cores, > where NEON performance is much better than scalar. Yea big-little might confuse things indeed. Though the performance difference between the NEON code and the scalar code is not that huge, and I suspect that big-little machines might benefit from unconditionally using the scalar code, given that sometimes they might wind up doing things on the little cores. Eric - what did you guys wind up doing on Android with the fast scalar implementation?
Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function
On Fri, Oct 4, 2019 at 4:23 PM Ard Biesheuvel wrote: > > Did these changes make it into the existing tree? > > I'd like to keep Eric's code, but if it is really that much faster, we > might drop it in arch/arm/lib so it supersedes the builtin code that > /dev/random uses as well. That was the idea with Zinc. For things like ARM and MIPS, the optimized scalar code is really quite fast, and is worth using all the time and not compiling the generic code at all.
Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel wrote: > This is a followup to RFC 'crypto: wireguard with crypto API library > interface' > [0]. Since no objections were raised to my approach, I've proceeded to fix up > some minor issues, and incorporate [most of] the missing MIPS code. Could you get the mips64 code into the next version? WireGuard is widely used on the Ubiquiti EdgeRouter, which runs on the Octeon, a 64-bit MIPS chip.
Re: [PATCH v2 20/20] crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine
On Wed, Oct 02, 2019 at 04:17:13PM +0200, Ard Biesheuvel wrote: > Reimplement the library routines to perform chacha20poly1305 en/decryption > on scatterlists, without [ab]using the [deprecated] blkcipher interface, > which is rather heavyweight and does things we don't really need. > > Instead, we use the sg_miter API in a novel and clever way, to iterate > over the scatterlist in-place (i.e., source == destination, which is the > only way this library is expected to be used). That way, we don't have to > iterate over two scatterlists in parallel. Nice idea. Probably this will result in a real speedup, as I suspect those extra prior kmaps weren't free. Looking forward to benching it.
Re: [PATCH v2 18/20] crypto: arm/Curve25519 - wire up NEON implementation
On Wed, Oct 02, 2019 at 04:17:11PM +0200, Ard Biesheuvel wrote: > +bool curve25519_arch(u8 out[CURVE25519_KEY_SIZE], > + const u8 scalar[CURVE25519_KEY_SIZE], > + const u8 point[CURVE25519_KEY_SIZE]) > +{ > + if (!have_neon || !crypto_simd_usable()) > + return false; > + kernel_neon_begin(); > + curve25519_neon(out, scalar, point); > + kernel_neon_end(); > + return true; > +} > +EXPORT_SYMBOL(curve25519_arch); This now looks more like the Zinc way of doing things, with the _arch function returning true or false based on whether or not the implementation was available, allowing the generic implementation to run. I thought, from looking at the chacha commits earlier, you had done away with that style of things in favor of weak symbols, whereby the arch implementation would fall back to the generic one, instead of the other way around?
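The "Zinc way" being referred to can be sketched like this (hypothetical names; a userspace stand-in for the kernel pattern): the arch entry point returns false when it cannot or should not run, and the library wrapper falls back to the generic code, the opposite direction from a weak-symbol override.

```c
#include <stdbool.h>

static bool have_accel;                 /* hypothetical; set from CPU features */
static int accel_calls, generic_calls;  /* counters, for illustration only */

/* Arch version: reports whether it actually handled the request. */
static bool op_arch(int *out, int in)
{
	if (!have_accel)
		return false;
	accel_calls++;
	*out = in * 2;
	return true;
}

static void op_generic(int *out, int in)
{
	generic_calls++;
	*out = in * 2;
}

/* Library entry point: try the arch code first, then fall back. */
static void op(int *out, int in)
{
	if (!op_arch(out, in))
		op_generic(out, in);
}
```

The key property is that the generic implementation must always be present, since the arch code may decline at runtime (e.g. when SIMD is unusable).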
Re: [PATCH v2 14/20] crypto: Curve25519 - generic C library implementations and selftest
On Wed, Oct 02, 2019 at 04:17:07PM +0200, Ard Biesheuvel wrote: >- replace .c #includes with Kconfig based object selection Cool! > +config CRYPTO_ARCH_HAVE_LIB_CURVE25519 > + tristate > + > +config CRYPTO_ARCH_HAVE_LIB_CURVE25519_BASE > + bool > + > +config CRYPTO_LIB_CURVE25519 > + tristate "Curve25519 scalar multiplication library" > + depends on CRYPTO_ARCH_HAVE_LIB_CURVE25519 || > !CRYPTO_ARCH_HAVE_LIB_CURVE25519 a || !a ==> true Did you mean for one of these to be _BASE? Or is this a Kconfig trick of a different variety that's intentional? > +libcurve25519-y := curve25519-fiat32.o > +libcurve25519-$(CONFIG_ARCH_SUPPORTS_INT128) := curve25519-hacl64.o Nice idea.
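For reference, and presumably what is intended here: `FOO || !FOO` is a deliberate Kconfig idiom for tristate symbols, not a tautology. In Kconfig's three-valued logic (n=0, m=1, y=2, with !x = 2-x), the expression evaluates to y when FOO is y or n, but only to m when FOO=m, so the depending symbol is capped at m whenever FOO is modular. A minimal sketch:

```kconfig
config FOO
	tristate

config USES_FOO
	tristate "uses FOO"
	# y when FOO=y or FOO=n, but capped at m when FOO=m:
	depends on FOO || !FOO
```

This prevents USES_FOO from being built in while an implementation it links against is only available as a module.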
Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel wrote: > Expose the accelerated NEON ChaCha routine directly as a symbol > export so that users of the ChaCha library can use it directly. Eric had some nice code for ChaCha for certain ARM cores that lived in Zinc as chacha20-unrolled-arm.S. This code became active for certain cores where NEON was bad and for cores with no NEON. The condition for it was:

switch (read_cpuid_part()) {
case ARM_CPU_PART_CORTEX_A7:
case ARM_CPU_PART_CORTEX_A5:
	/* The Cortex-A7 and Cortex-A5 do not perform well with the NEON
	 * implementation but do incredibly with the scalar one and use
	 * less power.
	 */
	break;
default:
	chacha20_use_neon = elf_hwcap & HWCAP_NEON;
}

...

for (;;) {
	if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
	    len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context)) {
		const size_t bytes = min_t(size_t, len, PAGE_SIZE);

		chacha20_neon(dst, src, bytes, ctx->key, ctx->counter);
		ctx->counter[0] += (bytes + 63) / 64;
		len -= bytes;
		if (!len)
			break;
		dst += bytes;
		src += bytes;
		simd_relax(simd_context);
	} else {
		chacha20_arm(dst, src, len, ctx->key, ctx->counter);
		ctx->counter[0] += (len + 63) / 64;
		break;
	}
}

It's another instance in which the generic code was totally optimized out of Zinc builds. Did these changes make it into the existing tree?
Re: [PATCH v2 10/20] crypto: mips/poly1305 - import accelerated 32r2 code from Zinc
On Wed, Oct 02, 2019 at 04:17:03PM +0200, Ard Biesheuvel wrote: > This integrates the accelerated MIPS 32r2 implementation of Poly1305 > into both the API and library interfaces of the kernel crypto stack. > > The significance of this is that, in addition to becoming available > as an accelerated library implementation, it can also be used by > existing crypto API code such as IPsec, using chacha20poly1305. Same thing here: can you please split the assembly into two commits so I can see what you changed from the original?
Re: [PATCH v2 05/20] crypto: mips/chacha - import accelerated 32r2 code from Zinc
On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote: > This integrates the accelerated MIPS 32r2 implementation of ChaCha > into both the API and library interfaces of the kernel crypto stack. > > The significance of this is that, in addition to becoming available > as an accelerated library implementation, it can also be used by > existing crypto API code such as Adiantum (for block encryption on > ultra low performance cores) or IPsec using chacha20poly1305. These > are use cases that have already opted into using the abstract crypto > API. In order to support Adiantum, the core assembler routine has > been adapted to take the round count as a function argument rather > than hardcoding it to 20. Could you resubmit this with first my original commit and then with your changes on top? I'd like to see and be able to review exactly what's changed. If I recall correctly, René and I were really starved for registers and tried pretty hard to avoid spilling to the stack, so I'm interested to learn how you crammed a bit more sauce in there. I also wonder if maybe it'd be better to just leave this as is with 20 rounds, which it was previously optimized for, and just not do accelerated Adiantum for MIPS. Android has long since given up on the ISA entirely.
Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard
On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote: > On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel wrote: > > > ... > > > > In the future, I would like to extend these interfaces to use static calls, > > so that the accelerated implementations can be [un]plugged at runtime. For > > the time being, we rely on weak aliases and conditional exports so that the > > users of the library interfaces link directly to the accelerated versions, > > but without the ability to unplug them. > > > > As it turns out, we don't actually need static calls for this. > Instead, we can simply permit weak symbol references to go unresolved > between modules (as we already do in the kernel itself, due to the > fact that ELF permits it), and have the accelerated code live in > separate modules that may not be loadable on certain systems, or be > blacklisted by the user. You're saying that at module insertion time, the kernel will override weak symbols with those provided by the module itself? At runtime? Do you know offhand how this patching works? Is there a PLT that gets patched, and so the calls all go through a layer of function pointer indirection? Or are all call sites fixed up at insertion time and the call instructions rewritten with some runtime patching magic? Unless the method is the latter, I would assume that static calls are much faster in general? Or the approach already in this series is perhaps fine enough, and we don't need to break this apart into individual modules complicating everything?
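The weak-reference mechanism Ard is describing can be sketched in plain C (hypothetical names; gcc/clang on ELF, and only the "unresolved" half of the story, since module-load-time relocation can't be shown in a standalone program): a weak reference that is never satisfied by a strong definition resolves to address zero, so the caller can test it and fall back.

```c
/* Weak reference: may be satisfied by a strong definition in another
 * object file (e.g. an arch-specific module); here it is left unresolved,
 * in which case its address is 0 at runtime. */
extern int chacha_accel(int x) __attribute__((weak));

static int chacha_generic(int x)
{
	return x + 1;
}

/* Call the accelerated version if the symbol resolved, else fall back. */
static int chacha(int x)
{
	return chacha_accel ? chacha_accel(x) : chacha_generic(x);
}
```

Note the NULL test happens on every call, which is part of why Jason's question about the cost relative to static calls is a fair one.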
Re: [PATCH v2 02/20] crypto: x86/chacha - expose SIMD ChaCha routine as library function
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote: > Wire the existing x86 SIMD ChaCha code into the new ChaCha library > interface. > > Signed-off-by: Ard Biesheuvel > --- > arch/x86/crypto/chacha_glue.c | 36 > crypto/Kconfig| 1 + > include/crypto/chacha.h | 6 > 3 files changed, 43 insertions(+) > > diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c > index bc62daa8dafd..fd9ef42842cf 100644 > --- a/arch/x86/crypto/chacha_glue.c > +++ b/arch/x86/crypto/chacha_glue.c > @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 > *src, > } > } > > +void hchacha_block(const u32 *state, u32 *stream, int nrounds) > +{ > + state = PTR_ALIGN(state, CHACHA_STATE_ALIGN); > + > + if (!crypto_simd_usable()) { > + hchacha_block_generic(state, stream, nrounds); > + } else { > + kernel_fpu_begin(); > + hchacha_block_ssse3(state, stream, nrounds); > + kernel_fpu_end(); > + } > +} > +EXPORT_SYMBOL(hchacha_block); Please correct me if I'm wrong: The approach here is slightly different from Zinc. In Zinc, I had one entry point that conditionally called into the architecture-specific implementation, and I did it inline using #includes so that in some cases it could be optimized out. Here, you override the original symbol defined by the generic module from the architecture-specific implementation, and in there you decide which way to branch. Your approach has the advantage that you don't need to #include a .c file like I did, an ugly yet very effective approach. But it has two disadvantages: 1. For architecture-specific code that _always_ runs, such as the MIPS32r2 implementation of chacha, the compiler no longer has an opportunity to remove the generic code entirely from the binary, which under Zinc resulted in a smaller module. 2. The inliner can't make optimizations for that call. Disadvantage (2) might not make much of a difference. Disadvantage (1) seems like a bigger deal. 
However, perhaps the linker is smart and can remove the code and symbol? Or if not, is there a way to make the linker smart? Or would all this require crazy LTO which isn't going to happen any time soon?
Re: [PATCH v2 01/20] crypto: chacha - move existing library code into lib/crypto
On Wed, Oct 02, 2019 at 04:16:54PM +0200, Ard Biesheuvel wrote: > > chacha_permute(x, nrounds); Interested in porting my single-statement unrolled implementation from Zinc to this? I did see performance improvements on various platforms.
Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard
Hi Ard, On Wed, Oct 02, 2019 at 04:16:53PM +0200, Ard Biesheuvel wrote: > This is a followup to RFC 'crypto: wireguard with crypto API library > interface' > [0]. Since no objections were raised to my approach, I've proceeded to fix up > some minor issues, and incorporate [most of] the missing MIPS code. > > Changes since RFC/v1: > - dropped the WireGuard patch itself, and the followup patches - since the > purpose was to illustrate the extent of the required changes, there is no > reason to keep including them. > - import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the > crypto API and library interfaces so that not only WireGuard but also IPsec > and Adiantum can benefit immediately. (The latter required adding support > for > the reduced round version of ChaCha to the MIPS asm code) > - fix up various minor kconfig/build issues found in randconfig testing > (thanks Arnd!) Thanks for working on this. By wiring up the accelerated primitives, you're essentially implementing Zinc, and I expect that by the end of this, we'll wind up with something pretty close to where I started, with the critical difference that the directory names are different. Despite my initial email about WireGuard moving to the crypto API, it sounds like in the end it is likely to stay with Zinc, just without that name. Jason
Re: [RFC PATCH 11/20] crypto: BLAKE2s - x86_64 implementation
Hi Sebastian, Thomas, Take a look at the below snippet from this patch. I had previously put quite some effort into the simd_get, simd_put, simd_relax mechanism, so that the simd state could be persisted during both several calls to the same function and within long loops like below, with simd_relax existing to reenable preemption briefly if things were getting out of hand. Ard got rid of this and has moved the kernel_fpu_begin and kernel_fpu_end calls into the inner loop: On Sun, Sep 29, 2019 at 7:39 PM Ard Biesheuvel wrote: > + for (;;) { > + const size_t blocks = min_t(size_t, nblocks, > + PAGE_SIZE / BLAKE2S_BLOCK_SIZE); > + > + kernel_fpu_begin(); > + if (IS_ENABLED(CONFIG_AS_AVX512) && blake2s_use_avx512) > + blake2s_compress_avx512(state, block, blocks, inc); > + else > + blake2s_compress_avx(state, block, blocks, inc); > + kernel_fpu_end(); > + > + nblocks -= blocks; > + if (!nblocks) > + break; > + block += blocks * BLAKE2S_BLOCK_SIZE; > + } > + return true; > +} I'm wondering if on modern kernels this is actually fine and whether my simd_get/put/relax thing no longer has a good use case. Specifically, I recall last year there were a lot of patches and discussions about doing FPU register restoration lazily -- on context switch or the like. Did those land? Did the theory of action work out in the end? Regards, Jason
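The structural concern here can be illustrated with a skeleton (userspace stand-ins for kernel_fpu_begin()/kernel_fpu_end(), hypothetical names): the work is chunked so that no single begin/end region covers more than a bounded amount of data, which is what keeps preemption latency in check whether the relaxing is done by an inner loop like Ard's or by an explicit simd_relax().

```c
#include <stddef.h>

#define CHUNK 4096 /* stand-in for PAGE_SIZE */

static size_t max_region; /* largest amount processed in one FPU region */
static size_t total_done;

static void fpu_begin(void) { } /* stand-in for kernel_fpu_begin() */
static void fpu_end(void)   { } /* stand-in for kernel_fpu_end() */

static void compress_blocks(size_t n) { total_done += n; }

/* Process len bytes, never holding the FPU for more than CHUNK at a time. */
static void process(size_t len)
{
	while (len) {
		size_t n = len < CHUNK ? len : CHUNK;

		fpu_begin();
		compress_blocks(n);
		fpu_end(); /* preemption point between chunks */

		if (n > max_region)
			max_region = n;
		len -= n;
	}
}
```

The open question in the message above is whether repeated begin/end pairs are now cheap enough (with lazy FPU state restoration) that the persist-across-calls simd_get/put machinery no longer pays for itself.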
Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
Hey Andy, Thanks for weighing in. > inlining. I'd be surprised for chacha20. If you really want inlining > to dictate the overall design, I think you need some real numbers for > why it's necessary. There also needs to be a clear story for how > exactly making everything inline plays with the actual decision of > which implementation to use. Take a look at my description for the MIPS case: when on MIPS, the arch code is *always* used since it's just straight up scalar assembly. In this case, the chacha20_arch function *never* returns false [1], which means it's always included [2], so the generic implementation gets optimized out, saving disk and memory, which I assume MIPS people care about. [1] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13 [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118 I'm fine with considering this a form of "premature optimization", though, and ditching the motivation there. On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski wrote: > My suggestion from way back, which is at > least a good deal of the way toward being doable, is to do static > calls. This means that the common code will call out to the arch code > via a regular CALL instruction and will *not* inline the arch code. > This means that the arch code could live in its own module, it can be > selected at boot time, etc. Alright, let's do static calls, then, to deal with the case of going from the entry point implementation in lib/zinc (or lib/crypto, if you want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto. And then within each arch, we can keep it simple, since everything is already in the same directory. Sound good? Jason
Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
In the MIPS case, the advantage is clear. Here's how Zinc handles it: [3]. Some simple ifdefs and includes. Easy and straightforward. Some people might object, though, and it sounds like you might. So let's talk about some alternative mechanisms, with their pros and cons:

- The Zinc method: straightforward, but not everybody likes ifdefs.
- Stick the filename to be included into a Kconfig variable and let the configuration system do the logic: also straightforward. Not sure it's kosher, but it would work.
- Weak symbols: we don't get inlining or dead code elimination.
- Function pointers: ditto, plus Spectre.
- Other ideas you might have? I'm open to suggestions here.

[2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54 [3] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19 > What I *don't* want is to merge WireGuard with its own library based > crypto now, and extend that later for async accelerators once people > realize that we really do need that as well. I wouldn't worry so much about that. Zinc/library-based-crypto is just trying to fulfill the boring synchronous pure-code part of crypto implementations. For the async stuff, we can work together on improving the existing crypto API to be more appealing, in tandem with some interesting research into packet queuing systems. From the other thread, you might have seen that Toke already has cool ideas that I hope we can all sit down and talk about. I'm certainly not interested in "bolting" anything on to Zinc/library-based-crypto, and I'm happy to work to improve the asynchronous API for doing asynchronous crypto. Jason
chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]
[CC +willy, toke, dave, netdev] Hi Pascal, On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote: > Actually, that assumption is factually wrong. I don't know if anything > is *publicly* available, but I can assure you the silicon is running in > labs already. And something will be publicly available early next year > at the latest. Which could nicely coincide with having Wireguard support > in the kernel (which I would also like to see happen BTW) ... > > Not "at some point". It will. Very soon. Maybe not in consumer or server > CPUs, but definitely in the embedded (networking) space. > And it *will* be much faster than the embedded CPU next to it, so it will > be worth using it for something like bulk packet encryption. Super! I was wondering if you could speak a bit more about the interface. My biggest questions surround latency. Will it be synchronous or asynchronous? If the latter, why? What will its latencies be? How deep will its buffers be? The reason I ask is that a lot of crypto acceleration hardware of the past has been fast but has had very deep buffers, at great expense of latency. In the networking context, keeping latency low is pretty important. WireGuard is already multi-threaded, which isn't always great for latency (improvements are a work in progress). If you're involved with the design of the hardware, perhaps this is something you can help ensure winds up working well? For example, AES-NI is straightforward and good, but Intel can do that because they are the CPU. It sounds like your silicon will be adjacent. How do you envision this working in a low-latency environment? Jason
Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen wrote: > Actually, that assumption is factually wrong. I don't know if anything > is *publicly* available, but I can assure you the silicon is running in > labs already. Great to hear, and thanks for the information. I'll follow up with some questions on this in another thread.
Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API
preferences, and we can see what that looks like. Less appealing would be to do several iterations of you reworking Zinc from scratch and going through the exercises all over again, but if you prefer that I guess I could cope. Alternatively, maybe this is a lot to chew on, and we should just throw caution into the wind, implement cryptoapi.c for WireGuard (as described at the top), and add C functions to the crypto API sometime later? This is what I had envisioned in [1]. And for the avoidance of doubt, or in case any of the above message belied something different, I really am happy and relieved to have an opportunity to work on this _with you_, and I am much more open than before to compromise and finding practical solutions to the past political issues. Also, if you’re into chat, we can always spec some of the nitty-gritty aspects out over IRC or even the old-fashioned telephone. Thanks again for pushing this forward. Regards, Jason [1] https://lore.kernel.org/wireguard/cahmme9pmfzap5zd9bdlfc2fwuhtzzcjyzc2attptynffmed...@mail.gmail.com/ [2] https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54
Re: [PATCH v2] net: ipv4: move tcp_fastopen server side code to SipHash library
On 6/14/19 11:34 AM, Eric Dumazet wrote: > On Fri, Jun 14, 2019 at 7:01 AM Ard Biesheuvel > wrote: >> >> Using a bare block cipher in non-crypto code is almost always a bad idea, >> not only for security reasons (and we've seen some examples of this in >> the kernel in the past), but also for performance reasons. >> >> In the TCP fastopen case, we call into the bare AES block cipher one or >> two times (depending on whether the connection is IPv4 or IPv6). On most >> systems, this results in a call chain such as >> >> crypto_cipher_encrypt_one(ctx, dst, src) >> crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...); >> aesni_encrypt >> kernel_fpu_begin(); >> aesni_enc(ctx, dst, src); // asm routine >> kernel_fpu_end(); >> >> It is highly unlikely that the use of special AES instructions has a >> benefit in this case, especially since we are doing the above twice >> for IPv6 connections, instead of using a transform which can process >> the entire input in one go. >> >> We could switch to the cbcmac(aes) shash, which would at least get >> rid of the duplicated overhead in *some* cases (i.e., today, only >> arm64 has an accelerated implementation of cbcmac(aes), while x86 will >> end up using the generic cbcmac template wrapping the AES-NI cipher, >> which basically ends up doing exactly the above). However, in the given >> context, it makes more sense to use a light-weight MAC algorithm that >> is more suitable for the purpose at hand, such as SipHash. >> >> Since the output size of SipHash already matches our chosen value for >> TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input >> sizes, this greatly simplifies the code as well. >> >> Signed-off-by: Ard Biesheuvel > > While the patch looks fine (I yet have to run our tests with it), it > might cause some deployment issues > for server farms. > > They usually share a common fastopen key, so that clients can reuse > the same token for different sessions. 
> > Changing some servers in the pool will lead to inconsistencies. Are the inconsistencies coming from kernel version skew, with some servers on the old hash and some on the new one? Or is there another source of inconsistency you are referring to? Thanks, -Jason > > Probably not a too big deal, but worth mentioning. >
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Fri, Nov 23, 2018 at 04:02:42PM +0800, Kenneth Lee wrote: > It is already part of Jean's patchset. And that's why I built my solution on > VFIO in the first place. But I think the concept of SVA and PASID is not > compatible with the original VFIO concept space. You would not share your > whole > address space to a device at all in a virtual machine manager, > wouldn't you? Why not? That seems to fit VFIO's space just fine to me.. You might need a new upcall to create a full MM registration, but that doesn't seem unsuited. Part of the point here is that you should try to make sensible revisions to existing subsystems before just inventing a new thing... VFIO is deeply connected to the IOMMU, so enabling a more general IOMMU-based approach seems perfectly fine to me.. > > Once the VFIO driver knows about this as a generic capability then the > > device it exposes to userspace would use CPU addresses instead of DMA > > addresses. > > > > The question is if your driver needs much more than the device > > agnostic generic services VFIO provides. > > > > I'm not sure what you have in mind with resource management.. It is > > hard to revoke resources from userspace, unless you are doing > > kernel syscalls, but then why do all this? > > Say, I have 1024 queues in my accelerator. I can get one by opening the device > and attach it with the fd. If the process exit by any means, the queue can be > returned with the release of the fd. But if it is mdev, it will still be there > and some one should tell the allocator it is available again. This is not easy > to design in user space. ?? Why wouldn't the mdev track the queues assigned using the existing open/close/ioctl callbacks?
That is the basic flow I would expect:

open(/dev/vfio)
ioctl(unity map entire process MM to mdev with IOMMU)

// Create a HW queue and link the PASID in the HW to this HW queue
struct hw queue[..];
ioctl(create HW queue)

// Get BAR doorbell memory for the queue
bar = mmap()

// Submit work to the queue using CPU addresses
queue[0] = ...
writel(bar [..], &queue);

// Queue, SVA, etc is cleaned up when the VFIO closes
close()

Presumably the kernel has to handle the PASID and related for security reasons, so they shouldn't go to userspace? If there is something missing in VFIO to do this, it looks pretty small to me.. Jason
Re: [PATCH net-next] cxgb4: use new fw interface to get the VIN and smt index
On Thu, Nov 22, 2018 at 10:53:49AM -0800, David Miller wrote: > From: Jason Gunthorpe > Date: Wed, 21 Nov 2018 19:46:24 -0700 > > > On Wed, Nov 21, 2018 at 01:40:24PM +0530, Ganesh Goudar wrote: > >> From: Santosh Rastapur > >> > >> If the fw supports returning VIN/VIVLD in FW_VI_CMD save it > >> in port_info structure else retrieve these from viid and save > >> them in port_info structure. Do the same for smt_idx from > >> FW_VI_MAC_CMD > >> > >> Signed-off-by: Santosh Rastapur > >> Signed-off-by: Ganesh Goudar > >> drivers/crypto/chelsio/chtls/chtls_cm.c | 3 +- > >> drivers/infiniband/hw/cxgb4/cm.c| 6 +-- > >> drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 12 - > >> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 58 > >> - > >> drivers/net/ethernet/chelsio/cxgb4/l2t.c| 13 +++--- > >> drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 46 ++-- > >> drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h | 20 + > >> drivers/scsi/cxgbi/cxgb4i/cxgb4i.c | 3 +- > >> drivers/target/iscsi/cxgbit/cxgbit_cm.c | 8 ++-- > >> 9 files changed, 114 insertions(+), 55 deletions(-) > > > > Applied to for-next, but please try to write better commit messages in > > future, explain what benifit your change is bringing. > > The subject line indicates this is targetting my net-next tree, therefore > why did you apply it to your's? It is my mistake, it ended up in RDMA patchworks next to other RDMA chelsio patches, and contained an IB component.. It is dropped from rdma.git now. Thanks, Jason
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Wed, Nov 21, 2018 at 02:08:05PM +0800, Kenneth Lee wrote: > > But considering Jean's SVA stuff seems based on mmu notifiers, I have > > a hard time believing that it has any different behavior from RDMA's > > ODP, and if it does have different behavior, then it is probably just > > a bug in the ODP implementation. > > As Jean has explained, his solution is based on page table sharing. I think > ODP > should also consider this new feature. Shared page tables would require the HW to walk the page table format of the CPU directly, not sure how that would be possible for ODP? Presumably the implementation for ARM relies on the IOMMU hardware doing this? > > > > If all your driver needs is to mmap some PCI bar space, route > > > > interrupts and do DMA mapping then mediated VFIO is probably a good > > > > choice. > > > > > > Yes. That is what is done in our RFCv1/v2. But we accepted Jerome's > > > opinion and > > > try not to add complexity to the mm subsystem. > > > > Why would a mediated VFIO driver touch the mm subsystem? Sounds like > > you don't have a VFIO driver if it needs to do stuff like that... > > VFIO has no ODP-like solution, and if we want to solve the fork problem, we > have > to make some change to iommu and the fork procedure. Further, VFIO takes every > queue as a independent device. This create a lot of trouble on resource > management. For example, you will need a manager process to withdraw the > unused > device and you need to let the user process know about PASID of the queue, and > so on. Well, I would think you'd add SVA support to the VFIO driver as a generic capability - it seems pretty useful for any VFIO user as it avoids all the kernel upcalls to do memory pinning and DMA address translation. Once the VFIO driver knows about this as a generic capability then the device it exposes to userspace would use CPU addresses instead of DMA addresses. 
The question is if your driver needs much more than the device agnostic generic services VFIO provides. I'm not sure what you have in mind with resource management.. It is hard to revoke resources from userspace, unless you are doing kernel syscalls, but then why do all this? Jason
Re: [PATCH net-next] cxgb4: use new fw interface to get the VIN and smt index
On Wed, Nov 21, 2018 at 01:40:24PM +0530, Ganesh Goudar wrote: > From: Santosh Rastapur > > If the fw supports returning VIN/VIVLD in FW_VI_CMD save it > in port_info structure else retrieve these from viid and save > them in port_info structure. Do the same for smt_idx from > FW_VI_MAC_CMD > > Signed-off-by: Santosh Rastapur > Signed-off-by: Ganesh Goudar

> drivers/crypto/chelsio/chtls/chtls_cm.c | 3 +-
> drivers/infiniband/hw/cxgb4/cm.c | 6 +--
> drivers/net/ethernet/chelsio/cxgb4/cxgb4.h | 12 -
> drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 58 -
> drivers/net/ethernet/chelsio/cxgb4/l2t.c | 13 +++---
> drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 46 ++--
> drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h | 20 +
> drivers/scsi/cxgbi/cxgb4i/cxgb4i.c | 3 +-
> drivers/target/iscsi/cxgbit/cxgbit_cm.c | 8 ++--
> 9 files changed, 114 insertions(+), 55 deletions(-)

Applied to for-next, but please try to write better commit messages in the future, explaining what benefit your change brings. Jason
Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements
Hi Martin, On Tue, Nov 20, 2018 at 5:29 PM Martin Willi wrote: > Thanks for the offer, no need at this time. But I certainly would > welcome if you could do some (Wireguard) benching with that code to see > if it works for you. I certainly will test it in a few different network circumstances, especially since real testing like this is sometimes more telling than busy-loop benchmarks. > > Actually, similarly here, a 10nm Cannon Lake machine should be > > arriving at my house this week, which should make for some > > interesting testing ground for non-throttled zmm, if you'd like to > > play with it. > > Maybe in a future iteration, thanks. In fact would it be interesting to > know if Cannon Lake can handle that throttling better. Everything I've read on the Internet seems to indicate that's the case, so one of the first things I'll be doing is seeing if that's true. There are also the AVX512 IFMA instructions to play with! Jason
Re: [RFC PATCH v2 0/4] Exporting existing crypto API code through zinc
On Tue, Nov 20, 2018 at 3:19 PM Herbert Xu wrote: > Yes. In fact it's used for FIPS certification testing. > Sure, nobody sane should be doing it. But when it comes to > government certification... :) The kernel does not aim toward any FIPS certification, and we're not going to start bloating our designs to fulfill this. It's never been a goal. Maybe ask Ted to add a FIPS mode to random.c and see what happens... When you start arguing "because FIPS!" as your justification, you really hit a head scratcher. > They've already paid for the indirect > function call so why make them go through yet another run-time > branch? The indirect function call in the crypto API is the performance hit. The branch in Zinc is not, as the predictor does the correct thing every single time. In performance measurements, I'm not able to distinguish between the branch being present and being commented out. Give me a few more days to finish v9's latest required changes for chacha12, and then I'll submit a revision that I think should address the remaining technical objections raised over the last several months we've been discussing this. Regards, Jason
Re: [RFC PATCH v2 0/4] Exporting existing crypto API code through zinc
Hi Herbert, On Tue, Nov 20, 2018 at 7:02 AM Herbert Xu wrote: > Here is an updated version demonstrating how we could access the > accelerated versions of chacha20. It also includes a final patch > to deposit the new zinc version of x86-64 chacha20 into > arch/x86/crypto where it can be used by both the crypto API as well > as zinc. N'ack. As I mentioned in the last email, this really isn't okay, and mostly defeats the purpose of Zinc in the first place, adding complexity in a discombobulated half-baked way. Your proposal seems to be the worst of all worlds. Having talked to lots of people about the design of Zinc at this point to very positive reception, I'm going to move forward with submitting v9, once I rebase with the required changes for Eric's latest merge. Jason
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Tue, Nov 20, 2018 at 11:07:02AM +0800, Kenneth Lee wrote: > On Mon, Nov 19, 2018 at 11:49:54AM -0700, Jason Gunthorpe wrote: > > Date: Mon, 19 Nov 2018 11:49:54 -0700 > > From: Jason Gunthorpe > > To: Kenneth Lee > > CC: Leon Romanovsky , Kenneth Lee , > > Tim Sell , linux-...@vger.kernel.org, Alexander > > Shishkin , Zaibo Xu > > , zhangfei@foxmail.com, linux...@huawei.com, > > haojian.zhu...@linaro.org, Christoph Lameter , Hao Fang > > , Gavin Schenk , RDMA mailing > > list , Zhou Wang , > > Doug Ledford , Uwe Kleine-König > > , David Kershner > > , Johan Hovold , Cyrille > > Pitchen , Sagar Dharia > > , Jens Axboe , > > guodong...@linaro.org, linux-netdev , Randy Dunlap > > , linux-ker...@vger.kernel.org, Vinod Koul > > , linux-crypto@vger.kernel.org, Philippe Ombredanne > > , Sanyog Kale , "David S. > > Miller" , linux-accelerat...@lists.ozlabs.org > > Subject: Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce > > User-Agent: Mutt/1.9.4 (2018-02-28) > > Message-ID: <20181119184954.gb4...@ziepe.ca> > > > > On Mon, Nov 19, 2018 at 05:14:05PM +0800, Kenneth Lee wrote: > > > > > If the hardware cannot share page table with the CPU, we then need to have > > > some way to change the device page table. This is what happen in ODP. It > > > invalidates the page table in device upon mmu_notifier call back. But > > > this cannot > > > solve the COW problem: if the user process A share a page P with device, > > > and A > > > forks a new process B, and it continue to write to the page. By COW, the > > > process B will keep the page P, while A will get a new page P'. But you > > > have > > > no way to let the device know it should use P' rather than P. > > > > Is this true? I thought mmu_notifiers covered all these cases. > > > > The mm_notifier for A should fire if B causes the physical address of > > A's pages to change via COW. > > > > And this causes the device page tables to re-synchronize. > > I don't see such code. 
The current do_cow_fault() implementation has nothing to > do with mm_notifier. Well, that sure sounds like it would be a bug in mmu_notifiers.. But considering Jean's SVA stuff seems based on mmu notifiers, I have a hard time believing that it has any different behavior from RDMA's ODP, and if it does have different behavior, then it is probably just a bug in the ODP implementation. > > > In WarpDrive/uacce, we make this simple. If you support IOMMU and it > > > support > > > SVM/SVA. Everything will be fine just like ODP implicit mode. And you > > > don't need > > > to write any code for that. Because it has been done by IOMMU framework. > > > If it > > > > Looks like the IOMMU code uses mmu_notifier, so it is identical to > > IB's ODP. The only difference is that IB tends to have the IOMMU page > > table in the device, not in the CPU. > > > > The only case I know if that is different is the new-fangled CAPI > > stuff where the IOMMU can directly use the CPU's page table and the > > IOMMU page table (in device or CPU) is eliminated. > > Yes. We are not focusing on the current implementation. As mentioned in the > cover letter. We are expecting Jean Philips' SVA patch: > git://linux-arm.org/linux-jpb. This SVA stuff does not look comparable to CAPI as it still requires maintaining separate IOMMU page tables. Also, those patches from Jean have a lot of references to mmu_notifiers (i.e. look at iommu_mmu_notifier). Are you really sure it is actually any different at all? > > Anyhow, I don't think a single instance of hardware should justify an > > entire new subsystem. Subsystems are hard to make and without multiple > > hardware examples there is no way to expect that it would cover any > > future use cases. > > Yes. That's our first expectation. We can keep it with our driver. But because > there is no user driver support for any accelerator in mainline kernel. Even > the > well known QuickAssit has to be maintained out of tree. 
So we try to see if > people is interested in working together to solve the problem. Well, you should come with patches ack'ed by these other groups. > > If all your driver needs is to mmap some PCI bar space, route > interrupts and do DMA mapping then mediated VFIO is probably a good > choice. > Yes. That is what is done in our RFCv1/v2. But we accepted Jerome's opinion > and > try not to add complexity to the mm subsystem. Why would a mediated VFIO driver touch the mm subsystem? Sounds like you don't have a VFIO driver if it needs to do stuff like that...
Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code
Hi Herbert, On Tue, Nov 20, 2018 at 4:06 AM Herbert Xu wrote: > > I'd still prefer to see the conversion patches included. Skipping them > > would be > > kicking the can down the road and avoiding issues that will need to be > > addressed > > anyway. Like you, I don't want a "half-baked concoction that will be maybe > > possibly be replaced 'later'" :-) > > Are you guys talking about the conversion patches to eliminate the > two copies of chacha code that would otherwise exist in crypto as > well as in zinc? > > If so I think that's not negotiable. Having two copies of the same > code is just not acceptable. Yes, that's the topic. Eric already expressed his preference to keep them, and I agreed, so the plan is to keep them. Jason
Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code
Hi Eric, On Tue, Nov 20, 2018 at 12:23 AM Eric Biggers wrote: > It's much better to have the documentation in a permanent location. Agreed. > I actually did add ChaCha12 support to most of the Zinc assembly in > "[WIP] crypto: assembly support for ChaCha12" > (https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/commit/?h=adiantum-zinc&id=0a7787a515a977e11b680f1752b430ca1744e399). > But I skipped AVX-512 and MIPS since I didn't have a way to test those yet, > and I haven't ported the changes to your new perl scripts yet. Oh, great. If you're playing with this in the context of the WireGuard repo, you can run `ARCH=mips make test-qemu -j$(nproc)` (which is what's running on https://www.wireguard.com/build-status/ ). For AVX-512 testing, I've got a box I'm happy to give you access to, if you want to send an SSH key and your employer allows for that kind of thing etc. I can have a stab at all of this too if you like. Jason
Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code
Hi Eric, On Mon, Nov 19, 2018 at 11:54 PM Eric Biggers wrote: > Will v9 include a documentation file for Zinc in Documentation/crypto/? > That's been suggested several times. I had started writing that there, but then thought that the requested information could go in the commit message instead. But I'm guessing you're asking again now because you poked into the repo and didn't find the Documentation/, so presumably you still want it. I can reorganize the presentation of that to be more suitable for Documentation/, and I'll have that for v9. > I'd still prefer to see the conversion patches included. Skipping them would > be > kicking the can down the road and avoiding issues that will need to be > addressed > anyway. Like you, I don't want a "half-baked concoction that will be maybe > possibly be replaced 'later'" :-) Okay, fair enough. Will do. > Either way though, it would make things much easier if you at least named the > files, structures, constants, etc. "ChaCha" rather than "ChaCha20" from the > start where appropriate. For an example, see the commit "crypto: chacha - > prepare for supporting non-20-round variants" on my "adiantum-zinc" branch: > https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/commit/?h=adiantum-zinc&id=754af8d7d39f31238114426e39786c84d7cc0f98 > Then the actual introduction of the 12-round variant is much less noisy. That's a good idea. I'll do it like that. I'll likely order it as what we have now (renamed to omit the 20), and then put the 12 stuff on top of that, so it's easier to see what's changed in the process. I noticed that in that branch you didn't port the assembly to support fewer rounds. Shall I follow suit, and then expect patches from you later doing that? Or were you expecting me to also port the architecture implementations to chacha12 as well? Regards, Jason
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Mon, Nov 19, 2018 at 04:33:20PM -0500, Jerome Glisse wrote: > On Mon, Nov 19, 2018 at 02:26:38PM -0700, Jason Gunthorpe wrote: > > On Mon, Nov 19, 2018 at 03:26:15PM -0500, Jerome Glisse wrote: > > > On Mon, Nov 19, 2018 at 01:11:56PM -0700, Jason Gunthorpe wrote: > > > > On Mon, Nov 19, 2018 at 02:46:32PM -0500, Jerome Glisse wrote: > > > > > > > > > > ?? How can O_DIRECT be fine but RDMA not? They use exactly the same > > > > > > get_user_pages flow, right? Can we do what O_DIRECT does in RDMA and > > > > > > be fine too? > > > > > > > > > > > > AFAIK the only difference is the length of the race window. You'd > > > > > > have > > > > > > to fork and fault during the shorter time O_DIRECT has > > > > > > get_user_pages > > > > > > open. > > > > > > > > > > Well in O_DIRECT case there is only one page table, the CPU > > > > > page table and it gets updated during fork() so there is an > > > > > ordering there and the race window is small. > > > > > > > > Not really, in O_DIRECT case there is another 'page table', we just > > > > call it a DMA scatter/gather list and it is sent directly to the block > > > > device's DMA HW. The sgl plays exactly the same role as the various HW > > > > page list data structures that underly RDMA MRs. > > > > > > > > It is not a page table that matters here, it is if the DMA address of > > > > the page is active for DMA on HW. > > > > > > > > Like you say, the only difference is that the race is hopefully small > > > > with O_DIRECT (though that is not really small, NVMeof for instance > > > > has windows as large as connection timeouts, if you try hard enough) > > > > > > > > So we probably can trigger this trouble with O_DIRECT and fork(), and > > > > I would call it a bug :( > > > > > > I can not think of any scenario that would be a bug with O_DIRECT. > > > Do you have one in mind ? 
When you fork() and do other syscall that > > > affect the memory of your process in another thread you should > > > expect non consistant results. Kernel is not here to provide a fully > > > safe environement to user, user can shoot itself in the foot and > > > that's fine as long as it only affect the process itself and no one > > > else. We should not be in the business of making everything baby > > > proof :) > > > > Sure, I setup AIO with O_DIRECT and launch a read. > > > > Then I fork and dirty the READ target memory using the CPU in the > > child. > > > > As you described in this case the fork will retain the physical page > > that is undergoing O_DIRECT DMA, and the parent gets a new copy'd page. > > > > The DMA completes, and the child gets the DMA'd to page. The parent > > gets an unchanged copy'd page. > > > > The parent gets the AIO completion, but can't see the data. > > > > I'd call that a bug with O_DIRECT. The only correct outcome is that > > the parent will always see the O_DIRECT data. Fork should not cause > > the *parent* to malfunction. I agree the child cannot make any > > prediction what memory it will see. > > > > I assume the same flow is possible using threads and read().. > > > > It is really no different than the RDMA bug with fork. > > > > Yes and that's expected behavior :) If you fork() and have anything > still in flight at time of fork that can change your process address > space (including data in it) then all bets are of. > > At least this is my reading of fork() syscall. Not mine.. I can't think of anything else that would have this behavior. All traditional syscalls, will properly dirty the pages of the parent. ie if I call read() in a thread and do fork in another thread, then not seeing the data after read() completes is clearly a bug. All other syscalls are the same. It is bonkers that opening the file with O_DIRECT would change this basic behavior. I'm calling it a bug :) Jason
Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce
On Mon, Nov 19, 2018 at 03:26:15PM -0500, Jerome Glisse wrote: > On Mon, Nov 19, 2018 at 01:11:56PM -0700, Jason Gunthorpe wrote: > > On Mon, Nov 19, 2018 at 02:46:32PM -0500, Jerome Glisse wrote: > > > > > > ?? How can O_DIRECT be fine but RDMA not? They use exactly the same > > > > get_user_pages flow, right? Can we do what O_DIRECT does in RDMA and > > > > be fine too? > > > > > > > > AFAIK the only difference is the length of the race window. You'd have > > > > to fork and fault during the shorter time O_DIRECT has get_user_pages > > > > open. > > > > > > Well in O_DIRECT case there is only one page table, the CPU > > > page table and it gets updated during fork() so there is an > > > ordering there and the race window is small. > > > > Not really, in O_DIRECT case there is another 'page table', we just > > call it a DMA scatter/gather list and it is sent directly to the block > > device's DMA HW. The sgl plays exactly the same role as the various HW > > page list data structures that underly RDMA MRs. > > > > It is not a page table that matters here, it is if the DMA address of > > the page is active for DMA on HW. > > > > Like you say, the only difference is that the race is hopefully small > > with O_DIRECT (though that is not really small, NVMeof for instance > > has windows as large as connection timeouts, if you try hard enough) > > > > So we probably can trigger this trouble with O_DIRECT and fork(), and > > I would call it a bug :( > > I can not think of any scenario that would be a bug with O_DIRECT. > Do you have one in mind ? When you fork() and do other syscall that > affect the memory of your process in another thread you should > expect non consistant results. Kernel is not here to provide a fully > safe environement to user, user can shoot itself in the foot and > that's fine as long as it only affect the process itself and no one > else. 
We should not be in the business of making everything baby > proof :) Sure, I setup AIO with O_DIRECT and launch a read. Then I fork and dirty the READ target memory using the CPU in the child. As you described in this case the fork will retain the physical page that is undergoing O_DIRECT DMA, and the parent gets a new copy'd page. The DMA completes, and the child gets the DMA'd to page. The parent gets an unchanged copy'd page. The parent gets the AIO completion, but can't see the data. I'd call that a bug with O_DIRECT. The only correct outcome is that the parent will always see the O_DIRECT data. Fork should not cause the *parent* to malfunction. I agree the child cannot make any prediction what memory it will see. I assume the same flow is possible using threads and read().. It is really no different than the RDMA bug with fork. Jason