Re: [PATCH] crypto: virtio/akcipher - Fix stack overflow on memcpy

2024-01-31 Thread Jason Wang
On Tue, Jan 30, 2024 at 7:28 PM zhenwei pi  wrote:
>
> sizeof(struct virtio_crypto_akcipher_session_para) is less than
> sizeof(struct virtio_crypto_op_ctrl_req::u), so copying more bytes from
> the stack variable leads to a stack overflow. Clang reports this issue
> with the following commands:
> make -j CC=clang-14 mrproper >/dev/null 2>&1
> make -j O=/tmp/crypto-build CC=clang-14 allmodconfig >/dev/null 2>&1
> make -j O=/tmp/crypto-build W=1 CC=clang-14 drivers/crypto/virtio/
>   virtio_crypto_akcipher_algs.o
>
> Fixes: 59ca6c93387d ("virtio-crypto: implement RSA algorithm")
> Link: https://lore.kernel.org/all/0a194a79-e3a3-45e7-be98-83abd3e1c...@roeck-us.net/
> Signed-off-by: zhenwei pi 

Acked-by: Jason Wang 

Thanks
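
A minimal user-space sketch of the bug class fixed above, with hypothetical
struct sizes standing in for the real virtio-crypto definitions:

#include <stdio.h>
#include <string.h>

struct session_para {		/* hypothetical stand-in, 8 bytes */
	unsigned int algo;
	unsigned int keytype;
};

union op_ctrl_u {		/* stand-in for the request union, 56 bytes */
	struct session_para para;
	unsigned char reserved[56];
};

int main(void)
{
	struct session_para para = { .algo = 1, .keytype = 2 };
	union op_ctrl_u u;

	/* BUG (the pattern the patch removes): sizing the copy by the
	 * destination union reads 48 bytes past the 8-byte stack variable:
	 *
	 *	memcpy(&u, &para, sizeof(u));
	 *
	 * FIX: bound the copy by the smaller source object instead. */
	memcpy(&u, &para, sizeof(para));

	printf("para: %zu bytes, union: %zu bytes\n",
	       sizeof(para), sizeof(u));
	return 0;
}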




Re: [PATCH] crypto: arm/curve25519 - Move '.fpu' after '.arch'

2021-04-09 Thread Jason A. Donenfeld
Seems reasonable to me.

Acked-by: Jason A. Donenfeld 


Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-09 Thread Jason A. Donenfeld
On Fri, Apr 9, 2021 at 6:47 AM Simo Sorce  wrote:
> >   depends on m || !CRYPTO_FIPS
> >
> > but I am a bit concerned that the rather intricate kconfig
> > dependencies between the generic and arch-optimized versions of those
> > drivers get complicated even further.
>
> Actually this is the opposite direction we are planning to go for
> future fips certifications.
>
> Due to requirements about crypto module naming and versioning in the
> new FIPS-140-3 standard we are planning to always build all the CRYPTO
> as bultin (and maybe even forbid loading additional crypto modules in
> FIPS mode). This is clearly just a vendor choice and has no bearing on
> what upstream ultimately will do, but just throwing it here as a data
> point.

I'm wondering: do you intend to apply similar patches to all the other
uses of "non-FIPS-certified" crypto in the kernel? I've already
brought up big_key.c, for example. Also if you're intent on adding
this check to WireGuard, because it tunnels packets without using
FIPS-certified crypto primitives, do you also plan on adding this
check to other network tunnels that don't tunnel packets using
FIPS-certified crypto primitives? For example, GRE, VXLAN, GENEVE? I'd
be inclined to take this patch more seriously if it was exhaustive and
coherent for your use case. The targeted hit on WireGuard seems
incoherent as a standalone patch, making it hard to even evaluate. So
I think either you should send an exhaustive patch series that forbids
all use of non-FIPS crypto anywhere in the kernel (another example:
net/core/secure_seq.c) in addition to all tunneling modules that don't
use FIPS-certified crypto, or figure out how to disable the lib/crypto
primitives that you want to be disabled in "fips mode". With a
coherent patchset for either of these, we can then evaluate it.

Jason


Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-09 Thread Jason A. Donenfeld
On Fri, Apr 9, 2021 at 2:08 AM Hangbin Liu  wrote:
> After offline discussion with Herbert, here is
> what he said:
>
> """
> This is not a problem in RHEL8 because the Crypto API RNG replaces /dev/random
> in FIPS mode.
> """

So far as I can see, this isn't the case in the kernel sources I'm
reading? Maybe you're doing some userspace hack with CUSE? But at
least get_random_bytes doesn't behave this way...


Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-08 Thread Jason A. Donenfeld
On Fri, Apr 09, 2021 at 10:49:07AM +0800, Hangbin Liu wrote:
> On Thu, Apr 08, 2021 at 08:44:35PM -0600, Jason A. Donenfeld wrote:
> > Since it's just a normal module library, you can simply do this in the
> > module_init function, rather than deep within registration
> > abstractions.
> 
> I gave it a try, but it looks like it's not that simple. Not sure if it's
> because wireguard calls the library directly. I need to check more...

Something like the below should work...

diff --git a/arch/arm/crypto/chacha-glue.c b/arch/arm/crypto/chacha-glue.c
index a408f4bcfd62..47212f9421c1 100644
--- a/arch/arm/crypto/chacha-glue.c
+++ b/arch/arm/crypto/chacha-glue.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 #include 
 #include 
@@ -297,6 +298,9 @@ static int __init chacha_simd_mod_init(void)
 {
int err = 0;

+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
if (IS_REACHABLE(CONFIG_CRYPTO_BLKCIPHER)) {
err = crypto_register_skciphers(arm_algs, ARRAY_SIZE(arm_algs));
if (err)
diff --git a/arch/arm/crypto/curve25519-glue.c b/arch/arm/crypto/curve25519-glue.c
index 31eb75b6002f..d03f810fdaf3 100644
--- a/arch/arm/crypto/curve25519-glue.c
+++ b/arch/arm/crypto/curve25519-glue.c
@@ -14,6 +14,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>
 #include 
 #include 
 #include 
@@ -114,6 +115,9 @@ static struct kpp_alg curve25519_alg = {

 static int __init mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
if (elf_hwcap & HWCAP_NEON) {
static_branch_enable(&have_neon);
return IS_REACHABLE(CONFIG_CRYPTO_KPP) ?
diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
index 3023c1acfa19..30d6c6de7a27 100644
--- a/arch/arm/crypto/poly1305-glue.c
+++ b/arch/arm/crypto/poly1305-glue.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 void poly1305_init_arm(void *state, const u8 *key);
 void poly1305_blocks_arm(void *state, const u8 *src, u32 len, u32 hibit);
@@ -240,6 +241,9 @@ static struct shash_alg arm_poly1305_algs[] = {{

 static int __init arm_poly1305_mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) &&
(elf_hwcap & HWCAP_NEON))
static_branch_enable(&have_neon);
diff --git a/arch/arm64/crypto/chacha-neon-glue.c b/arch/arm64/crypto/chacha-neon-glue.c
index 1d9824c4ae43..1696993326b5 100644
--- a/arch/arm64/crypto/chacha-neon-glue.c
+++ b/arch/arm64/crypto/chacha-neon-glue.c
@@ -26,6 +26,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 #include 
 #include 
@@ -214,6 +215,9 @@ static struct skcipher_alg algs[] = {

 static int __init chacha_simd_mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
if (!cpu_have_named_feature(ASIMD))
return 0;

diff --git a/arch/arm64/crypto/poly1305-glue.c b/arch/arm64/crypto/poly1305-glue.c
index f33ada70c4ed..ac257a52be4d 100644
--- a/arch/arm64/crypto/poly1305-glue.c
+++ b/arch/arm64/crypto/poly1305-glue.c
@@ -17,6 +17,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 asmlinkage void poly1305_init_arm64(void *state, const u8 *key);
asmlinkage void poly1305_blocks(void *state, const u8 *src, u32 len, u32 hibit);
@@ -208,6 +209,9 @@ static struct shash_alg neon_poly1305_alg = {

 static int __init neon_poly1305_mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
if (!cpu_have_named_feature(ASIMD))
return 0;

diff --git a/arch/mips/crypto/chacha-glue.c b/arch/mips/crypto/chacha-glue.c
index 90896029d0cd..31f8294f2a31 100644
--- a/arch/mips/crypto/chacha-glue.c
+++ b/arch/mips/crypto/chacha-glue.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 asmlinkage void chacha_crypt_arch(u32 *state, u8 *dst, const u8 *src,
  unsigned int bytes, int nrounds);
@@ -128,6 +129,9 @@ static struct skcipher_alg algs[] = {

 static int __init chacha_simd_mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
return IS_REACHABLE(CONFIG_CRYPTO_BLKCIPHER) ?
crypto_register_skciphers(algs, ARRAY_SIZE(algs)) : 0;
 }
diff --git a/arch/mips/crypto/poly1305-glue.c b/arch/mips/crypto/poly1305-glue.c
index fc881b46d911..f5edec10cef8 100644
--- a/arch/mips/crypto/poly1305-glue.c
+++ b/arch/mips/crypto/poly1305-glue.c
@@ -12,6 +12,7 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 asmlinkage void poly1305_init_mips(void *state, const u8 *key);
asmlinkage void poly1305_blocks_mips(void *state, const u8 *src, u32 len, u32 hibit);
@@ -173,6 +174,9 @@ static struct shash_alg mips_poly1305_alg = {

 static int __init mips_poly1305_mod_init(void)
 {
+   if (fips_enabled)
+   return -EOPNOTSUPP;
+
return IS_REACHABLE(CONFIG_CRYPTO_HASH) ?
 

Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-08 Thread Jason A. Donenfeld
Hi Hangbin,

On Thu, Apr 8, 2021 at 8:41 PM Hangbin Liu  wrote:
> I agree that the best way is to disable the crypto modules in FIPS mode.
> But the code in lib/crypto doesn't look the same as crypto/. For modules
> in crypto/, there is an alg_test() that checks whether the crypto is
> FIPS-allowed at registration time.
>
> - crypto_register_alg()
>   - crypto_wait_for_test()
> - crypto_probing_notify(CRYPTO_MSG_ALG_REGISTER, larval->adult)
>   - cryptomgr_schedule_test()
> - cryptomgr_test()
>   - alg_test()
>
> But in lib/crypto the code is more like a library. We can call it anytime
> and there is no registration. Maybe we should add a similar check in
> lib/crypto. But I'm not familiar with the crypto code... Not sure if anyone
> on linux-crypto@ would like to help do that.

Since it's just a normal module library, you can simply do this in the
module_init function, rather than deep within registration
abstractions.

> > diff --git a/lib/crypto/curve25519.c b/lib/crypto/curve25519.c
> > index 288a62cd29b2..b794f49c291a 100644
> > --- a/lib/crypto/curve25519.c
> > +++ b/lib/crypto/curve25519.c
> > @@ -12,11 +12,15 @@
> >  #include 
> >  #include 
> >  #include 
> > +#include <linux/fips.h>
> >
> >  bool curve25519_selftest(void);
> >
> >  static int __init mod_init(void)
> >  {
> > + if (!fips_enabled)
> > + return -EOPNOTSUPP;
>
> Question here, why it is !fips_enabled? Shouldn't we return error when
> fips_enabled?

Er, just not thinking straight today. `if (fips_enabled)` is probably
what you want indeed.

Jason
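
A compact user-space model of the contrast Hangbin draws above: crypto/
algorithms pass through a registration choke point where a FIPS check can
live (the real gate is alg_test() in crypto/testmgr.c), while lib/crypto
functions are called directly, leaving module_init as the only comparable
gate. All names below are illustrative, not actual kernel symbols:

#include <stdbool.h>
#include <stdio.h>

static bool fips_enabled = true;

struct alg { const char *name; bool fips_allowed; };

/* crypto/ model: every algorithm passes through registration, which
 * provides a single choke point for a FIPS-allowed check. */
static int register_alg(const struct alg *a)
{
	if (fips_enabled && !a->fips_allowed)
		return -1;	/* rejected at registration time */
	printf("%s registered\n", a->name);
	return 0;
}

/* lib/crypto model: plain exported functions with no registration step,
 * so the only comparable choke point is the module's init function. */
static int lib_mod_init(void)
{
	if (fips_enabled)
		return -1;	/* refuse to load at all, as suggested above */
	return 0;
}

int main(void)
{
	struct alg chacha = { "chacha20", false };

	printf("register: %d, lib init: %d\n",
	       register_alg(&chacha), lib_mod_init());
	return 0;
}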


Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-08 Thread Jason A. Donenfeld
On Thu, Apr 8, 2021 at 7:55 AM Simo Sorce  wrote:
> > I'm not sure this makes so much sense to do _in wireguard_. If you
> > feel like the FIPS-allergic part is actually blake, 25519, chacha, and
> > poly1305, then wouldn't it make most sense to disable _those_ modules
> > instead? And then the various things that rely on those (such as
> > wireguard, but maybe there are other things too, like
> > security/keys/big_key.c) would be naturally disabled transitively?
>
> The reason why it is better to disable the whole module is that it
> provide much better feedback to users. Letting init go through and then
> just fail operations once someone tries to use it is much harder to
> deal with in terms of figuring out what went wrong.
> Also wireguard seem to be poking directly into the algorithms
> implementations and not use the crypto API, so disabling individual
> algorithm via the regular fips_enabled mechanism at runtime doesn't
> work.

What I'm suggesting _would_ work in basically the exact same way as
this patch. Namely, something like:

diff --git a/lib/crypto/curve25519.c b/lib/crypto/curve25519.c
index 288a62cd29b2..b794f49c291a 100644
--- a/lib/crypto/curve25519.c
+++ b/lib/crypto/curve25519.c
@@ -12,11 +12,15 @@
 #include 
 #include 
 #include 
+#include <linux/fips.h>

 bool curve25519_selftest(void);

 static int __init mod_init(void)
 {
+ if (!fips_enabled)
+ return -EOPNOTSUPP;
+
  if (!IS_ENABLED(CONFIG_CRYPTO_MANAGER_DISABLE_TESTS) &&
  WARN_ON(!curve25519_selftest()))
  return -ENODEV;

Making the various lib/crypto/* modules return EOPNOTSUPP will in turn
mean that wireguard will refuse to load, due to !fips_enabled. It has
the positive effect that all other things that use it will also be
EOPNOTSUPP.

For example, what are you doing about big_key.c? Right now, I assume
nothing. But this way, you'd get all of the various effects for free.
Are you going to continuously audit all uses of non-FIPS crypto and
add `if (!fips_enabled)` to every new use case, always, everywhere,
from now into the boundless future? By adding `if (!fips_enabled)` to
wireguard, that's what you're signing yourself up for. Instead, by
restricting the lib/crypto/* modules to !fips_enabled, you can get all
of those transitive effects without having to do anything additional.

Alternatively, I agree with Eric - why not just consider this outside
your boundary?

Jason


Re: [PATCH net-next] [RESEND] wireguard: disable in FIPS mode

2021-04-07 Thread Jason A. Donenfeld
Hi Hangbin,

On Wed, Apr 7, 2021 at 5:39 AM Hangbin Liu  wrote:
>
> As the cryptos (BLAKE2S, Curve25519, CHACHA20POLY1305) in WireGuard are not
> FIPS certified, the WireGuard module should be disabled in FIPS mode.

I'm not sure this makes so much sense to do _in wireguard_. If you
feel like the FIPS-allergic part is actually blake, 25519, chacha, and
poly1305, then wouldn't it make most sense to disable _those_ modules
instead? And then the various things that rely on those (such as
wireguard, but maybe there are other things too, like
security/keys/big_key.c) would be naturally disabled transitively?

[As an aside, I don't think any of this fips-flag-in-the-kernel makes
much sense at all for anything, but that seems like a different
discussion, maybe?]

Jason


Re: [PATCH v2] crypto: lib/chacha20poly1305 - define empty module exit function

2021-01-15 Thread Jason A. Donenfeld
On Sat, Jan 16, 2021 at 1:30 AM John Donnelly
 wrote:
>
>
>
> > On Jan 15, 2021, at 1:30 PM, Jason A. Donenfeld  wrote:
> >
> > With no mod_exit function, users are unable to unload the module after
> > use. I'm not aware of any reason why module unloading should be
> > prohibited for this one, so this commit simply adds an empty exit
> > function.
> >
> > Reported-and-tested-by: John Donnelly 
> > Acked-by: Ard Biesheuvel 
> > Signed-off-by: Jason A. Donenfeld 
>
> Thanks!
>
> Would someone be kind enough to remind me when this appears, and I will
> apply it to our product? We like to use published commits when possible.
>
> JD

It'll show up in one of these two repos in a week or two:
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6.git/
https://git.kernel.org/pub/scm/linux/kernel/git/herbert/cryptodev-2.6.git/


[PATCH v2] crypto: lib/chacha20poly1305 - define empty module exit function

2021-01-15 Thread Jason A. Donenfeld
With no mod_exit function, users are unable to unload the module after
use. I'm not aware of any reason why module unloading should be
prohibited for this one, so this commit simply adds an empty exit
function.

Reported-and-tested-by: John Donnelly 
Acked-by: Ard Biesheuvel 
Signed-off-by: Jason A. Donenfeld 
---
v1->v2:
- Fix typo in commit message.

 lib/crypto/chacha20poly1305.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c
index 5850f3b87359..c2fcdb98cc02 100644
--- a/lib/crypto/chacha20poly1305.c
+++ b/lib/crypto/chacha20poly1305.c
@@ -362,7 +362,12 @@ static int __init mod_init(void)
return 0;
 }
 
+static void __exit mod_exit(void)
+{
+}
+
 module_init(mod_init);
+module_exit(mod_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_DESCRIPTION("ChaCha20Poly1305 AEAD construction");
 MODULE_AUTHOR("Jason A. Donenfeld ");
-- 
2.30.0



[PATCH] crypto: lib/chacha20poly1305 - define empty module exit function

2021-01-15 Thread Jason A. Donenfeld
With no mod_exit function, users are unable to load the module after
use. I'm not aware of any reason why module unloading should be
prohibited for this one, so this commit simply adds an empty exit
function.

Reported-by: John Donnelly 
Signed-off-by: Jason A. Donenfeld 
---
 lib/crypto/chacha20poly1305.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/lib/crypto/chacha20poly1305.c b/lib/crypto/chacha20poly1305.c
index 5850f3b87359..c2fcdb98cc02 100644
--- a/lib/crypto/chacha20poly1305.c
+++ b/lib/crypto/chacha20poly1305.c
@@ -362,7 +362,12 @@ static int __init mod_init(void)
return 0;
 }
 
+static void __exit mod_exit(void)
+{
+}
+
 module_init(mod_init);
+module_exit(mod_exit);
 MODULE_LICENSE("GPL v2");
 MODULE_DESCRIPTION("ChaCha20Poly1305 AEAD construction");
 MODULE_AUTHOR("Jason A. Donenfeld ");
-- 
2.30.0



Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Jason A. Donenfeld
On Wed, Dec 23, 2020 at 5:03 PM Jason A. Donenfeld  wrote:
>
> Hi Peter,
>
> On Wed, Dec 23, 2020 at 5:01 PM Petr Tesarik  wrote:
> > I never suggested that this should serve as a supportive argument. I was 
> > just trying to be honest about our motivations.
> >
> > I'm a bit sad that this discussion has quickly gone back to the choice of 
> > algorithms and how they can be implemented.
>
> Why are you sad? You are interested in FIPS. FIPS indicates a certain
> set of algorithms. The ones most suitable to the task seem like they'd
> run into real practical problems in the kernel's RNG.
>
> That's not the _only_ reason I'm not keen on FIPS, but it does seem
> like a very basic one.
>
> Jason

And just to add to that: in working through Nicolai's patches (an
ongoing process), I'm reminded of his admonishment in the 00 cover
letter that at some point chacha20 will have to be replaced, due to
FIPS. So it seems like that's very much on the table. I brought it up
here as an example ("For example, " is how I began that sentence), but
it is a concern. If you want to make lots of changes for cryptographic
or technical reasons, that seems like a decent way to engage. But if
the motivation for each of these is the bean counting, then again, I'm
pretty wary of churn for nothing. And if that bean counting will
eventually lead us into bad corners, like the concerns I brought up
about FPU usage in the kernel, then I'm even more hesitant. However, I
think there may be good arguments to be made that some of Nicolai's
patches stand on their own, without the FIPS motivation. And that's
the set of arguments that are compelling.

Jason


Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Jason A. Donenfeld
Hi Peter,

On Wed, Dec 23, 2020 at 5:01 PM Petr Tesarik  wrote:
> I never suggested that this should serve as a supportive argument. I was just 
> trying to be honest about our motivations.
>
> I'm a bit sad that this discussion has quickly gone back to the choice of 
> algorithms and how they can be implemented.

Why are you sad? You are interested in FIPS. FIPS indicates a certain
set of algorithms. The ones most suitable to the task seem like they'd
run into real practical problems in the kernel's RNG.

That's not the _only_ reason I'm not keen on FIPS, but it does seem
like a very basic one.

Jason


Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Jason A. Donenfeld
On Wed, Dec 23, 2020 at 4:26 PM Stephan Mueller  wrote:
>
> On Wednesday, 2020-12-23 at 15:32 +0100, Jason A. Donenfeld wrote:
> >
> > I would, however, be interested in a keccak-based construction. But
> > just using the keccak permutation does not automatically make it
> > "SHA-3", so we're back at the same issue again. FIPS is simply not
> > interesting for our requirements.
>
> Using non-assessed cryptography? Sounds dangerous to me even though it may be
> based on some well-known construction.

"assessed" is not necessarily the same as FIPS. Don't conflate the
two. I don't appreciate that kind of dishonest argumentation.

And new constructions that I'm interested in would be formally
verified (like the other crypto work I've done) with review and buy-in
from the cryptographic community, both engineering and academic. I
have no interest in submitting "non-assessed" things developed in a
vacuum, and I'm displeased with your attempting to make that
characterization.

Similarly, any other new design proposed I would expect a similar
amount of rigor. The current RNG is admittedly a bit of a mess, but at
least it's a design that's evolved. Something that's "revolutionary",
rather than evolutionary, needs considerably more argumentation.

So, please, don't strawman this into the "non-assessed" rhetoric.


Re: drivers/char/random.c needs a (new) maintainer

2020-12-23 Thread Jason A. Donenfeld
On Wed, Dec 23, 2020 at 3:17 PM Petr Tesarik  wrote:
> Upfront, let me admit that SUSE has a vested interest in a FIPS-certifiable 
> Linux kernel.

Sorry, but just because you have a "vested interest", or a financial
interest, or because you want it does not suddenly make it a good
idea. The idea is to have good crypto, not to merely check some boxes
for the bean counters.

For example, it's very unlikely that future kernel RNGs will move to
using AES, due to the performance overhead involved on non-table-based
implementations, and the lack of availability of FPU/AES-NI in all the
contexts we need. NT's fortuna machine can use AES, because NT allows
the FPU in all contexts. We don't have that luxury (or associated
performance penalty).

I would, however, be interested in a keccak-based construction. But
just using the keccak permutation does not automatically make it
"SHA-3", so we're back at the same issue again. FIPS is simply not
interesting for our requirements.

Jason


Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

2020-12-18 Thread Jason A. Donenfeld
Hey Eric,

The solution you've proposed at the end of your email is actually kind
of similar to what we do with curve25519. Check out
include/crypto/curve25519.h. The critical difference between that and
the blake proposal is that it's in the header for curve25519, so the
indirection disappears.

Could we do that with headers for blake?

Jason
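
A hedged sketch of the header-dispatch pattern being referenced, loosely
modeled on include/crypto/curve25519.h; the function names and the config
symbol below are illustrative, not the real BLAKE2s API:

/* blake2s_dispatch.h -- illustrative sketch. Because the wrapper is a
 * static inline in the header, each call site resolves the compress
 * function at compile time, and the function-pointer indirection of the
 * helper-module approach disappears. */
#include <stddef.h>

struct blake2s_state { unsigned int h[8]; };

void blake2s_compress_generic(struct blake2s_state *st,
			      const unsigned char *block, size_t nblocks);
void blake2s_compress_arch(struct blake2s_state *st,
			   const unsigned char *block, size_t nblocks);

static inline void blake2s_compress(struct blake2s_state *st,
				    const unsigned char *block,
				    size_t nblocks)
{
#ifdef CONFIG_CRYPTO_ARCH_HAVE_LIB_BLAKE2S	/* illustrative symbol */
	blake2s_compress_arch(st, block, nblocks);
#else
	blake2s_compress_generic(st, block, nblocks);
#endif
}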


Re: [PATCH v2 09/11] crypto: blake2s - share the "shash" API boilerplate code

2020-12-18 Thread Jason A. Donenfeld
On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> Move the boilerplate code for setkey(), init(), update(), and final()
> from blake2s_generic.ko into a new module blake2s_helpers.ko, and export
> it so that it can be used by other shash implementations of BLAKE2s.
>
> setkey() and init() are exported as-is, while update() and final() have
> a blake2s_compress_t function pointer argument added.  This allows the
> implementation of the compression function to be overridden, which is
> the only part that optimized implementations really care about.
>
> The helper functions are defined in a separate module blake2s_helpers.ko
> (rather than just than in blake2s_generic.ko) because we can't simply
> select CRYPTO_BLAKE2B from CRYPTO_BLAKE2S_X86.  Doing this selection
> unconditionally would make the library API select the shash API, while
> doing it conditionally on CRYPTO_HASH would create a recursive kconfig
> dependency on CRYPTO_HASH.  As a bonus, using a separate module also
> allows the generic implementation to be omitted when unneeded.
>
> These helper functions very closely match the ones I defined for
> BLAKE2b, except the BLAKE2b ones didn't go in a separate module yet
> because BLAKE2b isn't exposed through the library API yet.
>
> Finally, use these new helper functions in the x86 implementation of
> BLAKE2s.  (This part should be a separate patch, but unfortunately the
> x86 implementation used the exact same function names like
> "crypto_blake2s_update()", so it had to be updated at the same time.)

There's a similar situation happening with chacha20poly1305 and with
curve25519. Each of these uses a mildly different approach to how we
split things up between library and crypto api code. The _helpers.ko
is another one. Is there any opportunity here to take a more
unified/consistent approach? Or is shash slightly different from the
others and benefits from a third way?

Jason


Re: [PATCH v2 11/11] wireguard: Kconfig: select CRYPTO_BLAKE2S_ARM

2020-12-18 Thread Jason A. Donenfeld
On Thu, Dec 17, 2020 at 11:25 PM Eric Biggers  wrote:
>
> From: Eric Biggers 
>
> When available, select the new implementation of BLAKE2s for 32-bit ARM.
> This is faster than the generic C implementation.
>
> Signed-off-by: Eric Biggers 

Reviewed-by: Jason A. Donenfeld 

When this series is ready, feel free to pull this commit in via
Herbert's tree, rather than my usual wireguard-linux.git->net-next.git
route. That will ensure the blake2s stuff lands all at once and we
won't have to synchronize anything.


Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b

2020-12-17 Thread Jason A. Donenfeld
On Thu, Dec 17, 2020 at 4:54 AM Eric Biggers  wrote:
>
> On Wed, Dec 16, 2020 at 11:32:44PM +0100, Jason A. Donenfeld wrote:
> > Hi Eric,
> >
> > On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers  wrote:
> > > By the way, if people are interested in having my ARM scalar
> > > implementation of BLAKE2s in the kernel too, I can send a patchset for
> > > that too.  It just ended up being slower than BLAKE2b and SHA-1, so it
> > > wasn't as good for the use case mentioned above.  If it were to be
> > > added as "blake2s-256-arm", we'd have:
> >
> > I'd certainly be interested in this. Any rough idea how it performs
> > for pretty small messages compared to the generic implementation?
> > 100-140 byte ranges? Is the speedup about the same as for longer
> > messages because this doesn't parallelize across multiple blocks?
> >
>
> It does one block at a time, and there isn't much overhead, so yes the speedup
> on short messages should be about the same as on long messages.
>
> I did a couple quick userspace benchmarks and got (still on Cortex-A7):
>
> 100-byte messages:
> BLAKE2s ARM: 28.9 cpb
> BLAKE2s generic: 42.4 cpb
>
> 140-byte messages:
> BLAKE2s ARM: 29.5 cpb
> BLAKE2s generic: 44.0 cpb
>
> The results in the kernel may differ a bit, but probably not by much.

That's certainly a nice improvement though, and I'd very much welcome
the faster implementation.

Jason


Re: [PATCH 0/5] crypto: add NEON-optimized BLAKE2b

2020-12-16 Thread Jason A. Donenfeld
Hi Eric,

On Wed, Dec 16, 2020 at 9:48 PM Eric Biggers  wrote:
> By the way, if people are interested in having my ARM scalar implementation
> of BLAKE2s in the kernel too, I can send a patchset for that too.  It just
> ended up being slower than BLAKE2b and SHA-1, so it wasn't as good for the
> use case mentioned above.  If it were to be added as "blake2s-256-arm",
> we'd have:

I'd certainly be interested in this. Any rough idea how it performs
for pretty small messages compared to the generic implementation?
100-140 byte ranges? Is the speedup about the same as for longer
messages because this doesn't parallelize across multiple blocks?

Jason


Re: drivers/char/random.c needs a (new) maintainer

2020-12-01 Thread Jason A. Donenfeld
On Mon, Nov 30, 2020 at 5:56 PM Theodore Y. Ts'o  wrote:
> patches this cycle.  One thing that would help me is if folks
> (especially Jason, if you would) could start with a detailed review of
> Nicolai's patches.

Sure, I'll take a look.


Re: drivers/char/random.c needs a (new) maintainer

2020-11-30 Thread Jason A. Donenfeld
I am willing to maintain random.c and have intentions to have a
formally verified RNG. I've mentioned this to Ted before.

But I think Ted's reluctance to not accept the recent patches sent to
this list is mostly justified, and I have no desire to see us rush
into replacing random.c with something suboptimal or FIPSy.


Re: [PATCH 000/141] Fix fall-through warnings for Clang

2020-11-23 Thread Jason Gunthorpe
On Fri, Nov 20, 2020 at 12:21:39PM -0600, Gustavo A. R. Silva wrote:

>   IB/hfi1: Fix fall-through warnings for Clang
>   IB/mlx4: Fix fall-through warnings for Clang
>   IB/qedr: Fix fall-through warnings for Clang
>   RDMA/mlx5: Fix fall-through warnings for Clang

I picked these four to the rdma tree, thanks

Jason


Re: [PATCH cryptodev] crypto: lib/chacha20poly1305 - allow users to specify 96bit nonce

2020-11-17 Thread Jason A. Donenfeld
On Tue, Nov 17, 2020 at 9:32 AM Ard Biesheuvel  wrote:
> If you are going back to the drawing board with in-kernel acceleration
> for OpenVPN

As far as I can tell, they're mostly after compatibility with their
existing userspace stuff. Otherwise, if they were going back to the
drawing board, they could just make openvpn userspace set up xfrm or
wg tunnels to achieve basically the same design. And actually, the
xfrm approach kind of makes a lot of sense for what they're doing; it
was designed for that type of split-daemon tunneling design.


Re: [PATCH cryptodev] crypto: lib/chacha20poly1305 - allow users to specify 96bit nonce

2020-11-17 Thread Jason A. Donenfeld
Nack.

This API is meant to take simple integers, so that programmers can use
atomic64_t with it and have safe nonces. I'm also interested in
preserving the API's ability to safely encrypt more than 4 gigs of
data at once. Passing a buffer also encourages people to use
randomized nonces, which isn't really safe. Finally, there are no
in-tree users of 96bit nonces for this interface. If you're after a
cornucopia of compatibility primitives, the ipsec stuff might be more
to your liking. Or, add a new simple function/API. But adding
complexity to users of the existing one and confusing future users of
it is a non-starter. It's supposed to be deliberately non-awful to
use.
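
For reference, the usage pattern the u64-nonce design enables.
chacha20poly1305_encrypt() is the existing lib/crypto interface;
seal_packet() and the counter are illustrative:

#include <crypto/chacha20poly1305.h>
#include <linux/atomic.h>

static atomic64_t tx_counter = ATOMIC64_INIT(0);

/* dst must have room for len + CHACHA20POLY1305_AUTHTAG_SIZE bytes. */
static void seal_packet(u8 *dst, const u8 *src, size_t len,
			const u8 key[CHACHA20POLY1305_KEY_SIZE])
{
	/* A monotonic counter can never repeat for the life of the key;
	 * a randomly chosen 96-bit nonce collides with non-negligible
	 * probability once enough messages are sent under one key. */
	u64 nonce = atomic64_inc_return(&tx_counter);

	chacha20poly1305_encrypt(dst, src, len, NULL, 0, nonce, key);
}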


Re: [PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager

2020-11-13 Thread Jason A. Donenfeld
On Fri, Nov 13, 2020 at 8:58 PM Herbert Xu  wrote:
>
> On Fri, Nov 13, 2020 at 03:34:36PM +0100, Jason A. Donenfeld wrote:
> > Thanks. FYI, I intended this for crypto-2.6.git rather than
> > cryptodev-2.6.git, since it fixes a build failure and is a trivial
> > fix.
>
> Well this has been broken since January so I don't see the urgency
> in it going in right away.  It'll be backported eventually.

Okay, no problem.


Re: [PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager

2020-11-13 Thread Jason A. Donenfeld
Thanks. FYI, I intended this for crypto-2.6.git rather than
cryptodev-2.6.git, since it fixes a build failure and is a trivial
fix.


Re: [PATCH] crypto: sha - split sha.h into sha1.h and sha2.h

2020-11-13 Thread Jason A. Donenfeld
Thanks for doing this.

Acked-by: Jason A. Donenfeld 


[PATCH crypto] crypto: Kconfig - CRYPTO_MANAGER_EXTRA_TESTS requires the manager

2020-11-02 Thread Jason A. Donenfeld
The extra tests in the manager actually require the manager to be
selected too. Otherwise the linker gives errors like:

ld: arch/x86/crypto/chacha_glue.o: in function `chacha_simd_stream_xor':
chacha_glue.c:(.text+0x422): undefined reference to `crypto_simd_disabled_for_test'

Fixes: 2343d1529aff ("crypto: Kconfig - allow tests to be disabled when manager is disabled")
Signed-off-by: Jason A. Donenfeld 
---
 crypto/Kconfig | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 094ef56ab7b4..37de7d006858 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -145,7 +145,7 @@ config CRYPTO_MANAGER_DISABLE_TESTS
 
 config CRYPTO_MANAGER_EXTRA_TESTS
bool "Enable extra run-time crypto self tests"
-   depends on DEBUG_KERNEL && !CRYPTO_MANAGER_DISABLE_TESTS
+   depends on DEBUG_KERNEL && !CRYPTO_MANAGER_DISABLE_TESTS && CRYPTO_MANAGER
help
  Enable extra run-time self tests of registered crypto algorithms,
  including randomized fuzz tests.
-- 
2.29.1



Re: [PATCH] crypto: arm/chacha-neon - optimize for non-block size multiples

2020-11-01 Thread Jason A. Donenfeld
Cool patch! I look forward to getting out the old arm32 rig and
benching this. One question:

On Sun, Nov 1, 2020 at 5:33 PM Ard Biesheuvel  wrote:
> On out-of-order microarchitectures such as Cortex-A57, this results in
> a speedup for 1420 byte blocks of about 21%, without any significant
> performance impact on the power-of-2 block sizes. On lower end cores
> such as Cortex-A53, the speedup for 1420 byte blocks is only about 2%,
> but also without impacting other input sizes.

A57 and A53 are 64-bit, but this is code for 32-bit arm, right? So the
comparison is more like A15 vs A5? Or are you running 32-bit kernels
on armv8 hardware?


Re: [RFC] treewide: cleanup unreachable breaks

2020-10-19 Thread Jason Gunthorpe
On Mon, Oct 19, 2020 at 12:42:15PM -0700, Nick Desaulniers wrote:
> On Sat, Oct 17, 2020 at 10:43 PM Greg KH  wrote:
> >
> > On Sat, Oct 17, 2020 at 09:09:28AM -0700, t...@redhat.com wrote:
> > > From: Tom Rix 
> > >
> > > This is a upcoming change to clean up a new warning treewide.
> > > I am wondering if the change could be one mega patch (see below) or
> > > normal patch per file about 100 patches or somewhere half way by 
> > > collecting
> > > early acks.
> >
> > Please break it up into one-patch-per-subsystem, like normal, and get it
> > merged that way.
> >
> > Sending us a patch, without even a diffstat to review, isn't going to
> > get you very far...
> 
> Tom,
> If you're able to automate this cleanup, I suggest checking in a
> script that can be run on a directory.  Then for each subsystem you
> can say in your commit "I ran scripts/fix_whatever.py on this subdir."
>  Then others can help you drive the tree wide cleanup.  Then we can
> enable -Wunreachable-code-break either by default, or W=2 right now
> might be a good idea.

I remember using clang-modernize in the past to fix issues very
similar to this, if clang machinery can generate the warning, can't
something like clang-tidy directly generate the patch?

You can send me a patch for drivers/infiniband/* as well

Thanks,
Jason


Re: [DISCUSSION PATCH 00/41] random: possible ways towards NIST SP800-90B compliance

2020-09-21 Thread Jason A. Donenfeld
I haven't looked into the details of this patchset yet, but your
description here indicates to me that this is motivated by FIPS
certification desires, which...worries me. I would like to rewrite the
RNG at some point, and I've started to work on a bunch of designs for
this (and proving them correct, too), but going about this via FIPS
certification or trying to implement some NIST specs is most certainly
the wrong way to go about this, will lock us into subpar crypto for
years, and is basically a waste of time.


Re: [trivial PATCH] treewide: Convert switch/case fallthrough; to break;

2020-09-09 Thread Jason Gunthorpe
On Wed, Sep 09, 2020 at 01:06:39PM -0700, Joe Perches wrote:
> fallthrough to a separate case/default label break; isn't very readable.
> 
> Convert pseudo-keyword fallthrough; statements to a simple break; when
> the next label is case or default and the only statement in the next
> label block is break;
> 
> Found using:
> 
> $ grep-2.5.4 -rP --include=*.[ch] -n "fallthrough;(\s*(case\s+\w+|default)\s*:\s*){1,7}break;" *
> 
> Miscellanea:
> 
> o Move or coalesce a couple label blocks above a default: block.
> 
> Signed-off-by: Joe Perches 
> ---
> 
> Compiled allyesconfig x86-64 only.
> A few files for other arches were not compiled.

IB part looks OK, I prefer it like this

You could do the same for continue as well, I saw a few of those..

Thanks,
Jason


Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c

2020-09-07 Thread Jason A. Donenfeld
rbx, %%r10;""  movq %%r10, 96(%0);"
> + "  mulxq 32(%3), %%r8, %%r9;"  "  xor %%r10d, %%r10d;"   "  
> adcxq 88(%0), %%r8;"   "  movq %%r8, 88(%0);"
> + "  mulxq 40(%3), %%r10, %%r11;""  adox %%r9, %%r10;" "  
> adcx %%rbx, %%r10;""  movq %%r10, 96(%0);"
>   "  mulxq 48(%3), %%rbx, %%r13;""  adox %%r11, %%rbx;""  
> adcx %%r14, %%rbx;""  movq %%rbx, 104(%0);""  mov $0, %%r8;"
>   "  mulxq 56(%3), %%r14, %%rdx;""  adox %%r13, %%r14;""  
> adcx %%rax, %%r14;""  movq %%r14, 112(%0);""  mov $0, %%rax;"
>  "  adox %%rdx, %%rax;""  
> adcx %%r8, %%rax;" "  movq %%rax, 120(%0);"
> @@ -312,7 +312,7 @@ static inline void fmul2(u64 *out, const u64 *f1, const 
> u64 *f2, u64 *tmp)
>   /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */
>   "  mov $38, %%rdx;"
>   "  mulxq 32(%1), %%r8, %%r13;"
> - "  xor %3, %3;"
> + "  xor %k3, %k3;"
>   "  adoxq 0(%1), %%r8;"
>   "  mulxq 40(%1), %%r9, %%rbx;"
>   "  adcx %%r13, %%r9;"
> @@ -345,7 +345,7 @@ static inline void fmul2(u64 *out, const u64 *f1, const 
> u64 *f2, u64 *tmp)
>   /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */
>   "  mov $38, %%rdx;"
>   "  mulxq 96(%1), %%r8, %%r13;"
> - "  xor %3, %3;"
> + "  xor %k3, %k3;"
>   "  adoxq 64(%1), %%r8;"
>   "  mulxq 104(%1), %%r9, %%rbx;"
>   "  adcx %%r13, %%r9;"
> @@ -516,7 +516,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp)
>  
>   /* Step 1: Compute all partial products */
>   "  movq 0(%1), %%rdx;"   /* 
> f[0] */
> - "  mulxq 8(%1), %%r8, %%r14;"  "  xor %%r15, %%r15;" /* 
> f[1]*f[0] */
> + "  mulxq 8(%1), %%r8, %%r14;"  "  xor %%r15d, %%r15d;"   /* 
> f[1]*f[0] */
>   "  mulxq 16(%1), %%r9, %%r10;" "  adcx %%r14, %%r9;" /* 
> f[2]*f[0] */
>   "  mulxq 24(%1), %%rax, %%rcx;""  adcx %%rax, %%r10;"/* 
> f[3]*f[0] */
>   "  movq 24(%1), %%rdx;"  /* 
> f[3] */
> @@ -526,7 +526,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp)
>   "  mulxq 16(%1), %%rax, %%rcx;""  mov $0, %%r14;"/* 
> f[2]*f[1] */
>  
>   /* Step 2: Compute two parallel carry chains */
> - "  xor %%r15, %%r15;"
> + "  xor %%r15d, %%r15d;"
>   "  adox %%rax, %%r10;"
>   "  adcx %%r8, %%r8;"
>   "  adox %%rcx, %%r11;"
> @@ -563,7 +563,7 @@ static inline void fsqr(u64 *out, const u64 *f, u64 *tmp)
>   /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */
>   "  mov $38, %%rdx;"
>   "  mulxq 32(%1), %%r8, %%r13;"
> - "  xor %%rcx, %%rcx;"
> + "  xor %%ecx, %%ecx;"
>   "  adoxq 0(%1), %%r8;"
>   "  mulxq 40(%1), %%r9, %%rbx;"
>   "  adcx %%r13, %%r9;"
> @@ -607,7 +607,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>   asm volatile(
>   /* Step 1: Compute all partial products */
>   "  movq 0(%1), %%rdx;"   /* 
> f[0] */
> - "  mulxq 8(%1), %%r8, %%r14;"  "  xor %%r15, %%r15;" /* 
> f[1]*f[0] */
> + "  mulxq 8(%1), %%r8, %%r14;"  "  xor %%r15d, %%r15d;"   /* 
> f[1]*f[0] */
>   "  mulxq 16(%1), %%r9, %%r10;" "  adcx %%r14, %%r9;" /* 
> f[2]*f[0] */
>   "  mulxq 24(%1), %%rax, %%rcx;""  adcx %%rax, %%r10;"/* 
> f[3]*f[0] */
>   "  movq 24(%1), %%rdx;"  /* 
> f[3] */
> @@ -617,7 +617,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>   "  mulxq 16(%1), %%rax, %%rcx;""  mov $0, %%r14;"/* 
> f[2]*f[1] */
>  
>   /* Step 2: Compute two parallel carry chains */
> - "  xor %%r15, %%r15;"
> + "  xor %%r15d, %%r15d;"
>   "  adox %%rax, %%r10;"
>   "  adcx %%r8, %%r8;"
>   "  adox %%rcx, %%r11;"
> @@ -647,7 +647,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>  
>   /* Step 1: Compute all partial products */
>   "  movq 32(%1), %%rdx;"   
> /* f[0] */
> - "  mulxq 40(%1), %%r8, %%r14;"  "  xor %%r15, %%r15;" 
> /* f[1]*f[0] */
> + "  mulxq 40(%1), %%r8, %%r14;" "  xor %%r15d, %%r15d;"   /* 
> f[1]*f[0] */
>   "  mulxq 48(%1), %%r9, %%r10;" "  adcx %%r14, %%r9;" /* 
> f[2]*f[0] */
>   "  mulxq 56(%1), %%rax, %%rcx;""  adcx %%rax, %%r10;"/* 
> f[3]*f[0] */
>   "  movq 56(%1), %%rdx;"  /* 
> f[3] */
> @@ -657,7 +657,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>   "  mulxq 48(%1), %%rax, %%rcx;""  mov $0, %%r14;"/* 
> f[2]*f[1] */
>  
>   /* Step 2: Compute two parallel carry chains */
> - "  xor %%r15, %%r15;"
> + "  xor %%r15d, %%r15d;"
>   "  adox %%rax, %%r10;"
>   "  adcx %%r8, %%r8;"
>   "  adox %%rcx, %%r11;"
> @@ -692,7 +692,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>   /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */
>   "  mov $38, %%rdx;"
>   "  mulxq 32(%1), %%r8, %%r13;"
> - "  xor %%rcx, %%rcx;"
> + "  xor %%ecx, %%ecx;"
>   "  adoxq 0(%1), %%r8;"
>   "  mulxq 40(%1), %%r9, %%rbx;"
>   "  adcx %%r13, %%r9;"
> @@ -725,7 +725,7 @@ static inline void fsqr2(u64 *out, const u64 *f, u64 *tmp)
>   /* Step 1: Compute dst + carry == tmp_hi * 38 + tmp_lo */
>   "  mov $38, %%rdx;"
>   "  mulxq 96(%1), %%r8, %%r13;"
> - "  xor %%rcx, %%rcx;"
> + "  xor %%ecx, %%ecx;"
>   "  adoxq 64(%1), %%r8;"
>   "  mulxq 104(%1), %%r9, %%rbx;"
>   "  adcx %%r13, %%r9;"
> -- 
> 2.26.2
> 

Looks like this is going into HACL in the end, so:

Acked-by: Jason A. Donenfeld 

for cryptodev-2.6.git, rather than crypto-2.6.git

Thanks,
Jason


Re: [PATCH] crypto/x86: Use XORL r32,32 in poly1305-x86_64-cryptogams.pl

2020-09-07 Thread Jason A. Donenfeld
Hi Uros, Herbert,

On Thu, Aug 27, 2020 at 07:38:31PM +0200, Uros Bizjak wrote:
> x86_64 zero extends 32bit operations, so for 64bit operands,
> XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids
> a REX prefix byte when legacy registers are used.
> 
> Signed-off-by: Uros Bizjak 
> Cc: Herbert Xu 
> Cc: "David S. Miller" 
> ---
>  arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> index 137edcf038cb..7d568012cc15 100644
> --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> @@ -246,7 +246,7 @@ $code.=<<___ if (!$kernel);
>  ___
>  &declare_function("poly1305_init_x86_64", 32, 3);
>  $code.=<<___;
> - xor %rax,%rax
> + xor %eax,%eax
>   mov %rax,0($ctx)# initialize hash value
>   mov %rax,8($ctx)
>   mov %rax,16($ctx)
> @@ -2853,7 +2853,7 @@ $code.=<<___;
>  .typepoly1305_init_base2_44,\@function,3
>  .align   32
>  poly1305_init_base2_44:
> - xor %rax,%rax
> + xor %eax,%eax
>   mov %rax,0($ctx)# initialize hash value
>   mov %rax,8($ctx)
>   mov %rax,16($ctx)
> @@ -3947,7 +3947,7 @@ xor128_decrypt_n_pad:
>   mov \$16,$len
>   sub %r10,$len
>   xor %eax,%eax
> - xor %r11,%r11
> + xor %r11d,%r11d
>  .Loop_dec_byte:
>   mov ($inp,$otp),%r11b
>   mov ($otp),%al
> @@ -4085,7 +4085,7 @@ avx_handler:
>   .long   0xa548f3fc  # cld; rep movsq
>  
>   mov $disp,%rsi
> - xor %rcx,%rcx   # arg1, UNW_FLAG_NHANDLER
> + xor %ecx,%ecx   # arg1, UNW_FLAG_NHANDLER
>   mov 8(%rsi),%rdx# arg2, disp->ImageBase
>   mov 0(%rsi),%r8 # arg3, disp->ControlPc
>   mov 16(%rsi),%r9# arg4, disp->FunctionEntry
> -- 
> 2.26.2
> 

Per the discussion elsewhere,

Acked-by: Jason A. Donenfeld 

for cryptodev-2.6.git, rather than crypto-2.6.git

Thanks,
Jason


Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c

2020-09-02 Thread Jason A. Donenfeld
On Wed, Sep 2, 2020 at 1:42 PM Uros Bizjak  wrote:
>
> On Wed, Sep 2, 2020 at 11:17 AM  wrote:
> >
> > On Wed, Sep 02, 2020 at 07:50:36AM +0200, Uros Bizjak wrote:
> > > On Tue, Sep 1, 2020 at 9:12 PM Jason A. Donenfeld  wrote:
> > > >
> > > > On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld  
> > > > wrote:
> > > > > operands are the same. Also, have you seen any measurable differences
> > > > > when benching this? I can stick it into kbench9000 to see if you
> > > > > haven't looked yet.
> > > >
> > > > On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable
> > > > difference with this, at all, no matter how much I median or mean or
> > > > reduce noise by disabling interrupts.
> > > >
> > > > One thing that sticks out is that all the replacements of r8-r15 by
> > > > their %r8d-r15d counterparts still have the REX prefix, as is
> > > > necessary to access those registers. The only ones worth changing,
> > > > then, are the legacy registers, and on the whole, this amounts to only
> > > > 48 bytes of difference.
> > >
> > > The patch implements one of x86 target specific optimizations,
> > > performed by gcc. The optimization results in code size savings of one
> > > byte, where REX prefix is omitted with legacy registers, but otherwise
> > > should have no measurable runtime effect. Since gcc applies this
> > > optimization universally to all integer registers, I took the same
> > > approach and implemented the same change to legacy and REX registers.
> > > As measured above, 48 bytes saved is a good result for such a trivial
> > > optimization.
> >
> > Could we instead implement this optimization in GAS ? Then we can leave
> > the code as-is.
>
> I don't think that e.g. "xorq %rax,%rax" should universally be
> translated to "xorl %eax,%eax" in the assembler. Perhaps the author
> expected exactly 3 bytes (to align the code or similar), and the
> assembler would change the length to 2 bytes behind his back, breaking
> the expectations.

Are you sure that's something that GAS actually provides now? Seems
like a lot of mnemonics have ambiguous/injective encodings, and this
wouldn't make things any different. Most authors use .byte or .align
when they care about the actual bytes, no?


Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c

2020-09-02 Thread Jason A. Donenfeld
On Wed, Sep 2, 2020 at 11:44 AM  wrote:
>
> On Wed, Sep 02, 2020 at 07:50:36AM +0200, Uros Bizjak wrote:
> > On Tue, Sep 1, 2020 at 9:12 PM Jason A. Donenfeld  wrote:
> > >
> > > On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld  wrote:
> > > > operands are the same. Also, have you seen any measurable differences
> > > > when benching this? I can stick it into kbench9000 to see if you
> > > > haven't looked yet.
> > >
> > > On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable
> > > difference with this, at all, no matter how much I median or mean or
> > > reduce noise by disabling interrupts.
> > >
> > > One thing that sticks out is that all the replacements of r8-r15 by
> > > their %r8d-r15d counterparts still have the REX prefix, as is
> > > necessary to access those registers. The only ones worth changing,
> > > then, are the legacy registers, and on the whole, this amounts to only
> > > 48 bytes of difference.
> >
> > The patch implements one of x86 target specific optimizations,
> > performed by gcc. The optimization results in code size savings of one
> > byte, where REX prefix is omitted with legacy registers, but otherwise
> > should have no measurable runtime effect. Since gcc applies this
> > optimization universally to all integer registers, I took the same
> > approach and implemented the same change to legacy and REX registers.
> > As measured above, 48 bytes saved is a good result for such a trivial
> > optimization.
>
> Could we instead implement this optimization in GAS ? Then we can leave
> the code as-is.

I rather like that idea. Though I wonder if some would balk at it for
smelling a bit like the MIPS assembler with its optimization pass...


Re: [PATCH] crypto/x86: Use XORL r32,32 in poly1305-x86_64-cryptogams.pl

2020-09-01 Thread Jason A. Donenfeld
Hi Uros,

Any benchmarks for this? Seems like it's all in initialization code,
right? I'm CC'ing Andy into this.

Jason

On Thu, Aug 27, 2020 at 07:38:31PM +0200, Uros Bizjak wrote:
> x86_64 zero extends 32bit operations, so for 64bit operands,
> XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids
> a REX prefix byte when legacy registers are used.
> 
> Signed-off-by: Uros Bizjak 
> Cc: Herbert Xu 
> Cc: "David S. Miller" 
> ---
>  arch/x86/crypto/poly1305-x86_64-cryptogams.pl | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> index 137edcf038cb..7d568012cc15 100644
> --- a/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> +++ b/arch/x86/crypto/poly1305-x86_64-cryptogams.pl
> @@ -246,7 +246,7 @@ $code.=<<___ if (!$kernel);
>  ___
>  &declare_function("poly1305_init_x86_64", 32, 3);
>  $code.=<<___;
> - xor %rax,%rax
> + xor %eax,%eax
>   mov %rax,0($ctx)# initialize hash value
>   mov %rax,8($ctx)
>   mov %rax,16($ctx)
> @@ -2853,7 +2853,7 @@ $code.=<<___;
>  .typepoly1305_init_base2_44,\@function,3
>  .align   32
>  poly1305_init_base2_44:
> - xor %rax,%rax
> + xor %eax,%eax
>   mov %rax,0($ctx)# initialize hash value
>   mov %rax,8($ctx)
>   mov %rax,16($ctx)
> @@ -3947,7 +3947,7 @@ xor128_decrypt_n_pad:
>   mov \$16,$len
>   sub %r10,$len
>   xor %eax,%eax
> - xor %r11,%r11
> + xor %r11d,%r11d
>  .Loop_dec_byte:
>   mov ($inp,$otp),%r11b
>   mov ($otp),%al
> @@ -4085,7 +4085,7 @@ avx_handler:
>   .long   0xa548f3fc  # cld; rep movsq
>  
>   mov $disp,%rsi
> - xor %rcx,%rcx   # arg1, UNW_FLAG_NHANDLER
> + xor %ecx,%ecx   # arg1, UNW_FLAG_NHANDLER
>   mov 8(%rsi),%rdx# arg2, disp->ImageBase
>   mov 0(%rsi),%r8 # arg3, disp->ControlPc
>   mov 16(%rsi),%r9# arg4, disp->FunctionEntry
> -- 
> 2.26.2
> 

-- 
Jason A. Donenfeld
Deep Space Explorer
fr: +33 6 51 90 82 66
us: +1 513 476 1200
www.jasondonenfeld.com
www.zx2c4.com
zx2c4.com/keys/AB9942E6D4A4CFC3412620A749FC7012A5DE03AE.asc


Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c

2020-09-01 Thread Jason A. Donenfeld
On Tue, Sep 1, 2020 at 8:13 PM Jason A. Donenfeld  wrote:
> operands are the same. Also, have you seen any measurable differences
> when benching this? I can stick it into kbench9000 to see if you
> haven't looked yet.

On a Skylake server (Xeon Gold 5120), I'm unable to see any measurable
difference with this, at all, no matter how much I median or mean or
reduce noise by disabling interrupts.

One thing that sticks out is that all the replacements of r8-r15 by
their %r8d-r15d counterparts still have the REX prefix, as is
necessary to access those registers. The only ones worth changing,
then, are the legacy registers, and on the whole, this amounts to only
48 bytes of difference.
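
The encodings behind those numbers, per the x86-64 ISA; the byte values
in the comments are easy to verify by compiling the snippet below and
running objdump -d on the object file:

void xor_demo(void)
{
	/* Legacy register: dropping REX.W saves one byte. */
	asm volatile("xorq %%rax, %%rax" ::: "rax");	/* 48 31 c0 */
	asm volatile("xorl %%eax, %%eax" ::: "rax");	/* 31 c0    */

	/* Extended register: a REX prefix is needed just to name r8,
	 * so the 32-bit form is the same length and saves nothing. */
	asm volatile("xorq %%r8, %%r8" ::: "r8");	/* 4d 31 c0 */
	asm volatile("xorl %%r8d, %%r8d" ::: "r8");	/* 45 31 c0 */
}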


Re: [PATCH] crypto/x86: Use XORL r32,32 in curve25519-x86_64.c

2020-09-01 Thread Jason A. Donenfeld
Hi Uros,

Thanks for this patch. This seems correct to me, since indeed those
clear the top bits anyway, and having smaller code seems good. But
first -- I'm wondering -- have you stuck this into Vale/Hacl to see if
it still checks out there? I'm CC'ing Karthik/Aymeric/Chris/Jonathan
who might be interested in taking a look at that. Seems like it might
be easy to just special case this in Xor64 for the case where the
operands are the same. Also, have you seen any measurable differences
when benching this? I can stick it into kbench9000 to see if you
haven't looked yet.

Jason

On Tue, Sep 1, 2020 at 5:46 PM Ard Biesheuvel  wrote:
>
> (+ Jason)
>
> On Thu, 27 Aug 2020 at 20:31, Uros Bizjak  wrote:
> >
> > x86_64 zero extends 32bit operations, so for 64bit operands,
> > XORL r32,r32 is functionally equal to XORQ r64,r64, but avoids
> > a REX prefix byte when legacy registers are used.
> >
> > Signed-off-by: Uros Bizjak 
> > Cc: Herbert Xu 
> > Cc: "David S. Miller" 
> > ---
> >  arch/x86/crypto/curve25519-x86_64.c | 68 ++---
> >  1 file changed, 34 insertions(+), 34 deletions(-)
> >
> > diff --git a/arch/x86/crypto/curve25519-x86_64.c b/arch/x86/crypto/curve25519-x86_64.c
> > index 8acbb6584a37..a9edb6f8a0ba 100644
> > --- a/arch/x86/crypto/curve25519-x86_64.c
> > +++ b/arch/x86/crypto/curve25519-x86_64.c
> > @@ -45,11 +45,11 @@ static inline u64 add_scalar(u64 *out, const u64 *f1, 
> > u64 f2)
> >
> > asm volatile(
> > /* Clear registers to propagate the carry bit */
> > -   "  xor %%r8, %%r8;"
> > -   "  xor %%r9, %%r9;"
> > -   "  xor %%r10, %%r10;"
> > -   "  xor %%r11, %%r11;"
> > -   "  xor %1, %1;"
> > +   "  xor %%r8d, %%r8d;"
> > +   "  xor %%r9d, %%r9d;"
> > +   "  xor %%r10d, %%r10d;"
> > +   "  xor %%r11d, %%r11d;"
> > +   "  xor %k1, %k1;"
> >
> > /* Begin addition chain */
> > "  addq 0(%3), %0;"
> > @@ -93,7 +93,7 @@ static inline void fadd(u64 *out, const u64 *f1, const 
> > u64 *f2)
> > "  cmovc %0, %%rax;"
> >
> > /* Step 2: Add carry*38 to the original sum */
> > -   "  xor %%rcx, %%rcx;"
> > +   "  xor %%ecx, %%ecx;"
> > "  add %%rax, %%r8;"
> > "  adcx %%rcx, %%r9;"
> > "  movq %%r9, 8(%1);"
> > @@ -165,28 +165,28 @@ static inline void fmul(u64 *out, const u64 *f1, 
> > const u64 *f2, u64 *tmp)
> >
> > /* Compute src1[0] * src2 */
> > "  movq 0(%1), %%rdx;"
> > -   "  mulxq 0(%3), %%r8, %%r9;"   "  xor %%r10, %%r10;"
> >  "  movq %%r8, 0(%0);"
> > +   "  mulxq 0(%3), %%r8, %%r9;"   "  xor %%r10d, %%r10d;"  
> >  "  movq %%r8, 0(%0);"
> > "  mulxq 8(%3), %%r10, %%r11;" "  adox %%r9, %%r10;"
> >  "  movq %%r10, 8(%0);"
> > "  mulxq 16(%3), %%rbx, %%r13;""  adox %%r11, %%rbx;"
> > "  mulxq 24(%3), %%r14, %%rdx;""  adox %%r13, %%r14;"   
> >  "  mov $0, %%rax;"
> >"  adox %%rdx, %%rax;"
> > /* Compute src1[1] * src2 */
> > "  movq 8(%1), %%rdx;"
> > -   "  mulxq 0(%3), %%r8, %%r9;"   "  xor %%r10, %%r10;"
> >  "  adcxq 8(%0), %%r8;""  movq %%r8, 8(%0);"
> > +   "  mulxq 0(%3), %%r8, %%r9;"   "  xor %%r10d, %%r10d;"  
> >  "  adcxq 8(%0), %%r8;""  movq %%r8, 8(%0);"
> > "  mulxq 8(%3), %%r10, %%r11;" "  adox %%r9, %%r10;"
> >  "  adcx %%rbx, %%r10;""  movq %%r10, 16(%0);"
> > "  mulxq 16(%3), %%rbx, %%r13;""  adox %%r11, %%rbx;"   
> >  "  adcx %%r14, %%rbx;""  mov $0, %%r8;"
> > "  mulxq 24(%3), %%r14, %%rdx;""  adox %%r13, %%r14;"   
> >  "  adcx %%rax, %%r1

Re: [PATCH] crypto: arm/poly1305 - Add prototype for poly1305_blocks_neon

2020-08-25 Thread Jason A. Donenfeld
On Tue, Aug 25, 2020 at 3:23 AM Herbert Xu  wrote:
>
> This patch adds a prototype for poly1305_blocks_neon to slience
> a compiler warning:
>
>   CC [M]  arch/arm/crypto/poly1305-glue.o
> ../arch/arm/crypto/poly1305-glue.c:25:13: warning: no previous prototype for `poly1305_blocks_neon' [-Wmissing-prototypes]
>  void __weak poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 hibit)
>  ^~~~
>
> Signed-off-by: Herbert Xu 
>
> diff --git a/arch/arm/crypto/poly1305-glue.c b/arch/arm/crypto/poly1305-glue.c
> index 13cfef4ae22e..3023c1acfa19 100644
> --- a/arch/arm/crypto/poly1305-glue.c
> +++ b/arch/arm/crypto/poly1305-glue.c
> @@ -20,6 +20,7 @@
>
>  void poly1305_init_arm(void *state, const u8 *key);
>  void poly1305_blocks_arm(void *state, const u8 *src, u32 len, u32 hibit);
> +void poly1305_blocks_neon(void *state, const u8 *src, u32 len, u32 hibit);

LGTM.

Acked-by: Jason A. Donenfeld 


Re: [PATCH] crypto: arm/curve25519 - include

2020-08-24 Thread Jason A. Donenfeld
On Mon, Aug 24, 2020 at 4:13 PM Fabio Estevam  wrote:
>
> Building ARM allmodconfig leads to the following warnings:
>
> arch/arm/crypto/curve25519-glue.c:73:12: error: implicit declaration of function 'sg_copy_to_buffer' [-Werror=implicit-function-declaration]
> arch/arm/crypto/curve25519-glue.c:74:9: error: implicit declaration of function 'sg_nents_for_len' [-Werror=implicit-function-declaration]
> arch/arm/crypto/curve25519-glue.c:88:11: error: implicit declaration of function 'sg_copy_from_buffer' [-Werror=implicit-function-declaration]
>
> Include <linux/scatterlist.h> to fix such warnings

This patch seems correct to me -- sg_copy_to_buffer, sg_nents_for_len, and
sg_copy_from_buffer are all declared in <linux/scatterlist.h>.
I wonder what header dependency chain caused us to miss this before.
Either way,

Acked-by: Jason A. Donenfeld 


Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t

2020-07-28 Thread Jason A. Donenfeld
On Tue, Jul 28, 2020 at 10:07 AM David Laight  wrote:
>
> From: Christoph Hellwig
> > Sent: 27 July 2020 17:24
> >
> > On Mon, Jul 27, 2020 at 06:16:32PM +0200, Jason A. Donenfeld wrote:
> > > Maybe sockptr_advance should have some safety checks and sometimes
> > > return -EFAULT? Or you should always use the implementation where
> > > being a kernel address is an explicit bit of sockptr_t, rather than
> > > being implicit?
> >
> > I already have a patch to use access_ok to check the whole range in
> > init_user_sockptr.
>
> That doesn't make (much) difference to the code paths that ignore
> the user-supplied length.
> OTOH doing the user/kernel check on the base address (not an
> incremented one) means that the correct copy function is always
> selected.

Right, I had the same reaction in reading this, but actually, his code
gets rid of the sockptr_advance stuff entirely and never mutates, so
even though my point about attacking those pointers was missed, the
code does the better thing now -- checking the base address and never
mutating the pointer. So I think we're good.

>
> Perhaps the functions should all be passed a 'const sockptr_t'.
> The typedef could be made 'const' - requiring non-const items
> explicitly use the union/struct itself.

I was thinking the same, but just by making the pointers inside the
struct const. However, making the whole struct const via the typedef
is a much better idea. That'd probably require changing the signature
of init_user_sockptr a bit, which would be fine, but indeed I think
this would be a very positive change.

Jason
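
A hedged sketch of what that const typedef might look like, as a
user-space model; illustrative only, not what was ultimately merged:

#include <stdbool.h>

#define __user	/* no-op here; in the kernel it is the sparse annotation */

typedef const struct sockptr {
	union {
		void		*kernel;
		void __user	*user;
	};
	bool		is_kernel;
} sockptr_t;

/* Every sockptr_t is now immutable: helpers may read the base address,
 * but a sockptr_advance()-style mutation no longer compiles, because it
 * would assign through a const object. Code that genuinely needs a
 * mutable copy has to name 'struct sockptr' explicitly. */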


Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t

2020-07-27 Thread Jason A. Donenfeld
> From cce2d2e1b43ecee5f4af7cf116808b74b330080f Mon Sep 17 00:00:00 2001
> From: Christoph Hellwig 
> Date: Mon, 27 Jul 2020 17:42:27 +0200
> Subject: net: remove sockptr_advance
>
> sockptr_advance never properly worked.  Replace it with _offset variants
> of copy_from_sockptr and copy_to_sockptr.
>
> Fixes: ba423fdaa589 ("net: add a new sockptr_t type")
> Reported-by: Jason A. Donenfeld 
> Reported-by: Ido Schimmel 
> Signed-off-by: Christoph Hellwig 
> ---
>  drivers/crypto/chelsio/chtls/chtls_main.c | 12 +-
>  include/linux/sockptr.h   | 27 +++
>  net/dccp/proto.c  |  5 ++---
>  net/ipv4/netfilter/arp_tables.c   |  8 +++
>  net/ipv4/netfilter/ip_tables.c|  8 +++
>  net/ipv4/tcp.c|  5 +++--
>  net/ipv6/ip6_flowlabel.c  | 11 -
>  net/ipv6/netfilter/ip6_tables.c   |  8 +++
>  net/netfilter/x_tables.c  |  7 +++---
>  net/tls/tls_main.c|  6 ++---
>  10 files changed, 49 insertions(+), 48 deletions(-)
>
> diff --git a/drivers/crypto/chelsio/chtls/chtls_main.c 
> b/drivers/crypto/chelsio/chtls/chtls_main.c
> index c3058dcdb33c5c..66d247efd5615b 100644
> --- a/drivers/crypto/chelsio/chtls/chtls_main.c
> +++ b/drivers/crypto/chelsio/chtls/chtls_main.c
> @@ -525,9 +525,9 @@ static int do_chtls_setsockopt(struct sock *sk, int 
> optname,
> /* Obtain version and type from previous copy */
> crypto_info[0] = tmp_crypto_info;
> /* Now copy the following data */
> -   sockptr_advance(optval, sizeof(*crypto_info));
> -   rc = copy_from_sockptr((char *)crypto_info + 
> sizeof(*crypto_info),
> -   optval,
> +   rc = copy_from_sockptr_offset((char *)crypto_info +
> +   sizeof(*crypto_info),
> +   optval, sizeof(*crypto_info),
> sizeof(struct tls12_crypto_info_aes_gcm_128)
> - sizeof(*crypto_info));
>
> @@ -542,9 +542,9 @@ static int do_chtls_setsockopt(struct sock *sk, int 
> optname,
> }
> case TLS_CIPHER_AES_GCM_256: {
> crypto_info[0] = tmp_crypto_info;
> -   sockptr_advance(optval, sizeof(*crypto_info));
> -   rc = copy_from_sockptr((char *)crypto_info + 
> sizeof(*crypto_info),
> -   optval,
> +   rc = copy_from_sockptr_offset((char *)crypto_info +
> +   sizeof(*crypto_info),
> +   optval, sizeof(*crypto_info),
> sizeof(struct tls12_crypto_info_aes_gcm_256)
> - sizeof(*crypto_info));
>
> diff --git a/include/linux/sockptr.h b/include/linux/sockptr.h
> index b13ea1422f93a5..9e6c81d474cba8 100644
> --- a/include/linux/sockptr.h
> +++ b/include/linux/sockptr.h
> @@ -69,19 +69,26 @@ static inline bool sockptr_is_null(sockptr_t sockptr)
> return !sockptr.user;
>  }
>
> -static inline int copy_from_sockptr(void *dst, sockptr_t src, size_t size)
> +static inline int copy_from_sockptr_offset(void *dst, sockptr_t src,
> +   size_t offset, size_t size)
>  {
> if (!sockptr_is_kernel(src))
> -   return copy_from_user(dst, src.user, size);
> -   memcpy(dst, src.kernel, size);
> +   return copy_from_user(dst, src.user + offset, size);
> +   memcpy(dst, src.kernel + offset, size);
> return 0;
>  }
>
> -static inline int copy_to_sockptr(sockptr_t dst, const void *src, size_t 
> size)
> +static inline int copy_from_sockptr(void *dst, sockptr_t src, size_t size)
> +{
> +   return copy_from_sockptr_offset(dst, src, 0, size);
> +}
> +
> +static inline int copy_to_sockptr_offset(sockptr_t dst, size_t offset,
> +   const void *src, size_t size)
>  {
> if (!sockptr_is_kernel(dst))
> -   return copy_to_user(dst.user, src, size);
> -   memcpy(dst.kernel, src, size);
> +   return copy_to_user(dst.user + offset, src, size);
> +   memcpy(dst.kernel + offset, src, size);
> return 0;
>  }
>
> @@ -112,14 +119,6 @@ static inline void *memdup_sockptr_nul(sockptr_t src, 
> size_t len)
> return p;
>  }
>
> -static inline void sockptr_advance(sockptr_t sockptr, size_t len)
> -{
> -   if (sockptr_is_kernel(sockptr))
> -   sockptr.kernel += len;
> -   else
> -   sockptr.user += len;
> -}
> -
>  static i

Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t

2020-07-27 Thread Jason A. Donenfeld
On Mon, Jul 27, 2020 at 5:06 PM Christoph Hellwig  wrote:
>
> On Mon, Jul 27, 2020 at 05:03:10PM +0200, Jason A. Donenfeld wrote:
> > Hi Christoph,
> >
> > On Thu, Jul 23, 2020 at 08:08:54AM +0200, Christoph Hellwig wrote:
> > > diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> > > index da933f99b5d517..42befbf12846c0 100644
> > > --- a/net/ipv4/ip_sockglue.c
> > > +++ b/net/ipv4/ip_sockglue.c
> > > @@ -1422,7 +1422,8 @@ int ip_setsockopt(struct sock *sk, int level,
> > > optname != IP_IPSEC_POLICY &&
> > > optname != IP_XFRM_POLICY &&
> > > !ip_mroute_opt(optname))
> > > -   err = nf_setsockopt(sk, PF_INET, optname, optval, optlen);
> > > +   err = nf_setsockopt(sk, PF_INET, optname, 
> > > USER_SOCKPTR(optval),
> > > +   optlen);
> > >  #endif
> > > return err;
> > >  }
> > > diff --git a/net/ipv4/netfilter/ip_tables.c 
> > > b/net/ipv4/netfilter/ip_tables.c
> > > index 4697d09c98dc3e..f2a9680303d8c0 100644
> > > --- a/net/ipv4/netfilter/ip_tables.c
> > > +++ b/net/ipv4/netfilter/ip_tables.c
> > > @@ -1102,7 +1102,7 @@ __do_replace(struct net *net, const char *name, 
> > > unsigned int valid_hooks,
> > >  }
> > >
> > >  static int
> > > -do_replace(struct net *net, const void __user *user, unsigned int len)
> > > +do_replace(struct net *net, sockptr_t arg, unsigned int len)
> > >  {
> > > int ret;
> > > struct ipt_replace tmp;
> > > @@ -1110,7 +1110,7 @@ do_replace(struct net *net, const void __user 
> > > *user, unsigned int len)
> > > void *loc_cpu_entry;
> > > struct ipt_entry *iter;
> > >
> > > -   if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
> > > +   if (copy_from_sockptr(&tmp, arg, sizeof(tmp)) != 0)
> > > return -EFAULT;
> > >
> > > /* overflow check */
> > > @@ -1126,8 +1126,8 @@ do_replace(struct net *net, const void __user 
> > > *user, unsigned int len)
> > > return -ENOMEM;
> > >
> > > loc_cpu_entry = newinfo->entries;
> > > -   if (copy_from_user(loc_cpu_entry, user + sizeof(tmp),
> > > -  tmp.size) != 0) {
> > > +   sockptr_advance(arg, sizeof(tmp));
> > > +   if (copy_from_sockptr(loc_cpu_entry, arg, tmp.size) != 0) {
> > > ret = -EFAULT;
> > > goto free_newinfo;
> > > }
> >
> > Something along this path seems to have broken with this patch. An
> > invocation of `iptables -A INPUT -m length --length 1360 -j DROP` now
> > fails, with
> >
> > nf_setsockopt->do_replace->translate_table->check_entry_size_and_hooks:
> >   (unsigned char *)e + e->next_offset > limit  ==>  TRUE
> >
> > resulting in the whole call chain returning -EINVAL. It bisects back to
> > this commit. This is on net-next.
>
> This is another use o sockptr_advance that Ido already found a problem
> in.  I'm looking into this at the moment..

I haven't seen Ido's patch, but it seems clear the issue is that you
want to call `sockptr_advance(&arg, sizeof(tmp))`, and adjust
sockptr_advance to take a pointer.

Slight concern about the whole concept:

Things are defined as

typedef union {
void*kernel;
void __user *user;
} sockptr_t;
static inline bool sockptr_is_kernel(sockptr_t sockptr)
{
return (unsigned long)sockptr.kernel >= TASK_SIZE;
}

So what happens if we have some code like:

sockptr_t sp;
init_user_sockptr(&sp, user_controlled_struct.extra_user_ptr);
sockptr_advance(&sp, user_controlled_struct.some_big_offset);
copy_to_sockptr(&sp, user_controlled_struct.a_few_bytes,
sizeof(user_controlled_struct.a_few_bytes));

With the user controlling some_big_offset, he can convert the user
sockptr into a kernel sockptr, causing the subsequent copy_to_sockptr
to be a vanilla memcpy, after which a security disaster ensues.

Maybe sockptr_advance should have some safety checks and sometimes
return -EFAULT? Or you should always use the implementation where
being a kernel address is an explicit bit of sockptr_t, rather than
being implicit?


Re: [PATCH] crypto: x86/curve25519 - Remove unused carry variables

2020-07-27 Thread Jason A. Donenfeld
On Mon, Jul 27, 2020 at 5:08 PM Karthik Bhargavan
 wrote:
>
> Removing unused variables is harmless. (GCC would do this automatically.)
> So this change seems fine.

Thanks for confirming.  Hopefully we can get that change upstream in
HACL* too, so that the code comes out the same.

Acked-by: Jason A. Donenfeld 


Re: [PATCH 12/26] netfilter: switch nf_setsockopt to sockptr_t

2020-07-27 Thread Jason A. Donenfeld
Hi Christoph,

On Thu, Jul 23, 2020 at 08:08:54AM +0200, Christoph Hellwig wrote:
> diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
> index da933f99b5d517..42befbf12846c0 100644
> --- a/net/ipv4/ip_sockglue.c
> +++ b/net/ipv4/ip_sockglue.c
> @@ -1422,7 +1422,8 @@ int ip_setsockopt(struct sock *sk, int level,
>   optname != IP_IPSEC_POLICY &&
>   optname != IP_XFRM_POLICY &&
>   !ip_mroute_opt(optname))
> - err = nf_setsockopt(sk, PF_INET, optname, optval, optlen);
> + err = nf_setsockopt(sk, PF_INET, optname, USER_SOCKPTR(optval),
> + optlen);
>  #endif
>   return err;
>  }
> diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
> index 4697d09c98dc3e..f2a9680303d8c0 100644
> --- a/net/ipv4/netfilter/ip_tables.c
> +++ b/net/ipv4/netfilter/ip_tables.c
> @@ -1102,7 +1102,7 @@ __do_replace(struct net *net, const char *name, 
> unsigned int valid_hooks,
>  }
>  
>  static int
> -do_replace(struct net *net, const void __user *user, unsigned int len)
> +do_replace(struct net *net, sockptr_t arg, unsigned int len)
>  {
>   int ret;
>   struct ipt_replace tmp;
> @@ -1110,7 +1110,7 @@ do_replace(struct net *net, const void __user *user, 
> unsigned int len)
>   void *loc_cpu_entry;
>   struct ipt_entry *iter;
>  
> - if (copy_from_user(&tmp, user, sizeof(tmp)) != 0)
> + if (copy_from_sockptr(&tmp, arg, sizeof(tmp)) != 0)
>   return -EFAULT;
>  
>   /* overflow check */
> @@ -1126,8 +1126,8 @@ do_replace(struct net *net, const void __user *user, 
> unsigned int len)
>   return -ENOMEM;
>  
>   loc_cpu_entry = newinfo->entries;
> - if (copy_from_user(loc_cpu_entry, user + sizeof(tmp),
> -tmp.size) != 0) {
> + sockptr_advance(arg, sizeof(tmp));
> + if (copy_from_sockptr(loc_cpu_entry, arg, tmp.size) != 0) {
>   ret = -EFAULT;
>   goto free_newinfo;
>   }

Something along this path seems to have broken with this patch. An
invocation of `iptables -A INPUT -m length --length 1360 -j DROP` now
fails, with

nf_setsockopt->do_replace->translate_table->check_entry_size_and_hooks:
  (unsigned char *)e + e->next_offset > limit  ==>  TRUE

resulting in the whole call chain returning -EINVAL. It bisects back to
this commit. This is on net-next.

Jason


Re: [PATCH] crypto: x86/curve25519 - Remove unused carry variables

2020-07-23 Thread Jason A. Donenfeld
Hi Herbert,

On Thu, Jul 23, 2020 at 9:51 AM Herbert Xu  wrote:
>
> The carry variables are assigned but never used, which upsets
> the compiler.  This patch removes them.
>
> Signed-off-by: Herbert Xu 
>
> diff --git a/arch/x86/crypto/curve25519-x86_64.c 
> b/arch/x86/crypto/curve25519-x86_64.c
> index 8a17621f7d3a..8acbb6584a37 100644
> --- a/arch/x86/crypto/curve25519-x86_64.c
> +++ b/arch/x86/crypto/curve25519-x86_64.c
> @@ -948,10 +948,8 @@ static void store_felem(u64 *b, u64 *f)
>  {
> u64 f30 = f[3U];
> u64 top_bit0 = f30 >> (u32)63U;
> -   u64 carry0;
> u64 f31;
> u64 top_bit;
> -   u64 carry;
> u64 f0;
> u64 f1;
> u64 f2;
> @@ -970,11 +968,11 @@ static void store_felem(u64 *b, u64 *f)
> u64 o2;
> u64 o3;
> f[3U] = f30 & (u64)0x7fffU;
> -   carry0 = add_scalar(f, f, (u64)19U * top_bit0);
> +   add_scalar(f, f, (u64)19U * top_bit0);
> f31 = f[3U];
> top_bit = f31 >> (u32)63U;
> f[3U] = f31 & (u64)0x7fffU;
> -   carry = add_scalar(f, f, (u64)19U * top_bit);
> +   add_scalar(f, f, (u64)19U * top_bit);
> f0 = f[0U];
> f1 = f[1U];
> f2 = f[2U];
> --

That seems obvious and reasonable, and so I'm inclined to ack this,
but I first wanted to give Karthik (CC'd) a chance to chime in here,
as it's his HACL* project that's responsible, and he might have some
curious insight.

Jason


Re: [PATCH v4 0/3] mm, treewide: Rename kzfree() to kfree_sensitive()

2020-06-16 Thread Jason A. Donenfeld
On Tue, Jun 16, 2020 at 12:54 PM Joe Perches  wrote:
>
> On Mon, 2020-06-15 at 21:57 -0400, Waiman Long wrote:
> >  v4:
> >   - Break out the memzero_explicit() change as suggested by Dan Carpenter
> > so that it can be backported to stable.
> >   - Drop the "crypto: Remove unnecessary memzero_explicit()" patch for
> > now as there can be a bit more discussion on what is best. It will be
> > introduced as a separate patch later on after this one is merged.
>
> To this larger audience and last week without reply:
> https://lore.kernel.org/lkml/573b3fbd5927c643920e1364230c296b23e7584d.ca...@perches.com/
>
> Are there _any_ fastpath uses of kfree or vfree?

The networking stack has various places where there will be a quick
kmalloc followed by a kfree for an incoming or outgoing packet. One
place that comes to mind would be esp_alloc_tmp, which does a quick
allocation of some temporary kmalloc memory, processes some packet
things inside of that, and then frees it, sometimes in the same
function, and sometimes later in an async callback. I don't know how
"fastpath" you consider this, but usually packet processing is
something people want to do with minimal overhead, considering how
fast NICs are these days.


Re: [PATCH 1/2] crypto: virtio: fix src/dst scatterlist calculation

2020-05-24 Thread Jason Wang



On 2020/5/25 8:56 AM, Longpeng(Mike) wrote:

The system will crash when we insmod crypto/tcrypt.ko with mode=38.

Usually the next entry of an sg will be @sg@ + 1, but if this sg element
is part of a chained scatterlist, it could jump to the start of a new
scatterlist array. Let's fix it by using sg_next() when walking the
src/dst scatterlists.

BTW, I also add a check on the return value of sg_nents_for_len(),
since that function can fail.

Cc: Gonglei 
Cc: Herbert Xu 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: "David S. Miller" 
Cc: virtualizat...@lists.linux-foundation.org
Cc: linux-ker...@vger.kernel.org

Reported-by: LABBE Corentin 
Signed-off-by: Gonglei 
Signed-off-by: Longpeng(Mike) 
---
  drivers/crypto/virtio/virtio_crypto_algs.c | 14 ++
  1 file changed, 10 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/virtio/virtio_crypto_algs.c 
b/drivers/crypto/virtio/virtio_crypto_algs.c
index 372babb44112..2fa1129f96d6 100644
--- a/drivers/crypto/virtio/virtio_crypto_algs.c
+++ b/drivers/crypto/virtio/virtio_crypto_algs.c
@@ -359,8 +359,14 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
unsigned int num_out = 0, num_in = 0;
int sg_total;
uint8_t *iv;
+   struct scatterlist *sg;
  
  	src_nents = sg_nents_for_len(req->src, req->cryptlen);

+   if (src_nents < 0) {
+   pr_err("Invalid number of src SG.\n");
+   return src_nents;
+   }
+
dst_nents = sg_nents(req->dst);
  
  	pr_debug("virtio_crypto: Number of sgs (src_nents: %d, dst_nents: %d)\n",

@@ -446,12 +452,12 @@ __virtio_crypto_skcipher_do_req(struct 
virtio_crypto_sym_request *vc_sym_req,
vc_sym_req->iv = iv;
  
  	/* Source data */

-   for (i = 0; i < src_nents; i++)
-   sgs[num_out++] = &req->src[i];
+   for (sg = req->src, i = 0; sg && i < src_nents; sg = sg_next(sg), i++)



Any reason sg is checked here?

I believe it should be checked in sg_nents_for_len().
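
For reference, sg_nents_for_len() already walks the chain with
sg_next() and fails when the list is shorter than the requested
length; paraphrasing lib/scatterlist.c:

int sg_nents_for_len(struct scatterlist *sg, u64 len)
{
	int nents;
	u64 total;

	if (!len)
		return 0;

	for (nents = 0, total = 0; sg; sg = sg_next(sg)) {
		nents++;
		total += sg->length;
		if (total >= len)
			return nents;	/* enough entries for @len bytes */
	}

	return -EINVAL;	/* list ended before @len bytes */
}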



+   sgs[num_out++] = sg;
  
  	/* Destination data */

-   for (i = 0; i < dst_nents; i++)
-   sgs[num_out + num_in++] = &req->dst[i];
+   for (sg = req->dst, i = 0; sg && i < dst_nents; sg = sg_next(sg), i++)
+   sgs[num_out + num_in++] = sg;



I believe sg should be checked in sg_nents().

Thanks


  
  	/* Status */

sg_init_one(&status_sg, &vc_req->status, sizeof(vc_req->status));




Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10

2020-05-05 Thread Jason A. Donenfeld
On Tue, May 5, 2020 at 5:22 PM Jason A. Donenfeld  wrote:
>
> On Tue, May 5, 2020 at 5:19 PM Kees Cook  wrote:
> >
> > On Tue, May 05, 2020 at 04:37:38PM -0600, Jason A. Donenfeld wrote:
> > > On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor
> > >  wrote:
> > > > I believe these issues are one in the same. I did a reverse bisect with
> > > > Arnd's test case and converged on George's first patch:
> > > >
> > > > https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a
> > > >
> > > > I think that in lieu of this patch, we should have that patch and its
> > > > follow-up fix merged into 10.0.1.
> > >
> > > If this is fixed in 10.0.1, do we even need to patch the kernel at
> > > all? Or can we just leave it be, considering most organizations using
> > > clang know what they're getting into? I'd personally prefer the
> > > latter, so that we don't clutter things.
> >
> > I agree: I'd rather this was fixed in 10.0.1 (but if we do want a
> > kernel-side work-around for 10.0.0, I would suggest doing the version
> > check in the Kconfig for FORTIFY_SOURCE instead of in the Makefile,
> > as that's where these things are supposed to live these days).
>
> Indeed this belongs in the Makefile. I can send a patch adjusting

er, Kconfig, is what I meant to type.


Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10

2020-05-05 Thread Jason A. Donenfeld
On Tue, May 5, 2020 at 5:19 PM Kees Cook  wrote:
>
> On Tue, May 05, 2020 at 04:37:38PM -0600, Jason A. Donenfeld wrote:
> > On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor
> >  wrote:
> > > I believe these issues are one in the same. I did a reverse bisect with
> > > Arnd's test case and converged on George's first patch:
> > >
> > > https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a
> > >
> > > I think that in lieu of this patch, we should have that patch and its
> > > follow-up fix merged into 10.0.1.
> >
> > If this is fixed in 10.0.1, do we even need to patch the kernel at
> > all? Or can we just leave it be, considering most organizations using
> > clang know what they're getting into? I'd personally prefer the
> > latter, so that we don't clutter things.
>
> I agree: I'd rather this was fixed in 10.0.1 (but if we do want a
> kernel-side work-around for 10.0.0, I would suggest doing the version
> check in the Kconfig for FORTIFY_SOURCE instead of in the Makefile,
> as that's where these things are supposed to live these days).

Indeed this belongs in the Makefile. I can send a patch adjusting
that, if you want, but I think I'd rather do nothing and have a fix be
rolled out in 10.0.1. Clang users should know what to expect in that
regard.

> (Though as was mentioned, it's likely that FORTIFY_SOURCE isn't working
> _at all_ under Clang, so I may still send a patch to depend on !clang
> just to avoid surprises until it's fixed, but I haven't had time to
> chase down a solution yet.)

That might be the most coherent thing to do, at least so that people
don't get some false sense of mitigation.

Jason


Re: [PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10

2020-05-05 Thread Jason A. Donenfeld
On Tue, May 5, 2020 at 4:25 PM Nathan Chancellor
 wrote:
> I believe these issues are one in the same. I did a reverse bisect with
> Arnd's test case and converged on George's first patch:
>
> https://github.com/llvm/llvm-project/commit/2dd17ff08165e6118e70f00e22b2c36d2d4e0a9a
>
> I think that in lieu of this patch, we should have that patch and its
> follow-up fix merged into 10.0.1.

If this is fixed in 10.0.1, do we even need to patch the kernel at
all? Or can we just leave it be, considering most organizations using
clang know what they're getting into? I'd personally prefer the
latter, so that we don't clutter things.


[PATCH] Kbuild: disable FORTIFY_SOURCE on clang-10

2020-05-05 Thread Jason A. Donenfeld
clang-10 has a broken optimization stage that doesn't enable the
compiler to prove at compile time that certain memcpys are within
bounds, and thus the outline memcpy is always called, resulting in
horrific performance, and in some cases, excessive stack frame growth.
Here's a simple reproducer:

typedef unsigned long size_t;
void *c(void *dest, const void *src, size_t n) __asm__("memcpy");
extern inline __attribute__((gnu_inline)) void *memcpy(void *dest, const 
void *src, size_t n) { return c(dest, src, n); }
void blah(char *a)
{
  unsigned long long b[10], c[10];
  int i;

  memcpy(b, a, sizeof(b));
  for (i = 0; i < 10; ++i)
c[i] = b[i] ^ b[9 - i];
  for (i = 0; i < 10; ++i)
b[i] = c[i] ^ a[i];
  memcpy(a, b, sizeof(b));
}

Compile this with clang-9 and clang-10 and observe:

zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-10 
-Wframe-larger-than=0 -O3 -c b.c -o c10.o
b.c:5:6: warning: stack frame size of 104 bytes in function 'blah' 
[-Wframe-larger-than=]
void blah(char *a)
 ^
1 warning generated.
zx2c4@thinkpad /tmp/curve25519-hacl64-stack-frame-size-test $ clang-9 
-Wframe-larger-than=0 -O3 -c b.c -o c9.o

Looking at the disassembly of c10.o and c9.o, one can see that c9.o is
properly optimized in the obvious way one would expect, while c10.o has
blown up and includes extern calls to memcpy.

This is present on the most trivial bits of code. Thus, for clang-10, we
just set __NO_FORTIFY globally, so that this issue won't be incurred.

Cc: Arnd Bergmann 
Cc: LKML 
Cc: clang-built-linux 
Cc: Kees Cook 
Cc: George Burgess 
Cc: Nick Desaulniers 
Link: https://bugs.llvm.org/show_bug.cgi?id=45802
Signed-off-by: Jason A. Donenfeld 
---
 Makefile | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/Makefile b/Makefile
index 49b2709ff44e..f022f077591d 100644
--- a/Makefile
+++ b/Makefile
@@ -768,6 +768,13 @@ KBUILD_CFLAGS += -Wno-gnu
 # source of a reference will be _MergedGlobals and not on of the whitelisted 
names.
 # See modpost pattern 2
 KBUILD_CFLAGS += -mno-global-merge
+
+# clang-10 has a broken optimization stage that causes memcpy to always be
+# outline, resulting in excessive stack frame growth and poor performance.
+ifeq ($(shell test $(CONFIG_CLANG_VERSION) -ge 10 && test $(CONFIG_CLANG_VERSION) -lt 11; echo $$?),0)
+KBUILD_CFLAGS += -D__NO_FORTIFY
+endif
+
 else
 
 # These warnings generated too much noise in a regular build.
-- 
2.26.2



Re: [PATCH] crypto: curve25519-hacl64 - Disable fortify-source for clang-10

2020-05-05 Thread Jason A. Donenfeld
On Tue, May 5, 2020 at 3:48 PM Nick Desaulniers  wrote:
>
> + Kees, George, who have started looking into this, too.

I have a smaller reproducer and analysis I'll send out very shortly.


Re: [PATCH] crypto: curve25519-hacl64 - Disable fortify-source for clang-10

2020-05-05 Thread Jason A. Donenfeld
As discussed on IRC, this issue here isn't specific to this file, but
rather fortify source has some serious issues on clang-10, everywhere
in the kernel, and we should probably disable it globally for this
clang version. I'll follow up with some more info in a different
patch.


Re: [PATCH] net: wireguard: avoid unused variable warning

2020-05-05 Thread Jason A. Donenfeld
On Tue, May 5, 2020 at 8:13 AM Arnd Bergmann  wrote:
>
> clang points out a harmless use of uninitialized variables that
> get passed into a local function but are ignored there:
>
> In file included from drivers/net/wireguard/ratelimiter.c:223:
> drivers/net/wireguard/selftest/ratelimiter.c:173:34: error: variable 'skb6' 
> is uninitialized when used here [-Werror,-Wuninitialized]
> ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
>^~~~
> drivers/net/wireguard/selftest/ratelimiter.c:123:29: note: initialize the 
> variable 'skb6' to silence this warning
> struct sk_buff *skb4, *skb6;
>^
> = NULL
> drivers/net/wireguard/selftest/ratelimiter.c:173:40: error: variable 'hdr6' 
> is uninitialized when used here [-Werror,-Wuninitialized]
> ret = timings_test(skb4, hdr4, skb6, hdr6, &test_count);
>  ^~~~
> drivers/net/wireguard/selftest/ratelimiter.c:125:22: note: initialize the 
> variable 'hdr6' to silence this warning
> struct ipv6hdr *hdr6;
> ^

Seems like the code is a bit easier to read and is more uniform
looking by just initializing those two variables to NULL, like the
warning suggests. If you don't mind, I'll queue something up in my
tree to this effect.

By the way, I'm having a bit of a hard time reproducing the warning
with either clang-10 or clang-9. Just for my own curiosity, would you
mind sending the .config that results in this?

Jason


Re: [PATCH 0/7] sha1 library cleanup

2020-05-02 Thread Jason A. Donenfeld
Thanks for this series. I like the general idea. I think it might make
sense, though, to separate things out into sha1.h and sha256.h. That
will be nice preparation work for when we eventually move obsolete
primitives into some  subdirectory.


Re: [PATCH v2] crypto: lib/sha256 - return void

2020-05-01 Thread Jason A. Donenfeld
On Fri, May 01, 2020 at 09:42:29AM -0700, Eric Biggers wrote:
> From: Eric Biggers 
> 
> The SHA-256 / SHA-224 library functions can't fail, so remove the
> useless return value.

Looks good to me, and also matches the signatures of the blake2s library
functions, which return void.

Reviewed-by: Jason A. Donenfeld 


Re: [PATCH crypto-stable v3 1/2] crypto: arch/lib - limit simd usage to 4k chunks

2020-04-28 Thread Jason A. Donenfeld
Hi Herbert,

This v3 patchset has a Reviewed-by from Ard for 1/2 and from Eric for
2/2, from last week. Could you submit this to Linus for rc4?

Thanks,
Jason


Re: [PATCH v4 25/35] crypto: BLAKE2s - x86_64 SIMD implementation

2019-10-23 Thread Jason A. Donenfeld
On Wed, Oct 23, 2019 at 6:55 AM Eric Biggers  wrote:
> There are no comments in this 685-line assembly language file.
> Is this the original version, or is it a generated/stripped version?

It looks like Ard forgot to import the latest one from Zinc, which is
significantly shorter and has other improvements too:

https://git.zx2c4.com/WireGuard/tree/src/crypto/zinc/blake2s/blake2s-x86_64.S


Re: [PATCH v3 00/29] crypto: crypto API library interfaces for WireGuard

2019-10-14 Thread Jason A. Donenfeld
Hi Ard,

Just to keep track of it in public, here are the things that we're
deferring from my original series for after this one is (if it is)
merged:

- Zinc's generic C implementation of poly1305, which is faster and has
separate implementations for u64 and u128. Should be uncontroversial,
but it's a code replacement, so not appropriate for this series here.
- x86_64 ChaCha20 from Zinc, which will spark interesting discussions
with Martin and might prove to be a bit controversial.
- x86_64 Poly1305 from Zinc, which I think Martin will be on board
with, but is also a code replacement.
- The big_keys patch, showing the simplification the function call
interface offers.
- WireGuard - things are quasi-ready, so when the time comes, I look
forward to submitting that to Dave's net tree as a single boring
[PATCH 1/1].

Did I leave anything out?

Jason


Re: [PATCH v3 24/29] crypto: lib/curve25519 - work around Clang stack spilling issue

2019-10-14 Thread Jason A. Donenfeld
Hi Ard,

On Mon, Oct 7, 2019 at 6:46 PM Ard Biesheuvel  wrote:
> Arnd reports that the 32-bit generic library code for Curve25519 ends
> up using an excessive amount of stack space when built with Clang:
>
>   lib/crypto/curve25519-fiat32.c:756:6: error: stack frame size
>   of 1384 bytes in function 'curve25519_generic'
>   [-Werror,-Wframe-larger-than=]
>
> Let's give some hints to the compiler regarding which routines should
> not be inlined, to prevent it from running out of registers and spilling
> to the stack. The resulting code performs identically under both GCC
> and Clang, and makes the warning go away.

Are you *sure* about that? Couldn't we fix clang instead? I'd rather
fixes go there instead of gimping this. The reason is that I noticed
before that this code, performance-wise, was very inlining sensitive.
Can you benchmark this on ARM32-noneon and on MIPS32? If there's a
performance difference there, then maybe you can defer this part of
the series until after the rest lands, and then we'll discuss at
length various strategies? Alternatively, if you benchmark those and
it also makes no difference, then it indeed makes no difference.

Jason


Re: [PATCH v3 21/29] crypto: BLAKE2s - generic C library implementation and selftest

2019-10-11 Thread Jason A. Donenfeld
On Thu, Oct 10, 2019 at 11:02:32PM -0700, Eric Biggers wrote:
> FYI, I had left a few review comments on Jason's last version of this patch
> (https://lkml.kernel.org/linux-crypto/20190326173759.GA607@zzz.localdomain/),
> some of which Jason addressed in the Wireguard repository
> (https://git.zx2c4.com/WireGuard) but they didn't make it into this patch.
> I'd suggest taking a look at the version there.

Indeed I hadn't updated the Zinc patchset since then, but you can see
the changes since ~March here:

https://git.zx2c4.com/WireGuard/log/src/crypto

There are actually quite a few interesting Blake changes.


Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function

2019-10-04 Thread Jason A. Donenfeld
On Fri, Oct 4, 2019 at 5:36 PM Ard Biesheuvel  wrote:
> > Just checking for Cortex-A7 being the boot CPU is probably
> > sufficient, that takes care of the common case of all the
> > A7-only embedded chips that people definitely are going to care
> > about for a long time.
> >
>
> But do you agree that disabling kernel mode NEON altogether for these
> systems is probably more sensible than testing for CPU part IDs in an
> arbitrary crypto driver?

No. That NEON code is _still_ faster than the generic C code. But it
is not as fast as the scalar code. There might be another primitive
that has a fast NEON implementation but does not have a fast scalar
implementation. The choice there would be between fast NEON and slow
generic. In that case, we want fast NEON. Also, different algorithms
lend themselves to different implementation strategies. Leave this up
to the chacha code, as Zinc does it, since this is the place that has
the most information to decide what it should be running.


Re: [PATCH v2 05/20] crypto: mips/chacha - import accelerated 32r2 code from Zinc

2019-10-04 Thread Jason A. Donenfeld
On Fri, Oct 4, 2019 at 4:44 PM Ard Biesheuvel  wrote:
> The round count is passed via the fifth function parameter, so it is
> already on the stack. Reloading it for every block doesn't sound like
> a huge deal to me.

Please benchmark it to indicate that, if it really isn't a big deal. I
recall finding that memory accesses on common mips32r2 commodity
router hardware was extremely inefficient. The whole thing is designed
to minimize memory accesses, which are the primary bottleneck on that
platform.

Seems like this thing might be best deferred for after this all lands.
IOW, let's get this in with the 20 round original now, and later you
can submit a change for the 12 round and René and I can spend time
dusting off our test rigs and seeing which strategy works best. I very
nearly tossed out a bunch of old router hardware last night when
cleaning up. Glad I saved it!


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Jason A. Donenfeld
On Fri, Oct 4, 2019 at 4:53 PM Andy Lutomirski  wrote:
> I think it might be better to allow two different modules to export the same 
> symbol but only allow one of them to be loaded. Or use static calls.

Static calls perform well and are well understood. This would be my preference.
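
To illustrate, a sketch of how a call site could look with the
proposed static_call API (names assumed; none of this is merged):

#include <linux/static_call.h>

void chacha20_generic(u8 *dst, const u8 *src, size_t len);
void chacha20_simd(u8 *dst, const u8 *src, size_t len);

/* The call site starts out targeting the generic implementation. */
DEFINE_STATIC_CALL(chacha20_impl, chacha20_generic);

void chacha20(u8 *dst, const u8 *src, size_t len)
{
	/* Patched into a direct call -- no pointer indirection. */
	static_call(chacha20_impl)(dst, src, len);
}

/* An arch-optimized module retargets the call site when it loads: */
static int __init chacha20_simd_mod_init(void)
{
	static_call_update(chacha20_impl, chacha20_simd);
	return 0;
}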


Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function

2019-10-04 Thread Jason A. Donenfeld
On Fri, Oct 4, 2019 at 4:23 PM Ard Biesheuvel  wrote:
> How is it relevant whether the boot CPU is A5 or A7? These are bL
> little cores that only implement NEON for feature parity with their bl
> big counterparts, but CPU intensive tasks are scheduled on big cores,
> where NEON performance is much better than scalar.

Yea big-little might confuse things indeed. Though the performance
difference between the NEON code and the scalar code is not that huge,
and I suspect that big-little machines might benefit from
unconditionally using the scalar code, given that sometimes they might
wind up doing things on the little cores.

Eric - what did you guys wind up doing on Android with the fast scalar
implementation?


Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function

2019-10-04 Thread Jason A. Donenfeld
On Fri, Oct 4, 2019 at 4:23 PM Ard Biesheuvel  wrote:
> > Did these changes make it into the existing tree?
>
> I'd like to keep Eric's code, but if it is really that much faster, we
> might drop it in arch/arm/lib so it supersedes the builtin code that
> /dev/random uses as well.

That was the idea with Zinc. For things like ARM and MIPS, the
optimized scalar code is really quite fast, and is worth using all the
time and not compiling the generic code at all.


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel  wrote:
> This is a followup to RFC 'crypto: wireguard with crypto API library 
> interface'
> [0]. Since no objections were raised to my approach, I've proceeded to fix up
> some minor issues, and incorporate [most of] the missing MIPS code.

Could you get the mips64 code into the next version? WireGuard is
widely used on the Ubiquiti EdgeRouter, which runs on the Octeon, a
64-bit MIPS chip.


Re: [PATCH v2 20/20] crypto: lib/chacha20poly1305 - reimplement crypt_from_sg() routine

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:17:13PM +0200, Ard Biesheuvel wrote:
> Reimplement the library routines to perform chacha20poly1305 en/decryption
> on scatterlists, without [ab]using the [deprecated] blkcipher interface,
> which is rather heavyweight and does things we don't really need.
> 
> Instead, we use the sg_miter API in a novel and clever way, to iterate
> over the scatterlist in-place (i.e., source == destination, which is the
> only way this library is expected to be used). That way, we don't have to
> iterate over two scatterlists in parallel.

Nice idea. Probably this will result in a real speedup, as I suspect
those extra prior kmaps weren't free. Looking forward to benching it.
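
For anyone following along, the pattern being described looks roughly
like this (a sketch, not the actual patch; the crypt_chunk helper is
a stand-in name):

struct sg_mapping_iter miter;

sg_miter_start(&miter, src, sg_nents(src),
	       SG_MITER_TO_SG | SG_MITER_ATOMIC);
while (sg_miter_next(&miter)) {
	/* miter.addr maps miter.length contiguous bytes; since
	 * src == dst, each chunk is en/decrypted in place. */
	chacha20poly1305_crypt_chunk(state, miter.addr, miter.length);
}
sg_miter_stop(&miter);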


Re: [PATCH v2 18/20] crypto: arm/Curve25519 - wire up NEON implementation

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:17:11PM +0200, Ard Biesheuvel wrote:
> +bool curve25519_arch(u8 out[CURVE25519_KEY_SIZE],
> +  const u8 scalar[CURVE25519_KEY_SIZE],
> +  const u8 point[CURVE25519_KEY_SIZE])
> +{
> + if (!have_neon || !crypto_simd_usable())
> + return false;
> + kernel_neon_begin();
> + curve25519_neon(out, scalar, point);
> + kernel_neon_end();
> + return true;
> +}
> +EXPORT_SYMBOL(curve25519_arch);

This now looks more like the Zinc way of doing things, with the _arch
function returning true or false based on whether or not the
implementation was available, allowing the generic implementation to
run.

I thought, from looking at the chacha commits earlier, you had done away
with that style of things in favor of weak symbols, whereby the arch
implementation would fall back to the generic one, instead of the other
way around?
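
The weak-symbol shape I mean, roughly sketched -- two files, names
assumed:

/* lib/crypto/chacha.c: default definition, marked weak. */
void __weak chacha_crypt(u32 *state, u8 *dst, const u8 *src, size_t len)
{
	chacha_crypt_generic(state, dst, src, len);
}

/* arch glue: the strong definition overrides the weak one at link
 * time, and itself falls back to the generic code when SIMD is
 * unusable. */
void chacha_crypt(u32 *state, u8 *dst, const u8 *src, size_t len)
{
	if (!crypto_simd_usable()) {
		chacha_crypt_generic(state, dst, src, len);
		return;
	}
	kernel_neon_begin();
	chacha_doneon(state, dst, src, len);
	kernel_neon_end();
}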


Re: [PATCH v2 14/20] crypto: Curve25519 - generic C library implementations and selftest

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:17:07PM +0200, Ard Biesheuvel wrote:
>- replace .c #includes with Kconfig based object selection

Cool!

> +config CRYPTO_ARCH_HAVE_LIB_CURVE25519
> + tristate
> +
> +config CRYPTO_ARCH_HAVE_LIB_CURVE25519_BASE
> + bool
> +
> +config CRYPTO_LIB_CURVE25519
> + tristate "Curve25519 scalar multiplication library"
> + depends on CRYPTO_ARCH_HAVE_LIB_CURVE25519 || !CRYPTO_ARCH_HAVE_LIB_CURVE25519

a || !a ==> true

Did you mean for one of these to be _BASE? Or is this a Kconfig trick of
a different variety that's intentional?
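
If it is intentional, I'd guess the usual tristate idiom -- sketched
below with a stand-in symbol name. In Kconfig's three-valued logic !m
evaluates to m, so the expression only bites when the arch symbol is
m, capping this symbol at m as well; i.e., the library can't be
built-in while the arch implementation is a module:

config CRYPTO_LIB_FOO
	tristate
	# FOO_ARCH=y  ->  y || n = y: no constraint
	# FOO_ARCH=n  ->  n || y = y: no constraint
	# FOO_ARCH=m  ->  m || m = m: capped at m
	depends on FOO_ARCH || !FOO_ARCH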

> +libcurve25519-y  := curve25519-fiat32.o
> +libcurve25519-$(CONFIG_ARCH_SUPPORTS_INT128) := curve25519-hacl64.o

Nice idea.


Re: [PATCH v2 04/20] crypto: arm/chacha - expose ARM ChaCha routine as library function

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 2, 2019 at 4:17 PM Ard Biesheuvel  wrote:
> Expose the accelerated NEON ChaCha routine directly as a symbol
> export so that users of the ChaCha library can use it directly.

Eric had some nice code for ChaCha for certain ARM cores that lived in
Zinc as chacha20-unrolled-arm.S. This code became active for certain
cores where NEON was bad and for cores with no NEON. The condition for
it was:

switch (read_cpuid_part()) {
   case ARM_CPU_PART_CORTEX_A7:
   case ARM_CPU_PART_CORTEX_A5:
   /* The Cortex-A7 and Cortex-A5 do not perform well with the NEON
* implementation but do incredibly with the scalar one and use
* less power.
*/
   break;
   default:
   chacha20_use_neon = elf_hwcap & HWCAP_NEON;
   }

...

for (;;) {
   if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
   len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context)) {
   const size_t bytes = min_t(size_t, len, PAGE_SIZE);

   chacha20_neon(dst, src, bytes, ctx->key, ctx->counter);
   ctx->counter[0] += (bytes + 63) / 64;
   len -= bytes;
   if (!len)
   break;
   dst += bytes;
   src += bytes;
   simd_relax(simd_context);
   } else {
   chacha20_arm(dst, src, len, ctx->key, ctx->counter);
   ctx->counter[0] += (len + 63) / 64;
   break;
   }
   }

It's another instance in which the generic code was totally optimized
out of Zinc builds.

Did these changes make it into the existing tree?


Re: [PATCH v2 10/20] crypto: mips/poly1305 - import accelerated 32r2 code from Zinc

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:17:03PM +0200, Ard Biesheuvel wrote:
> This integrates the accelerated MIPS 32r2 implementation of ChaCha
> into both the API and library interfaces of the kernel crypto stack.
> 
> The significance of this is that, in addition to becoming available
> as an accelerated library implementation, it can also be used by
> existing crypto API code such as IPsec, using chacha20poly1305.

Same thing here: can you please split the assembly into two commits so I
can see what you changed from the original?


Re: [PATCH v2 05/20] crypto: mips/chacha - import accelerated 32r2 code from Zinc

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:16:58PM +0200, Ard Biesheuvel wrote:
> This integrates the accelerated MIPS 32r2 implementation of ChaCha
> into both the API and library interfaces of the kernel crypto stack.
> 
> The significance of this is that, in addition to becoming available
> as an accelerated library implementation, it can also be used by
> existing crypto API code such as Adiantum (for block encryption on
> ultra low performance cores) or IPsec using chacha20poly1305. These
> are use cases that have already opted into using the abstract crypto
> API. In order to support Adiantum, the core assembler routine has
> been adapted to take the round count as a function argument rather
> than hardcoding it to 20.

Could you resubmit this with first my original commit and then with your
changes on top? I'd like to see and be able to review exactly what's
changed. If I recall correctly, René and I were really starved for
registers and tried pretty hard to avoid spilling to the stack, so I'm
interested to learn how you crammed a bit more sauce in there.

I also wonder if maybe it'd be better to just leave this as is with 20
rounds, which it was previously optimized for, and just not do
accelerated Adiantum for MIPS. Android has long since given up on the
ISA entirely.


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Jason A. Donenfeld
On Thu, Oct 03, 2019 at 10:43:29AM +0200, Ard Biesheuvel wrote:
> On Wed, 2 Oct 2019 at 16:17, Ard Biesheuvel  wrote:
> >
> ...
> >
> > In the future, I would like to extend these interfaces to use static calls,
> > so that the accelerated implementations can be [un]plugged at runtime. For
> > the time being, we rely on weak aliases and conditional exports so that the
> > users of the library interfaces link directly to the accelerated versions,
> > but without the ability to unplug them.
> >
> 
> As it turns out, we don't actually need static calls for this.
> Instead, we can simply permit weak symbol references to go unresolved
> between modules (as we already do in the kernel itself, due to the
> fact that ELF permits it), and have the accelerated code live in
> separate modules that may not be loadable on certain systems, or be
> blacklisted by the user.

You're saying that at module insertion time, the kernel will override
weak symbols with those provided by the module itself? At runtime?

Do you know offhand how this patching works? Is there a PLT that gets
patched, and so the calls all go through a layer of function pointer
indirection? Or are all call sites fixed up at insertion time and the
call instructions rewritten with some runtime patching magic?

Unless the method is the latter, I would assume that static calls are
much faster in general? Or the approach already in this series is
perhaps fine enough, and we don't need to break this apart into
individual modules complicating everything?


Re: [PATCH v2 02/20] crypto: x86/chacha - expose SIMD ChaCha routine as library function

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:16:55PM +0200, Ard Biesheuvel wrote:
> Wire the existing x86 SIMD ChaCha code into the new ChaCha library
> interface.
> 
> Signed-off-by: Ard Biesheuvel 
> ---
>  arch/x86/crypto/chacha_glue.c | 36 
>  crypto/Kconfig|  1 +
>  include/crypto/chacha.h   |  6 
>  3 files changed, 43 insertions(+)
> 
> diff --git a/arch/x86/crypto/chacha_glue.c b/arch/x86/crypto/chacha_glue.c
> index bc62daa8dafd..fd9ef42842cf 100644
> --- a/arch/x86/crypto/chacha_glue.c
> +++ b/arch/x86/crypto/chacha_glue.c
> @@ -123,6 +123,42 @@ static void chacha_dosimd(u32 *state, u8 *dst, const u8 
> *src,
>   }
>  }
>  
> +void hchacha_block(const u32 *state, u32 *stream, int nrounds)
> +{
> + state = PTR_ALIGN(state, CHACHA_STATE_ALIGN);
> +
> + if (!crypto_simd_usable()) {
> + hchacha_block_generic(state, stream, nrounds);
> + } else {
> + kernel_fpu_begin();
> + hchacha_block_ssse3(state, stream, nrounds);
> + kernel_fpu_end();
> + }
> +}
> +EXPORT_SYMBOL(hchacha_block);

Please correct me if I'm wrong:

The approach here is slightly different from Zinc. In Zinc, I had one
entry point that conditionally called into the architecture-specific
implementation, and I did it inline using #includes so that in some
cases it could be optimized out.

Here, you override the original symbol defined by the generic module
from the architecture-specific implementation, and in there you decide
which way to branch.

Your approach has the advantage that you don't need to #include a .c
file like I did, an ugly yet very effective approach.

But it has two disadvantages:

1. For architecture-specific code that _always_ runs, such as the
  MIPS32r2 implementation of chacha, the compiler no longer has an
  opportunity to remove the generic code entirely from the binary,
  which under Zinc resulted in a smaller module.

2. The inliner can't make optimizations for that call.

Disadvantage (2) might not make much of a difference. Disadvantage (1)
seems like a bigger deal. However, perhaps the linker is smart and can
remove the code and symbol? Or if not, is there a way to make the linker
smart? Or would all this require crazy LTO which isn't going to happen
any time soon?
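
To make the contrast concrete, the Zinc shape was roughly this
(paraphrasing my tree; see the links in earlier mails for the real
thing):

/* chacha20.c: the glue is #included, so chacha20_arch() can be
 * inlined; on MIPS it unconditionally returns true, so the compiler
 * proves the generic fallback dead and drops it from the binary. */
#if defined(CONFIG_ZINC_ARCH_X86_64)
#include "chacha20-x86_64-glue.c"
#elif defined(CONFIG_ZINC_ARCH_MIPS)
#include "chacha20-mips-glue.c"
#else
static inline bool chacha20_arch(struct chacha20_ctx *ctx, u8 *dst,
				 const u8 *src, size_t len,
				 simd_context_t *simd_context)
{
	return false;
}
#endif

void chacha20(struct chacha20_ctx *ctx, u8 *dst, const u8 *src,
	      size_t len, simd_context_t *simd_context)
{
	if (!chacha20_arch(ctx, dst, src, len, simd_context))
		chacha20_generic(ctx, dst, src, len);
}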


Re: [PATCH v2 01/20] crypto: chacha - move existing library code into lib/crypto

2019-10-04 Thread Jason A. Donenfeld
On Wed, Oct 02, 2019 at 04:16:54PM +0200, Ard Biesheuvel wrote:
>  
>   chacha_permute(x, nrounds);

Interested in porting my single-statement unrolled implementation from
Zinc to this? I did see performance improvements on various platforms.


Re: [PATCH v2 00/20] crypto: crypto API library interfaces for WireGuard

2019-10-04 Thread Jason A. Donenfeld
Hi Ard,

On Wed, Oct 02, 2019 at 04:16:53PM +0200, Ard Biesheuvel wrote:
> This is a followup to RFC 'crypto: wireguard with crypto API library 
> interface'
> [0]. Since no objections were raised to my approach, I've proceeded to fix up
> some minor issues, and incorporate [most of] the missing MIPS code.
> 
> Changes since RFC/v1:
> - dropped the WireGuard patch itself, and the followup patches - since the
>   purpose was to illustrate the extent of the required changes, there is no
>   reason to keep including them.
> - import the MIPS 32r2 versions of ChaCha and Poly1305, but expose both the
>   crypto API and library interfaces so that not only WireGuard but also IPsec
>   and Adiantum can benefit immediately. (The latter required adding support 
> for
>   the reduced round version of ChaCha to the MIPS asm code)
> - fix up various minor kconfig/build issues found in randconfig testing
>   (thanks Arnd!)

Thanks for working on this. By wiring up the accelerated primitives,
you're essentially implementing Zinc, and I expect that by the end of
this, we'll wind up with something pretty close to where I started, with
the critical difference that the directory names are different. Despite
my initial email about WireGuard moving to the crypto API, it sounds
like in the end it is likely to stay with Zinc, just without that name.

Jason


Re: [RFC PATCH 11/20] crypto: BLAKE2s - x86_64 implementation

2019-09-29 Thread Jason A. Donenfeld
Hi Sebastian, Thomas,

Take a look at the below snippet from this patch.

I had previously put quite some effort into the simd_get, simd_put,
simd_relax mechanism, so that the simd state could be persisted during
both several calls to the same function and within long loops like
below, with simd_relax existing to reenable preemption briefly if
things were getting out of hand. Ard got rid of this and has moved the
kernel_fpu_begin and kernel_fpu_end calls into the inner loop:

On Sun, Sep 29, 2019 at 7:39 PM Ard Biesheuvel
 wrote:
> +   for (;;) {
> +   const size_t blocks = min_t(size_t, nblocks,
> +   PAGE_SIZE / BLAKE2S_BLOCK_SIZE);
> +
> +   kernel_fpu_begin();
> +   if (IS_ENABLED(CONFIG_AS_AVX512) && blake2s_use_avx512)
> +   blake2s_compress_avx512(state, block, blocks, inc);
> +   else
> +   blake2s_compress_avx(state, block, blocks, inc);
> +   kernel_fpu_end();
> +
> +   nblocks -= blocks;
> +   if (!nblocks)
> +   break;
> +   block += blocks * BLAKE2S_BLOCK_SIZE;
> +   }
> +   return true;
> +}
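
For reference, the usage pattern this replaced looked roughly like
this in Zinc (paraphrasing; simd_get/simd_relax/simd_put as in the
Zinc tree):

simd_context_t simd_context;

simd_get(&simd_context);
while (nblocks) {
	const size_t blocks = min_t(size_t, nblocks,
				    PAGE_SIZE / BLAKE2S_BLOCK_SIZE);

	blake2s_compress(state, block, blocks, inc, &simd_context);
	block += blocks * BLAKE2S_BLOCK_SIZE;
	nblocks -= blocks;
	/* Briefly re-enables preemption if something is waiting,
	 * rather than ending the FPU section on every iteration. */
	simd_relax(&simd_context);
}
simd_put(&simd_context);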

I'm wondering if on modern kernels this is actually fine and whether
my simd_get/put/relax thing no longer has a good use case.
Specifically, I recall last year there were a lot of patches and
discussions about doing FPU register restoration lazily -- on context
switch or the like. Did those land? Did the theory of action work out
in the end?

Regards,
Jason


Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

2019-09-27 Thread Jason A. Donenfeld
Hey Andy,

Thanks for weighing in.

> inlining.  I'd be surprised for chacha20.  If you really want inlining
> to dictate the overall design, I think you need some real numbers for
> why it's necessary.  There also needs to be a clear story for how
> exactly making everything inline plays with the actual decision of
> which implementation to use.

Take a look at my description for the MIPS case: when on MIPS, the
arch code is *always* used since it's just straight up scalar
assembly. In this case, the chacha20_arch function *never* returns
false [1], which means it's always included [2], so the generic
implementation gets optimized out, saving disk and memory, which I
assume MIPS people care about.

[1] 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-mips-glue.c?h=jd/wireguard#n13
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n118

I'm fine with considering this a form of "premature optimization",
though, and ditching the motivation there.

On Thu, Sep 26, 2019 at 11:37 PM Andy Lutomirski  wrote:
> My suggestion from way back, which is at
> least a good deal of the way toward being doable, is to do static
> calls.  This means that the common code will call out to the arch code
> via a regular CALL instruction and will *not* inline the arch code.
> This means that the arch code could live in its own module, it can be
> selected at boot time, etc.

Alright, let's do static calls, then, to deal with the case of going
from the entry point implementation in lib/zinc (or lib/crypto, if you
want, Ard) to the arch-specific implementation in arch/${ARCH}/crypto.
And then within each arch, we can keep it simple, since everything is
already in the same directory.

Sound good?

Jason


Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

2019-09-26 Thread Jason A. Donenfeld
e MIPS
case, the advantage is clear. Here's how Zinc handles it: [3]. Some
simple ifdefs and includes. Easy and straightforward. Some people
might object, though, and it sounds like you might. So let's talk
about some alternative mechanisms with their pros and cons:

- The zinc method: straightforward, but not everybody likes ifdefs.
- Stick the filename to be included into a Kconfig variable and let
the configuration system do the logic: also straightforward. Not sure
it's kosher, but it would work.
- Weak symbols: we don't get inlining or the dead code elimination.
- Function pointers: ditto, plus spectre.
- Other ideas you might have? I'm open to suggestions here.

[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54
[3] 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20.c?h=jd/wireguard#n19

> What I *don't* want is to merge WireGuard with its own library based
> crypto now, and extend that later for async accelerators once people
> realize that we really do need that as well.

I wouldn't worry so much about that. Zinc/library-based-crypto is just
trying to fulfill the boring synchronous pure-code part of crypto
implementations. For the async stuff, we can work together on
improving the existing crypto API to be more appealing, in tandem with
some interesting research into packet queuing systems. From the other
thread, you might have seen that already Toke has cool ideas that I
hope we can all sit down and talk about. I'm certainly not interested
in "bolting" anything on to Zinc/library-based-crypto, and I'm happy
to work to improve the asynchronous API for doing asynchronous crypto.

Jason


chapoly acceleration hardware [Was: Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API]

2019-09-26 Thread Jason A. Donenfeld
[CC +willy, toke, dave, netdev]

Hi Pascal

On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
 wrote:
> Actually, that assumption is factually wrong. I don't know if anything
> is *publicly* available, but I can assure you the silicon is running in
> labs already. And something will be publicly available early next year
> at the latest. Which could nicely coincide with having Wireguard support
> in the kernel (which I would also like to see happen BTW) ...
>
> Not "at some point". It will. Very soon. Maybe not in consumer or server
> CPUs, but definitely in the embedded (networking) space.
> And it *will* be much faster than the embedded CPU next to it, so it will
> be worth using it for something like bulk packet encryption.

Super! I was wondering if you could speak a bit more about the
interface. My biggest questions surround latency. Will it be
synchronous or asynchronous? If the latter, why? What will its
latencies be? How deep will its buffers be? The reason I ask is that a
lot of crypto acceleration hardware of the past has been fast and
having very deep buffers, but at great expense of latency. In the
networking context, keeping latency low is pretty important. Already
WireGuard is multi-threaded which isn't super great all the time for
latency (improvements are a work in progress). If you're involved with
the design of the hardware, perhaps this is something you can help
ensure winds up working well? For example, AES-NI is straightforward
and good, but Intel can do that because they are the CPU. It sounds
like your silicon will be adjacent. How do you envision this working
in a low latency environment?

Jason


Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

2019-09-26 Thread Jason A. Donenfeld
On Thu, Sep 26, 2019 at 12:19 PM Pascal Van Leeuwen
 wrote:
> Actually, that assumption is factually wrong. I don't know if anything
> is *publicly* available, but I can assure you the silicon is running in
> labs already.

Great to hear, and thanks for the information. I'll follow up with
some questions on this in another thread.


Re: [RFC PATCH 00/18] crypto: wireguard using the existing crypto API

2019-09-26 Thread Jason A. Donenfeld
preferences, and we can see what that looks like. Less appealing would
be to do several iterations of you reworking Zinc from scratch and
going through the exercises all over again, but if you prefer that I
guess I could cope. Alternatively, maybe this is a lot to chew on, and
we should just throw caution into the wind, implement cryptoapi.c for
WireGuard (as described at the top), and add C functions to the crypto
API sometime later? This is what I had envisioned in [1].

And for the avoidance of doubt, or in case any of the above message
belied something different, I really am happy and relieved to have an
opportunity to work on this _with you_, and I am much more open than
before to compromise and finding practical solutions to the past
political issues. Also, if you’re into chat, we can always spec some
of the nitty-gritty aspects out over IRC or even the old-fashioned
telephone. Thanks again for pushing this forward.

Regards,
Jason

[1] 
https://lore.kernel.org/wireguard/cahmme9pmfzap5zd9bdlfc2fwuhtzzcjyzc2attptynffmed...@mail.gmail.com/
[2] 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/tree/lib/zinc/chacha20/chacha20-x86_64-glue.c?h=jd/wireguard#n54


Re: [PATCH v2] net: ipv4: move tcp_fastopen server side code to SipHash library

2019-06-14 Thread Jason Baron



On 6/14/19 11:34 AM, Eric Dumazet wrote:
> On Fri, Jun 14, 2019 at 7:01 AM Ard Biesheuvel
>  wrote:
>>
>> Using a bare block cipher in non-crypto code is almost always a bad idea,
>> not only for security reasons (and we've seen some examples of this in
>> the kernel in the past), but also for performance reasons.
>>
>> In the TCP fastopen case, we call into the bare AES block cipher one or
>> two times (depending on whether the connection is IPv4 or IPv6). On most
>> systems, this results in a call chain such as
>>
>>   crypto_cipher_encrypt_one(ctx, dst, src)
>> crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
>>   aesni_encrypt
>> kernel_fpu_begin();
>> aesni_enc(ctx, dst, src); // asm routine
>> kernel_fpu_end();
>>
>> It is highly unlikely that the use of special AES instructions has a
>> benefit in this case, especially since we are doing the above twice
>> for IPv6 connections, instead of using a transform which can process
>> the entire input in one go.
>>
>> We could switch to the cbcmac(aes) shash, which would at least get
>> rid of the duplicated overhead in *some* cases (i.e., today, only
>> arm64 has an accelerated implementation of cbcmac(aes), while x86 will
>> end up using the generic cbcmac template wrapping the AES-NI cipher,
>> which basically ends up doing exactly the above). However, in the given
>> context, it makes more sense to use a light-weight MAC algorithm that
>> is more suitable for the purpose at hand, such as SipHash.
>>
>> Since the output size of SipHash already matches our chosen value for
>> TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
>> sizes, this greatly simplifies the code as well.
>>
>> Signed-off-by: Ard Biesheuvel 
> 
> While the patch looks fine (I yet have to run our tests with it), it
> might cause some deployment issues
> for server farms.
> 
> They usually share a common fastopen key, so that clients can reuse
> the same token for different sessions.
> 
> Changing some servers in the pool will lead to inconsistencies.

The inconsistencies coming from kernel version skew with some servers
being on the old hash and some on the newer one? Or is there another
source for the inconsistency you are referring to?

Thanks,

-Jason

> 
> Probably not a too big deal, but worth mentioning.
> 
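
(For context, the cookie generation after this patch boils down to
roughly the following -- a sketch, not the merged code:)

#include <linux/siphash.h>

static void tfo_cookie_gen_v4(const siphash_key_t *key,
			      const struct iphdr *iph,
			      struct tcp_fastopen_cookie *foc)
{
	/* One SipHash over (saddr, daddr) under the server-wide key;
	 * the 64-bit output is exactly TCP_FASTOPEN_COOKIE_SIZE. */
	u64 h = siphash(&iph->saddr,
			sizeof(iph->saddr) + sizeof(iph->daddr), key);

	memcpy(foc->val, &h, sizeof(h));
	foc->len = TCP_FASTOPEN_COOKIE_SIZE;
}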


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-23 Thread Jason Gunthorpe
On Fri, Nov 23, 2018 at 04:02:42PM +0800, Kenneth Lee wrote:

> It is already part of Jean's patchset. And that's why I built my solution on
> VFIO in the first place. But I think the concept of SVA and PASID is not
> compatible with the original VFIO concept space. You would not share your 
> whole
> address space to a device at all in a virtual machine manager,
> wouldn't you?

Why not? That seems to fit VFIO's space just fine to me.. You might
need a new upcall to create a full MM registration, but that doesn't
seem unsuited.

Part of the point here is you should try to make sensible revisions to
existing subsystems before just inventing a new thing...

VFIO is deeply connected to the IOMMU, so enabling more general IOMMU
based approaches seems perfectly fine to me.

> > Once the VFIO driver knows about this as a generic capability then the
> > device it exposes to userspace would use CPU addresses instead of DMA
> > addresses.
> > 
> > The question is if your driver needs much more than the device
> > agnostic generic services VFIO provides.
> > 
> > I'm not sure what you have in mind with resource management.. It is
> > hard to revoke resources from userspace, unless you are doing
> > kernel syscalls, but then why do all this?
> 
> Say, I have 1024 queues in my accelerator. I can get one by opening the
> device and attaching it to the fd. If the process exits by any means, the
> queue is returned when the fd is released. But if it is an mdev, it will
> still be there, and someone has to tell the allocator it is available
> again. This is not easy to design in user space.

?? Why wouldn't the mdev track the assigned queues using the existing
open/close/ioctl callbacks?
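
i.e., roughly like the sketch below (the acc_* names are made up):

#include <linux/fs.h>
#include <linux/module.h>

struct acc_queue;				/* illustrative */
void acc_queue_return(struct acc_queue *q);	/* hands the queue back */

/* Sketch: the queue's lifetime follows the fd, so a process exiting
 * (and thus closing the fd) returns the queue automatically. */
static int acc_release(struct inode *inode, struct file *filp)
{
	acc_queue_return(filp->private_data);
	return 0;
}

static const struct file_operations acc_fops = {
	.owner   = THIS_MODULE,
	.release = acc_release,
};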

That is the basic flow I would expect:

 open(/dev/vfio)
 ioctl(unity map entire process MM to mdev with IOMMU)

 // Create a HW queue and link the PASID in the HW to this HW queue
 struct hw_queue queue[..];
 ioctl(create HW queue)

 // Get BAR doorbell memory for the queue
 bar = mmap()

 // Submit work to the queue using CPU addresses
 queue[0] = ...
 writel(..., &bar[..]);  // ring the doorbell for the queue

 // Queue, SVA, etc is cleaned up when the VFIO closes
 close()

Presumably the kernel has to handle the PASID and related for security
reasons, so they shouldn't go to userspace?

If there is something missing in vfio to do this, it looks pretty
small to me..
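
For comparison, the userspace half of the existing type1 VFIO uapi is
roughly the following (a sketch with error handling omitted; the group
number and mapping size are made up):

#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);  /* group number made up */
	struct vfio_iommu_type1_dma_map map;
	void *buf;

	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* Pin and map one page of process memory for device DMA. */
	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	memset(&map, 0, sizeof(map));
	map.argsz = sizeof(map);
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
	map.vaddr = (uintptr_t)buf;
	map.iova = 0;
	map.size = 4096;
	ioctl(container, VFIO_IOMMU_MAP_DMA, &map);

	return 0;
}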

Jason


Re: [PATCH net-next] cxgb4: use new fw interface to get the VIN and smt index

2018-11-22 Thread Jason Gunthorpe
On Thu, Nov 22, 2018 at 10:53:49AM -0800, David Miller wrote:
> From: Jason Gunthorpe 
> Date: Wed, 21 Nov 2018 19:46:24 -0700
> 
> > On Wed, Nov 21, 2018 at 01:40:24PM +0530, Ganesh Goudar wrote:
> >> From: Santosh Rastapur 
> >> 
> >> If the fw supports returning VIN/VIVLD in FW_VI_CMD, save it
> >> in the port_info structure; otherwise retrieve these from the viid
> >> and save them in the port_info structure. Do the same for the
> >> smt_idx from FW_VI_MAC_CMD.
> >> 
> >> Signed-off-by: Santosh Rastapur 
> >> Signed-off-by: Ganesh Goudar 
> >>  drivers/crypto/chelsio/chtls/chtls_cm.c |  3 +-
> >>  drivers/infiniband/hw/cxgb4/cm.c|  6 +--
> >>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  | 12 -
> >>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 58 -
> >>  drivers/net/ethernet/chelsio/cxgb4/l2t.c| 13 +++---
> >>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  | 46 ++--
> >>  drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   | 20 +
> >>  drivers/scsi/cxgbi/cxgb4i/cxgb4i.c  |  3 +-
> >>  drivers/target/iscsi/cxgbit/cxgbit_cm.c |  8 ++--
> >>  9 files changed, 114 insertions(+), 55 deletions(-)
> > 
> > Applied to for-next, but please try to write better commit messages in
> > the future, explaining what benefit your change is bringing.
> 
> The subject line indicates this is targeting my net-next tree, therefore
> why did you apply it to yours?

It was my mistake; it ended up in the RDMA patchworks next to other RDMA
chelsio patches, and contained an IB component..

It is dropped from rdma.git now.

Thanks,
Jason


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-21 Thread Jason Gunthorpe
On Wed, Nov 21, 2018 at 02:08:05PM +0800, Kenneth Lee wrote:

> > But considering Jean's SVA stuff seems based on mmu notifiers, I have
> > a hard time believing that it has any different behavior from RDMA's
> > ODP, and if it does have different behavior, then it is probably just
> > a bug in the ODP implementation.
> 
> As Jean has explained, his solution is based on page table sharing. I
> think ODP should also consider this new feature.

Shared page tables would require the HW to walk the CPU's page table
format directly; I'm not sure how that would be possible for ODP.

Presumably the implementation for ARM relies on the IOMMU hardware
doing this?

> > > > If all your driver needs is to mmap some PCI bar space, route
> > > > interrupts and do DMA mapping then mediated VFIO is probably a good
> > > > choice. 
> > > 
> > > Yes. That is what is done in our RFCv1/v2. But we accepted Jerome's
> > > opinion and try not to add complexity to the mm subsystem.
> > 
> > Why would a mediated VFIO driver touch the mm subsystem? Sounds like
> > you don't have a VFIO driver if it needs to do stuff like that...
> 
> VFIO has no ODP-like solution, and if we want to solve the fork problem, we 
> have
> to make some change to iommu and the fork procedure. Further, VFIO takes every
> queue as a independent device. This create a lot of trouble on resource
> management. For example, you will need a manager process to withdraw the 
> unused
> device and you need to let the user process know about PASID of the queue, and
> so on.

Well, I would think you'd add SVA support to the VFIO driver as a
generic capability - it seems pretty useful for any VFIO user as it
avoids all the kernel upcalls to do memory pinning and DMA address
translation.

Once the VFIO driver knows about this as a generic capability then the
device it exposes to userspace would use CPU addresses instead of DMA
addresses.

The question is if your driver needs much more than the device
agnostic generic services VFIO provides.

I'm not sure what you have in mind with resource management.. It is
hard to revoke resources from userspace, unless you are doing
kernel syscalls, but then why do all this?

Jason


Re: [PATCH net-next] cxgb4: use new fw interface to get the VIN and smt index

2018-11-21 Thread Jason Gunthorpe
On Wed, Nov 21, 2018 at 01:40:24PM +0530, Ganesh Goudar wrote:
> From: Santosh Rastapur 
> 
> If the fw supports returning VIN/VIVLD in FW_VI_CMD, save it
> in the port_info structure; otherwise retrieve these from the viid
> and save them in the port_info structure. Do the same for the
> smt_idx from FW_VI_MAC_CMD.
> 
> Signed-off-by: Santosh Rastapur 
> Signed-off-by: Ganesh Goudar 
>  drivers/crypto/chelsio/chtls/chtls_cm.c |  3 +-
>  drivers/infiniband/hw/cxgb4/cm.c|  6 +--
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4.h  | 12 -
>  drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 58 -
>  drivers/net/ethernet/chelsio/cxgb4/l2t.c| 13 +++---
>  drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  | 46 ++--
>  drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   | 20 +
>  drivers/scsi/cxgbi/cxgb4i/cxgb4i.c  |  3 +-
>  drivers/target/iscsi/cxgbit/cxgbit_cm.c |  8 ++--
>  9 files changed, 114 insertions(+), 55 deletions(-)

Applied to for-next, but please try to write better commit messages in
the future, explaining what benefit your change is bringing.

Jason


Re: [PATCH 0/6] crypto: x86/chacha20 - SIMD performance improvements

2018-11-20 Thread Jason A. Donenfeld
Hi Martin,

On Tue, Nov 20, 2018 at 5:29 PM Martin Willi  wrote:
> Thanks for the offer, no need at this time. But I certainly would
> welcome it if you could do some (WireGuard) benchmarking with that code
> to see if it works for you.

I certainly will test it in a few different network circumstances,
especially since real testing like this is sometimes more telling than
busy-loop benchmarks.

> > Actually, similarly here, a 10nm Cannon Lake machine should be
> > arriving at my house this week, which should make for some
> > interesting testing ground for non-throttled zmm, if you'd like to
> > play with it.
>
> Maybe in a future iteration, thanks. In fact it would be interesting
> to know whether Cannon Lake can handle that throttling better.

Everything I've read on the Internet seems to indicate that's the
case, so one of the first things I'll be doing is seeing if that's
true. There are also the AVX512 IFMA instructions to play with!

Jason


Re: [RFC PATCH v2 0/4] Exporting existing crypto API code through zinc

2018-11-20 Thread Jason A. Donenfeld
On Tue, Nov 20, 2018 at 3:19 PM Herbert Xu  wrote:
> Yes.  In fact it's used for FIPS certification testing.
> Sure, nobody sane should be doing it.  But when it comes to
> government certification... :)

The kernel does not aim toward any FIPS certification, and we're not
going to start bloating our designs to fulfill this. It's never been a
goal. Maybe ask Ted to add a FIPS mode to random.c and see what
happens... When you start arguing "because FIPS!" as your
justification, you really hit a head scratcher.

> They've already paid for the indirect
> function call so why make them go through yet another run-time
> branch?

The indirect function call in the crypto API is the performance hit.
The branch in Zinc is not, as the predictor does the correct thing
every single time. I'm not able to distinguish between the two looking
at the performance measurements between it being there and the branch
being commented out.
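
Concretely, the dispatch is just a direct branch on a boolean set once at
init time (a sketch; the function names are illustrative):

#include <linux/cache.h>
#include <linux/types.h>

static void chacha20_arch(u8 *out, const u8 *in, size_t len,
			  const u32 key[8]);		/* illustrative */
static void chacha20_generic(u8 *out, const u8 *in, size_t len,
			     const u32 key[8]);		/* illustrative */

/* Set once during init from CPU feature detection. */
static bool have_simd __read_mostly;

static void chacha20(u8 *out, const u8 *in, size_t len, const u32 key[8])
{
	if (have_simd)	/* resolved correctly by the predictor every time */
		chacha20_arch(out, in, len, key);
	else
		chacha20_generic(out, in, len, key);
}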

Give me a few more days to finish v9's latest required changes for
chacha12, and then I'll submit a revision that I think should address
the remaining technical objections raised over the last several months
we've been discussing this.

Regards,
Jason


Re: [RFC PATCH v2 0/4] Exporting existing crypto API code through zinc

2018-11-20 Thread Jason A. Donenfeld
Hi Herbert,

On Tue, Nov 20, 2018 at 7:02 AM Herbert Xu  wrote:
> Here is an updated version demonstrating how we could access the
> accelerated versions of chacha20.  It also includes a final patch
> to deposit the new zinc version of x86-64 chacha20 into
> arch/x86/crypto where it can be used by both the crypto API as well
> as zinc.

N'ack. As I mentioned in the last email, this really isn't okay, and
mostly defeats the purpose of Zinc in the first place, adding
complexity in a discombobulated half-baked way. Your proposal seems to
be the worst of all worlds. Having talked to lots of people about the
design of Zinc at this point to very positive reception, I'm going to
move forward with submitting v9, once I rebase with the required
changes for Eric's latest merge.

Jason


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-19 Thread Jason Gunthorpe
On Tue, Nov 20, 2018 at 11:07:02AM +0800, Kenneth Lee wrote:
> On Mon, Nov 19, 2018 at 11:49:54AM -0700, Jason Gunthorpe wrote:
> > 
> > On Mon, Nov 19, 2018 at 05:14:05PM +0800, Kenneth Lee wrote:
> >  
> > > If the hardware cannot share a page table with the CPU, we then need
> > > some way to change the device page table. This is what happens in ODP:
> > > it invalidates the page table in the device upon an mmu_notifier
> > > callback. But this cannot solve the COW problem: if the user process A
> > > shares a page P with the device, and A forks a new process B and
> > > continues to write to the page, then by COW the process B will keep
> > > the page P, while A will get a new page P'. But you have no way to
> > > let the device know it should use P' rather than P.
> > 
> > Is this true? I thought mmu_notifiers covered all these cases.
> > 
> > The mmu_notifier for A should fire if B causes the physical address of
> > A's pages to change via COW.
> > 
> > And this causes the device page tables to re-synchronize.
> 
> I don't see such code. The current do_cow_fault() implementation has
> nothing to do with mmu_notifier.

Well, that sure sounds like it would be a bug in mmu_notifiers..

But considering Jean's SVA stuff seems based on mmu notifiers, I have
a hard time believing that it has any different behavior from RDMA's
ODP, and if it does have different behavior, then it is probably just
a bug in the ODP implementation.
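
The hook being relied on looks roughly like this in recent kernels (a
sketch; everything except the mmu_notifier API names is made up):

#include <linux/mmu_notifier.h>
#include <linux/sched.h>

/* Made-up device hook: drop device translations for [start, end). */
static void demo_dev_unmap(unsigned long start, unsigned long end) { }

/* The driver hears about COW and other mapping changes here and
 * invalidates the device's copy of the translations. */
static int demo_invalidate_range_start(struct mmu_notifier *mn,
				       const struct mmu_notifier_range *range)
{
	demo_dev_unmap(range->start, range->end);
	return 0;
}

static const struct mmu_notifier_ops demo_mn_ops = {
	.invalidate_range_start = demo_invalidate_range_start,
};

static struct mmu_notifier demo_mn = { .ops = &demo_mn_ops };

/* Register against the calling process's mm. */
static int demo_bind_current_mm(void)
{
	return mmu_notifier_register(&demo_mn, current->mm);
}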

> > > In WarpDrive/uacce, we make this simple. If you support an IOMMU and
> > > it supports SVM/SVA, everything will be fine, just like ODP implicit
> > > mode, and you don't need to write any code for that, because it has
> > > been done by the IOMMU framework. If it
> > 
> > Looks like the IOMMU code uses mmu_notifier, so it is identical to
> > IB's ODP. The only difference is that IB tends to have the IOMMU page
> > table in the device, not in the CPU.
> > 
> > The only case I know of that is different is the new-fangled CAPI
> > stuff where the IOMMU can directly use the CPU's page table and the
> > IOMMU page table (in device or CPU) is eliminated.
>
> Yes. We are not focusing on the current implementation. As mentioned in
> the cover letter, we are expecting Jean-Philippe's SVA patches:
> git://linux-arm.org/linux-jpb.

This SVA stuff does not look comparable to CAPI, as it still requires
maintaining separate IOMMU page tables.

Also, those patches from Jean have a lot of references to
mmu_notifiers (e.g. look at iommu_mmu_notifier).

Are you really sure it is actually any different at all?

> > Anyhow, I don't think a single instance of hardware should justify an
> > entire new subsystem. Subsystems are hard to make and without multiple
> > hardware examples there is no way to expect that it would cover any
> > future use cases.
> 
> Yes. That was our first expectation. We could keep it with our driver, but
> there is no user driver support for any accelerator in the mainline kernel;
> even the well-known QuickAssist has to be maintained out of tree. So we are
> trying to see if people are interested in working together to solve the
> problem.

Well, you should come with patches ack'ed by these other groups.

> > If all your driver needs is to mmap some PCI bar space, route
> > interrupts and do DMA mapping then mediated VFIO is probably a good
> > choice. 
> 
> Yes. That is what is done in our RFCv1/v2. But we accepted Jerome's opinion 
> and
> try not to add complexity to the mm subsystem.

Why would a mediated VFIO driver touch the mm subsystem? Sounds like
you don't have a VFIO driver if it needs to do stuff like that...

Jason

Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code

2018-11-19 Thread Jason A. Donenfeld
Hi Herbert,

On Tue, Nov 20, 2018 at 4:06 AM Herbert Xu  wrote:
> > I'd still prefer to see the conversion patches included.  Skipping them
> > would be kicking the can down the road and avoiding issues that will
> > need to be addressed anyway.  Like you, I don't want a "half-baked
> > concoction that will maybe possibly be replaced 'later'" :-)
>
> Are you guys talking about the conversion patches to eliminate the
> two copies of chacha code that would otherwise exist in crypto as
> well as in zinc?
>
> If so I think that's not negotiable.  Having two copies of the same
> code is just not acceptable.

Yes, that's the topic. Eric already expressed his preference to keep
them, and I agreed, so the plan is to keep them.

Jason


Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code

2018-11-19 Thread Jason A. Donenfeld
Hi Eric,

On Tue, Nov 20, 2018 at 12:23 AM Eric Biggers  wrote:
> It's much better to have the documentation in a permanent location.

Agreed.

> I actually did add ChaCha12 support to most of the Zinc assembly in
> "[WIP] crypto: assembly support for ChaCha12"
> (https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/commit/?h=adiantum-zinc&id=0a7787a515a977e11b680f1752b430ca1744e399).
> But I skipped AVX-512 and MIPS since I didn't have a way to test those yet,
> and I haven't ported the changes to your new perl scripts yet.

Oh, great. If you're playing with this in the context of the WireGuard
repo, you can run `ARCH=mips make test-qemu -j$(nproc)` (which is
what's running on https://www.wireguard.com/build-status/ ). For
AVX-512 testing, I've got a box I'm happy to give you access to, if
you want to send an SSH key and your employer allows for that kind of
thing etc. I can have a stab at all of this too if you like.

Jason


Re: [RFC PATCH] zinc chacha20 generic implementation using crypto API code

2018-11-19 Thread Jason A. Donenfeld
Hi Eric,

On Mon, Nov 19, 2018 at 11:54 PM Eric Biggers  wrote:
> Will v9 include a documentation file for Zinc in Documentation/crypto/?
> That's been suggested several times.

I had started writing that there, but then thought that the requested
information could go in the commit message instead. But I'm guessing
you're asking again now because you poked into the repo and didn't
find the Documentation/, so presumably you still want it. I can
reorganize the presentation of that to be more suitable for
Documentation/, and I'll have that for v9.

> I'd still prefer to see the conversion patches included.  Skipping them
> would be kicking the can down the road and avoiding issues that will need
> to be addressed anyway.  Like you, I don't want a "half-baked concoction
> that will maybe possibly be replaced 'later'" :-)

Okay, fair enough. Will do.

> Either way though, it would make things much easier if you at least named the
> files, structures, constants, etc. "ChaCha" rather than "ChaCha20" from the
> start where appropriate.  For an example, see the commit "crypto: chacha -
> prepare for supporting non-20-round variants" on my "adiantum-zinc" branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/linux.git/commit/?h=adiantum-zinc&id=754af8d7d39f31238114426e39786c84d7cc0f98
> Then the actual introduction of the 12-round variant is much less noisy.

That's a good idea. I'll do it like that. I'll likely order it as what
we have now (renamed to omit the 20), and then put the 12 stuff on top
of that, so it's easier to see what's changed in the process. I
noticed that in that branch you didn't port the assembly to support
fewer rounds. Shall I follow suit, and then expect patches from you
later doing that? Or were you expecting me to port the architecture
implementations to chacha12 as well?
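
For context, the rename mostly amounts to threading a round count through
the core permutation, roughly as below (a sketch of the public reference
algorithm; rol32() is the kernel helper from <linux/bitops.h>):

#include <linux/bitops.h>
#include <linux/types.h>

/* One ChaCha quarter round. */
#define QR(a, b, c, d) do {			\
	a += b; d ^= a; d = rol32(d, 16);	\
	c += d; b ^= c; b = rol32(b, 12);	\
	a += b; d ^= a; d = rol32(d, 8);	\
	c += d; b ^= c; b = rol32(b, 7);	\
} while (0)

/* nrounds is 20 or 12; each iteration is one double round. */
static void chacha_permute(u32 x[16], int nrounds)
{
	int i;

	for (i = 0; i < nrounds; i += 2) {
		/* column rounds */
		QR(x[0], x[4], x[8], x[12]);
		QR(x[1], x[5], x[9], x[13]);
		QR(x[2], x[6], x[10], x[14]);
		QR(x[3], x[7], x[11], x[15]);
		/* diagonal rounds */
		QR(x[0], x[5], x[10], x[15]);
		QR(x[1], x[6], x[11], x[12]);
		QR(x[2], x[7], x[8], x[13]);
		QR(x[3], x[4], x[9], x[14]);
	}
}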

Regards,
Jason


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-19 Thread Jason Gunthorpe
On Mon, Nov 19, 2018 at 04:33:20PM -0500, Jerome Glisse wrote:
> On Mon, Nov 19, 2018 at 02:26:38PM -0700, Jason Gunthorpe wrote:
> > On Mon, Nov 19, 2018 at 03:26:15PM -0500, Jerome Glisse wrote:
> > > On Mon, Nov 19, 2018 at 01:11:56PM -0700, Jason Gunthorpe wrote:
> > > > On Mon, Nov 19, 2018 at 02:46:32PM -0500, Jerome Glisse wrote:
> > > > 
> > > > > > ?? How can O_DIRECT be fine but RDMA not? They use exactly the same
> > > > > > get_user_pages flow, right? Can we do what O_DIRECT does in RDMA and
> > > > > > be fine too?
> > > > > > 
> > > > > > AFAIK the only difference is the length of the race window. You'd
> > > > > > have to fork and fault during the shorter time O_DIRECT has
> > > > > > get_user_pages open.
> > > > > 
> > > > > Well in O_DIRECT case there is only one page table, the CPU
> > > > > page table and it gets updated during fork() so there is an
> > > > > ordering there and the race window is small.
> > > > 
> > > > Not really, in O_DIRECT case there is another 'page table', we just
> > > > call it a DMA scatter/gather list and it is sent directly to the block
> > > > device's DMA HW. The sgl plays exactly the same role as the various HW
> > > > page list data structures that underlie RDMA MRs.
> > > > 
> > > > It is not a page table that matters here, it is if the DMA address of
> > > > the page is active for DMA on HW.
> > > > 
> > > > Like you say, the only difference is that the race is hopefully small
> > > > with O_DIRECT (though that is not really small, NVMeof for instance
> > > > has windows as large as connection timeouts, if you try hard enough)
> > > > 
> > > > So we probably can trigger this trouble with O_DIRECT and fork(), and
> > > > I would call it a bug :(
> > > 
> > > I cannot think of any scenario that would be a bug with O_DIRECT.
> > > Do you have one in mind? When you fork() and do other syscalls that
> > > affect the memory of your process in another thread, you should
> > > expect inconsistent results. The kernel is not here to provide a
> > > fully safe environment to the user; the user can shoot itself in
> > > the foot, and that's fine as long as it only affects the process
> > > itself and no one else. We should not be in the business of making
> > > everything baby-proof :)
> > 
> > Sure, I set up AIO with O_DIRECT and launch a read.
> > 
> > Then I fork and dirty the READ target memory using the CPU in the
> > child.
> > 
> > As you described, in this case the fork will retain the physical page
> > that is undergoing O_DIRECT DMA, and the parent gets a new copied page.
> > 
> > The DMA completes, and the child gets the DMA'd-to page. The parent
> > gets an unchanged copied page.
> > 
> > The parent gets the AIO completion, but can't see the data.
> > 
> > I'd call that a bug with O_DIRECT. The only correct outcome is that
> > the parent will always see the O_DIRECT data. Fork should not cause
> > the *parent* to malfunction. I agree the child cannot predict what
> > memory it will see.
> > 
> > I assume the same flow is possible using threads and read()..
> > 
> > It is really no different than the RDMA bug with fork.
> > 
> 
> Yes, and that's expected behavior :) If you fork() and have anything
> still in flight at the time of fork that can change your process address
> space (including data in it), then all bets are off.
> 
> At least this is my reading of the fork() syscall.

Not mine.. I can't think of anything else that would have this
behavior.

All traditional syscalls will properly dirty the pages of the
parent. I.e., if I call read() in one thread and fork() in another
thread, then not seeing the data after read() completes is clearly a
bug. All other syscalls are the same.

It is bonkers that opening the file with O_DIRECT would change this
basic behavior. I'm calling it a bug :)

Jason


Re: [RFCv3 PATCH 1/6] uacce: Add documents for WarpDrive/uacce

2018-11-19 Thread Jason Gunthorpe
On Mon, Nov 19, 2018 at 03:26:15PM -0500, Jerome Glisse wrote:
> On Mon, Nov 19, 2018 at 01:11:56PM -0700, Jason Gunthorpe wrote:
> > On Mon, Nov 19, 2018 at 02:46:32PM -0500, Jerome Glisse wrote:
> > 
> > > > ?? How can O_DIRECT be fine but RDMA not? They use exactly the same
> > > > get_user_pages flow, right? Can we do what O_DIRECT does in RDMA and
> > > > be fine too?
> > > > 
> > > > AFAIK the only difference is the length of the race window. You'd have
> > > > to fork and fault during the shorter time O_DIRECT has get_user_pages
> > > > open.
> > > 
> > > Well in O_DIRECT case there is only one page table, the CPU
> > > page table and it gets updated during fork() so there is an
> > > ordering there and the race window is small.
> > 
> > Not really, in O_DIRECT case there is another 'page table', we just
> > call it a DMA scatter/gather list and it is sent directly to the block
> > device's DMA HW. The sgl plays exactly the same role as the various HW
> > page list data structures that underlie RDMA MRs.
> > 
> > It is not a page table that matters here, it is if the DMA address of
> > the page is active for DMA on HW.
> > 
> > Like you say, the only difference is that the race is hopefully small
> > with O_DIRECT (though that is not really small, NVMeof for instance
> > has windows as large as connection timeouts, if you try hard enough)
> > 
> > So we probably can trigger this trouble with O_DIRECT and fork(), and
> > I would call it a bug :(
> 
> I cannot think of any scenario that would be a bug with O_DIRECT.
> Do you have one in mind? When you fork() and do other syscalls that
> affect the memory of your process in another thread, you should
> expect inconsistent results. The kernel is not here to provide a fully
> safe environment to the user; the user can shoot itself in the foot,
> and that's fine as long as it only affects the process itself and no
> one else. We should not be in the business of making everything
> baby-proof :)

Sure, I set up AIO with O_DIRECT and launch a read.

Then I fork and dirty the READ target memory using the CPU in the
child.

As you described, in this case the fork will retain the physical page
that is undergoing O_DIRECT DMA, and the parent gets a new copied page.

The DMA completes, and the child gets the DMA'd-to page. The parent
gets an unchanged copied page.

The parent gets the AIO completion, but can't see the data.

I'd call that a bug with O_DIRECT. The only correct outcome is that
the parent will always see the O_DIRECT data. Fork should not cause
the *parent* to malfunction. I agree the child cannot predict what
memory it will see.

I assume the same flow is possible using threads and read()..

It is really no different than the RDMA bug with fork.
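
For concreteness, that sequence in userspace looks roughly like the sketch
below (illustrative; uses libaio, the file name and sizes are made up, and
the observed outcome depends on the kernel's COW handling of pinned pages):

#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	int fd = open("datafile", O_RDONLY | O_DIRECT);
	io_context_t ctx = 0;
	struct iocb cb, *cbp = &cb;
	struct io_event ev;
	void *buf;

	posix_memalign(&buf, 4096, 4096);	/* O_DIRECT needs alignment */
	io_setup(1, &ctx);
	io_prep_pread(&cb, fd, buf, 4096, 0);
	io_submit(ctx, 1, &cbp);		/* DMA to buf is now in flight */

	if (fork() == 0) {			/* child dirties the READ target */
		memset(buf, 0xff, 4096);	/* COW breaks while DMA is pending */
		_exit(0);
	}

	io_getevents(ctx, 1, 1, &ev, NULL);	/* AIO completion */
	wait(NULL);
	/* Per the scenario above, the parent can now be looking at a
	 * copied page rather than the page the DMA landed in. */
	printf("first byte: %02x\n", *(unsigned char *)buf);
	return 0;
}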

Jason

