Re: [PATCH v2 03/12] soc: fsl: dpio: add frame list format support

2018-09-14 Thread Li Yang
On Wed, Sep 12, 2018 at 4:02 AM Horia Geantă  wrote:
>
> Add support for dpaa2_fd_list format, i.e. dpaa2_fl_entry structure
> and accessors.
>
> Frame list entries (FLEs) are similar, but not identical to FDs:
> + "F" (final) bit
> - FMT[b'01] is reserved
> - DD, SC, DROPP bits (covered by "FD compatibility" field in FLE case)
> - FLC[5:0] not used for stashing
>
> Signed-off-by: Horia Geantă 

Acked-by: Li Yang 

> ---
>  include/soc/fsl/dpaa2-fd.h | 242 +
>  1 file changed, 242 insertions(+)
>


[PATCH 1/2] Fix static checker warning

2018-09-14 Thread Janakarajan Natarajan
Under certain configurations, the SEV functions can be defined as no-ops.
In such a case, 'error' can be used uninitialized.

Initialize the variable to 0.
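
As an illustration of the problem (a minimal sketch, not the actual ccp code;
the stub and caller below are hypothetical), a no-op SEV helper can return an
error without ever writing through the 'error' pointer, so the caller must not
read it unless it was initialized:

/* Hypothetical no-op stub: note that *error is never written. */
static inline int sev_platform_status_stub(void *status, int *error)
{
        return -ENODEV;
}

static int example_caller(void)
{
        int error = 0, ret;     /* initialized, as the patch does */

        ret = sev_platform_status_stub(NULL, &error);
        if (ret)
                return error;   /* safe only because 'error' was set to 0 */
        return 0;
}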

Cc: Dan Carpenter 
Reported-by: Dan Carpenter 
Signed-off-by: Janakarajan Natarajan 
---
 drivers/crypto/ccp/psp-dev.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/crypto/ccp/psp-dev.c b/drivers/crypto/ccp/psp-dev.c
index 72790d8..f541e60 100644
--- a/drivers/crypto/ccp/psp-dev.c
+++ b/drivers/crypto/ccp/psp-dev.c
@@ -423,7 +423,7 @@ EXPORT_SYMBOL_GPL(psp_copy_user_blob);
 static int sev_get_api_version(void)
 {
struct sev_user_data_status *status;
-   int error, ret;
+   int error = 0, ret;
 
status = &psp_master->status_cmd_buf;
ret = sev_platform_status(status, &error);
-- 
2.7.4



[PATCH 2/2] Allow SEV firmware to be chosen based on Family and Model

2018-09-14 Thread Janakarajan Natarajan
During PSP initialization, there is an attempt to update the SEV firmware
by looking in /lib/firmware/amd/. Currently, sev.fw is the expected name
of the firmware blob.

This patch will allow for firmware filenames based on the family and
model of the processor.

Model-specific firmware files are given the highest priority, followed by
firmware for a subset of models. Lastly, if the previous two options fail,
fall back to looking for sev.fw.
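
As a concrete illustration (hypothetical CPU values; the real logic is in
sev_get_firmware() below), the names tried for a Family 17h, Model 08h part
would be built like this:

#include <stdio.h>

/* Illustration only, mirroring the snprintf() calls in the patch. */
int main(void)
{
        char specific[64], subset[64];
        unsigned int family = 0x17, model = 0x08;

        snprintf(specific, sizeof(specific),
                 "amd/amd_sev_fam%.2xh_model%.2xh.sbin", family, model);
        snprintf(subset, sizeof(subset),
                 "amd/amd_sev_fam%.2xh_model%.1xxh.sbin",
                 family, (model & 0xf0) >> 4);

        /* Tried in this order; "amd/sev.fw" remains the final fallback. */
        printf("%s\n%s\namd/sev.fw\n", specific, subset);
        return 0;
}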

Signed-off-by: Janakarajan Natarajan 
---
 drivers/crypto/ccp/psp-dev.c | 44 
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/ccp/psp-dev.c b/drivers/crypto/ccp/psp-dev.c
index f541e60..3b33863 100644
--- a/drivers/crypto/ccp/psp-dev.c
+++ b/drivers/crypto/ccp/psp-dev.c
@@ -31,8 +31,9 @@
((psp_master->api_major) >= _maj && \
 (psp_master->api_minor) >= _min)
 
-#define DEVICE_NAME    "sev"
-#define SEV_FW_FILE    "amd/sev.fw"
+#define DEVICE_NAME    "sev"
+#define SEV_FW_FILE    "amd/sev.fw"
+#define SEV_FW_NAME_SIZE   64
 
 static DEFINE_MUTEX(sev_cmd_mutex);
 static struct sev_misc_dev *misc_dev;
@@ -440,6 +441,40 @@ static int sev_get_api_version(void)
return 0;
 }
 
+int sev_get_firmware(struct device *dev, const struct firmware **firmware)
+{
+   char fw_name_specific[SEV_FW_NAME_SIZE];
+   char fw_name_subset[SEV_FW_NAME_SIZE];
+
+   snprintf(fw_name_specific, sizeof(fw_name_specific),
+"amd/amd_sev_fam%.2xh_model%.2xh.sbin",
+boot_cpu_data.x86, boot_cpu_data.x86_model);
+
+   snprintf(fw_name_subset, sizeof(fw_name_subset),
+"amd/amd_sev_fam%.2xh_model%.1xxh.sbin",
+boot_cpu_data.x86, (boot_cpu_data.x86_model & 0xf0) >> 4);
+
+   /* Check for SEV FW for a particular model.
+* Ex. amd_sev_fam17h_model00h.sbin for Family 17h Model 00h
+*
+* or
+*
+* Check for SEV FW common to a subset of models.
+* Ex. amd_sev_fam17h_model0xh.sbin for
+* Family 17h Model 00h -- Family 17h Model 0Fh
+*
+* or
+*
+* Fall-back to using generic name: sev.fw
+*/
+   if ((firmware_request_nowarn(firmware, fw_name_specific, dev) >= 0) ||
+   (firmware_request_nowarn(firmware, fw_name_subset, dev) >= 0) ||
+   (firmware_request_nowarn(firmware, SEV_FW_FILE, dev) >= 0))
+   return 0;
+
+   return -ENOENT;
+}
+
 /* Don't fail if SEV FW couldn't be updated. Continue with existing SEV FW */
 static int sev_update_firmware(struct device *dev)
 {
@@ -449,9 +484,10 @@ static int sev_update_firmware(struct device *dev)
struct page *p;
u64 data_size;
 
-   ret = request_firmware(&firmware, SEV_FW_FILE, dev);
-   if (ret < 0)
+   if (sev_get_firmware(dev, &firmware) == -ENOENT) {
+   dev_dbg(dev, "No SEV firmware file present\n");
return -1;
+   }
 
/*
 * SEV FW expects the physical address given to it to be 32
-- 
2.7.4



[PATCH 0/2] Miscellaneous SEV fixes

2018-09-14 Thread Janakarajan Natarajan
The first patch provides a fix for a static checker warning.

The second patch allows for the SEV firmware blob, which will
be searched for when updating SEV during PSP initialization,
to be named based on the family and model of the processor.

Janakarajan Natarajan (2):
  Fix static checker warning
  Allow SEV firmware to be chosen based on Family and Model

 drivers/crypto/ccp/psp-dev.c | 46 +++-
 1 file changed, 41 insertions(+), 5 deletions(-)

-- 
2.7.4



Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-14 Thread Al Viro
On Fri, Sep 14, 2018 at 01:35:06PM -0700, Darren Hart wrote:
 
> Acked-by: Darren Hart (VMware) 
> 
> As for a longer term solution, would it be possible to init fops in such
> a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg
> so we don't have to duplicate this boilerplate for every ioctl fops
> structure?

Bad idea, that...  Because several years down the road somebody will add
an ioctl that takes an unsigned int for argument.  Without so much as looking
at your magical mystery macro being used to initialize file_operations.

FWIW, I would name that helper in a more blunt way - something like
compat_ioctl_only_compat_pointer_ioctls_here()...
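
For readers following along, the helper under discussion is roughly the
following shape (a sketch under the assumption that it only forwards pointer
arguments; see Arnd's series for the actual generic_compat_ioctl_ptrarg), which
is also why the warning above applies: an ioctl that takes an integer argument
must not be routed through it.

/* Sketch only: convert the 32-bit user pointer and forward it. */
static long example_compat_ptr_ioctl(struct file *file, unsigned int cmd,
                                     unsigned long arg)
{
        if (!file->f_op->unlocked_ioctl)
                return -ENOIOCTLCMD;

        return file->f_op->unlocked_ioctl(file, cmd,
                                          (unsigned long)compat_ptr(arg));
}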


Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-14 Thread Darren Hart
On Wed, Sep 12, 2018 at 05:08:52PM +0200, Arnd Bergmann wrote:
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bits wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will never run on s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann 
> ---
...
>  drivers/platform/x86/wmi.c  | 2 +-
...
>  static void link_event_work(struct work_struct *work)
> diff --git a/drivers/platform/x86/wmi.c b/drivers/platform/x86/wmi.c
> index 04791ea5d97b..e4d0697e07d6 100644
> --- a/drivers/platform/x86/wmi.c
> +++ b/drivers/platform/x86/wmi.c
> @@ -886,7 +886,7 @@ static const struct file_operations wmi_fops = {
>   .read   = wmi_char_read,
>   .open   = wmi_char_open,
>   .unlocked_ioctl = wmi_ioctl,
> - .compat_ioctl   = wmi_ioctl,
> + .compat_ioctl   = generic_compat_ioctl_ptrarg,
>  };

For platform/drivers/x86:

Acked-by: Darren Hart (VMware) 

As for a longer term solution, would it be possible to init fops in such
a way that the compat_ioctl call defaults to generic_compat_ioctl_ptrarg
so we don't have to duplicate this boilerplate for every ioctl fops
structure?

-- 
Darren Hart
VMware Open Source Technology Center


Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc

2018-09-14 Thread Jason A. Donenfeld
On Fri, Sep 14, 2018 at 7:38 PM Ard Biesheuvel
 wrote:
> so could we please bring that discussion to a close before we drop the ARM 
> code?

My understanding is that either these will find their way up to AndyP
and then back down here, or Eric or you will augment the .S in this
patch at a later date with an improvement commit that includes some
benchmarks.

Jason


Re: [PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel

2018-09-14 Thread Jason A. Donenfeld
On Fri, Sep 14, 2018 at 7:40 PM Ard Biesheuvel
 wrote:
> >   - Move away from makefile ifdef maze and instead prefer kconfig values,
> > which also makes the design a bit more modular too, which could help
> > in the future.
>
> Could you elaborate on this? From the patches, it is not clear to me
> how this has improved.

Feature detection was previously done as a confusing set of ifeq and
ifdefs. Instead, I've now put the logic for this into the kconfig,
which makes the makefiles and header files a bit simpler. This also
makes it easier to later on modularize Zinc itself if deemed
necessary.


Re: [PATCH net-next v4 08/20] zinc: Poly1305 ARM and ARM64 implementations

2018-09-14 Thread Jason A. Donenfeld
Hi Ard,

On Fri, Sep 14, 2018 at 7:27 PM Ard Biesheuvel
 wrote:
> As I asked in response to v3, could we please have this as a separate
> patch on top? The diff below is corrupted.

I had played with that originally, but thought it made things actually
harder to review, whereas here you have the changes presented pretty
straightforwardly, and I'd appreciate your review of them. If you and
Eric both prefer I split this into two commits, with the first one
just plopping down the CRYPTOGAMS code as is and the second one
bringing it up to kernel-snuff, I can do that.

> Also, both Andy and Eric have offered to get involved in upstreaming
> these changes to OpenSSL, so there is no delta to begin with.

Yes, I think this is probably a good long-term plan, which we can act
on sometime after Zinc is merged.

> I still don't like the GCC -includes, especially because these .h
> files contain function and variable definitions so they are not
> actually header files to begin with.

I very very strongly disagree with you here. I think doing it via
-include is significantly cleaner than any of the alternatives, and
allows the code to be cleanly expressed as conditionals that the
optimizer trivially compiles out in the case of stub functions
returning false and branch optimizes when the stub functions return
true. It is extremely important that these compile together as one
compilation unit. Yes, this is a different design than the crypto
API's approach, but I believe the approach presented here poses
significant improvements and is a lot cleaner.
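
To make the pattern concrete, here is a minimal sketch of what is being
described (illustrative names, not the actual Zinc sources): the glue header
pulled in with -include defines an arch hook and a marker macro, and the
generic C file falls back to a stub returning false, which the optimizer
removes entirely.

/* chacha20-x86_64-glue.h, pulled in via -include (sketch only):
 *
 *      static inline bool chacha20_arch(u8 *dst, const u8 *src, size_t len)
 *      {
 *              if (!chacha20_use_ssse3)
 *                      return false;
 *              ...accelerated path...
 *              return true;
 *      }
 *      #define HAVE_CHACHA20_ARCH_IMPLEMENTATION
 */

/* chacha20.c, generic part (sketch only) */
#ifndef HAVE_CHACHA20_ARCH_IMPLEMENTATION
static inline bool chacha20_arch(u8 *dst, const u8 *src, size_t len)
{
        return false;   /* constant false: the call below folds away */
}
#endif

void chacha20_crypt(u8 *dst, const u8 *src, size_t len)
{
        if (chacha20_arch(dst, src, len))
                return;         /* accelerated code handled everything */
        /* ...portable C implementation... */
}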

> Also, you mentioned in the commit log that you got rid of defines and
> made the code more modular, but as far as I can tell, libzinc is still
> a single monolithic binary that is essentially always builtin once we
> move random.c to it.

Yes, it's still monolithic, but it's now trivial to split up when the
time comes to do that. If you and AndyL think that it should be split
into multiple modules _now_, then I can go ahead and do that for v5.
But if it's not essential, it seems simpler to keep it as is. I'll
wait for word from you two on this.

Jason


Re: [PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel

2018-09-14 Thread Ard Biesheuvel
On 14 September 2018 at 18:19, Jason A. Donenfeld  wrote:
> Changes v3->v4:
>   - Remove mistaken double 07/17 patch.
>   - Fix whitespace issues in blake2s assembly.
>   - It's not possible to put compound literals into __initconst, so
> we now instead just use boring fixed size struct members.
>   - Move away from makefile ifdef maze and instead prefer kconfig values,
> which also makes the design a bit more modular too, which could help
> in the future.

Could you elaborate on this? From the patches, it is not clear to me
how this has improved.

>   - Port old crypto API implementations (ChaCha20 and Poly1305) to Zinc.
>   - Port security/keys/big_key to Zinc as second example of a good usage of
> Zinc.
>   - Document precisely what is different between the kernel code and
> CRYPTOGAMS code when the CRYPTOGAMS code is used.
>   - Move changelog to top of 00/20 message so that people can
> actually find it.
>
> ---
>
> This patchset is available on git.kernel.org in this branch, where it may be
> pulled directly for inclusion into net-next:
>
>   * 
> https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard
>
> ---
>
> WireGuard is a secure network tunnel written especially for Linux, which
> has faced around three years of serious development, deployment, and
> scrutiny. It delivers excellent performance and is extremely easy to
> use and configure. It has been designed with the primary goal of being
> both easy to audit by virtue of being small and highly secure from a
> cryptography and systems security perspective. WireGuard is used by some
> massive companies pushing enormous amounts of traffic, and likely
> already today you've consumed bytes that at some point transited through
> a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
> integrated into various userspace tools, Linux distributions, mobile
> phones, and data centers. There are ports in several languages to
> several operating systems, and even commercial hardware and services
> sold integrating WireGuard. It is time, therefore, for WireGuard to be
> properly integrated into Linux.
>
> Ample information, including documentation, installation instructions,
> and project details, is available at:
>
>   * https://www.wireguard.com/
>   * https://www.wireguard.com/papers/wireguard.pdf
>
> As it is currently an out-of-tree module, it lives in its own git repo
> and has its own mailing list, and every commit for the module is tested
> against every stable kernel since 3.10 on a variety of architectures
> using an extensive test suite:
>
>   * https://git.zx2c4.com/WireGuard
> https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
>   * https://lists.zx2c4.com/mailman/listinfo/wireguard
>   * https://www.wireguard.com/build-status/
>
> The project has been broadly discussed at conferences, and was presented
> to the Netdev developers in Seoul last November, where a paper was
> released detailing some interesting aspects of the project. Dave asked
> me after the talk if I would consider sending in a v1 "sooner rather
> than later", hence this patchset. A decision is still waiting from the
> Linux Plumbers Conference, but an update on these topics may be presented
> in Vancouver in a few months. Prior presentations:
>
>   * https://www.wireguard.com/presentations/
>   * https://www.wireguard.com/papers/wireguard-netdev22.pdf
>
> The cryptography in the protocol itself has been formally verified by
> several independent academic teams with positive results, and I know of
> two additional efforts on their way to further corroborate those
> findings. The version 1 protocol is "complete", and so the purpose of
> this review is to assess the implementation of the protocol. However, it
> still may be of interest to know that the thing you're reviewing uses a
> protocol with various nice security properties:
>
>   * https://www.wireguard.com/formal-verification/
>
> This patchset is divided into four segments. The first introduces a very
> simple helper for working with the FPU state for the purposes of amortizing
> SIMD operations. The second segment is a small collection of cryptographic
> primitives, split up into several commits by primitive and by hardware. The
> third shows usage of Zinc within the existing crypto API and as a replacement
> to the existing crypto API. The last is WireGuard itself, presented as an
> unintrusive and self-contained virtual network driver.
>
> It is intended that this entire patch series enter the kernel through
> DaveM's net-next tree. Subsequently, WireGuard patches will go through
> DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.
>
> Enjoy,
> Jason


Re: [PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc

2018-09-14 Thread Ard Biesheuvel
On 14 September 2018 at 18:22, Jason A. Donenfeld  wrote:
> Now that ChaCha20 is in Zinc, we can have the crypto API code simply
> call into it. The crypto API expects to have a stored key per instance
> and independent nonces, so we follow suit and store the key and
> initialize the nonce independently.
>

From our exchange re v3:

>> Then there is the performance claim. We know for instance that the
>> OpenSSL ARM NEON code for ChaCha20 is faster on cores that happen to
>> possess a micro-architectural property that ALU instructions are
>> essentially free when they are interleaved with SIMD instructions. But
>> we also know that a) Cortex-A7, which is a relevant target, is not one
>> of those cores, and b) that chip designers are not likely to optimize
>> for that particular usage pattern so relying on it in generic code is
>> unwise in general.
>
> That's interesting. I'll bring this up with AndyP. FWIW, if you think
> you have a real and compelling claim here, I'd be much more likely to
> accept a different ChaCha20 implementation than I would be to accept a
> different Poly1305 implementation. (It's a *lot* harder to screw up
> ChaCha20 than it is to screw up Poly1305.)
>

so could we please bring that discussion to a close before we drop the ARM code?

I am fine with dropping the arm64 code btw.

> Signed-off-by: Jason A. Donenfeld 
> Cc: Samuel Neves 
> Cc: Andy Lutomirski 
> Cc: Greg KH 
> Cc: Jean-Philippe Aumasson 
> Cc: Eric Biggers 
> ---
>  arch/arm/configs/exynos_defconfig   |   1 -
>  arch/arm/configs/multi_v7_defconfig |   1 -
>  arch/arm/configs/omap2plus_defconfig|   1 -
>  arch/arm/crypto/Kconfig |   6 -
>  arch/arm/crypto/Makefile|   2 -
>  arch/arm/crypto/chacha20-neon-core.S| 521 
>  arch/arm/crypto/chacha20-neon-glue.c| 127 -
>  arch/arm64/configs/defconfig|   1 -
>  arch/arm64/crypto/Kconfig   |   6 -
>  arch/arm64/crypto/Makefile  |   3 -
>  arch/arm64/crypto/chacha20-neon-core.S  | 450 -
>  arch/arm64/crypto/chacha20-neon-glue.c  | 133 -
>  arch/x86/crypto/Makefile|   3 -
>  arch/x86/crypto/chacha20-avx2-x86_64.S  | 448 -
>  arch/x86/crypto/chacha20-ssse3-x86_64.S | 630 
>  arch/x86/crypto/chacha20_glue.c | 146 --
>  crypto/Kconfig  |  16 -
>  crypto/Makefile |   2 +-
>  crypto/chacha20_generic.c   | 136 -
>  crypto/chacha20_zinc.c  | 100 
>  crypto/chacha20poly1305.c   |   2 +-
>  include/crypto/chacha20.h   |  12 -
>  22 files changed, 102 insertions(+), 2645 deletions(-)
>  delete mode 100644 arch/arm/crypto/chacha20-neon-core.S
>  delete mode 100644 arch/arm/crypto/chacha20-neon-glue.c
>  delete mode 100644 arch/arm64/crypto/chacha20-neon-core.S
>  delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
>  delete mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S
>  delete mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
>  delete mode 100644 arch/x86/crypto/chacha20_glue.c
>  delete mode 100644 crypto/chacha20_generic.c
>  create mode 100644 crypto/chacha20_zinc.c
>
> diff --git a/arch/arm/configs/exynos_defconfig 
> b/arch/arm/configs/exynos_defconfig
> index 27ea6dfcf2f2..95929b5e7b10 100644
> --- a/arch/arm/configs/exynos_defconfig
> +++ b/arch/arm/configs/exynos_defconfig
> @@ -350,7 +350,6 @@ CONFIG_CRYPTO_SHA1_ARM_NEON=m
>  CONFIG_CRYPTO_SHA256_ARM=m
>  CONFIG_CRYPTO_SHA512_ARM=m
>  CONFIG_CRYPTO_AES_ARM_BS=m
> -CONFIG_CRYPTO_CHACHA20_NEON=m
>  CONFIG_CRC_CCITT=y
>  CONFIG_FONTS=y
>  CONFIG_FONT_7x14=y
> diff --git a/arch/arm/configs/multi_v7_defconfig 
> b/arch/arm/configs/multi_v7_defconfig
> index fc33444e94f0..63be07724db3 100644
> --- a/arch/arm/configs/multi_v7_defconfig
> +++ b/arch/arm/configs/multi_v7_defconfig
> @@ -1000,4 +1000,3 @@ CONFIG_CRYPTO_AES_ARM_BS=m
>  CONFIG_CRYPTO_AES_ARM_CE=m
>  CONFIG_CRYPTO_GHASH_ARM_CE=m
>  CONFIG_CRYPTO_CRC32_ARM_CE=m
> -CONFIG_CRYPTO_CHACHA20_NEON=m
> diff --git a/arch/arm/configs/omap2plus_defconfig 
> b/arch/arm/configs/omap2plus_defconfig
> index 6491419b1dad..f585a8ecc336 100644
> --- a/arch/arm/configs/omap2plus_defconfig
> +++ b/arch/arm/configs/omap2plus_defconfig
> @@ -547,7 +547,6 @@ CONFIG_CRYPTO_SHA512_ARM=m
>  CONFIG_CRYPTO_AES_ARM=m
>  CONFIG_CRYPTO_AES_ARM_BS=m
>  CONFIG_CRYPTO_GHASH_ARM_CE=m
> -CONFIG_CRYPTO_CHACHA20_NEON=m
>  CONFIG_CRC_CCITT=y
>  CONFIG_CRC_T10DIF=y
>  CONFIG_CRC_ITU_T=y
> diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
> index 925d1364727a..fb80fd89f0e7 100644
> --- a/arch/arm/crypto/Kconfig
> +++ b/arch/arm/crypto/Kconfig
> @@ -115,12 +115,6 @@ config CRYPTO_CRC32_ARM_CE
> depends on KERNEL_MODE_NEON && CRC32
> select CRYPTO_HASH
>
> -config CRYPTO_CHACHA20_NEON
> -   tristate "NEON accelerated ChaCha20 symmetric cipher"
> -   depends on KER

Re: [PATCH net-next v4 08/20] zinc: Poly1305 ARM and ARM64 implementations

2018-09-14 Thread Ard Biesheuvel
On 14 September 2018 at 18:22, Jason A. Donenfeld  wrote:
> These NEON and non-NEON implementations come from Andy Polyakov's
> implementation. They are exactly the same as Andy Polyakov's original,
> with the following exceptions:
>
> - Entries and exits use the proper kernel convention macro.
> - CPU feature checking is done in C by the glue code, so that has been
>   removed from the assembly.
> - The function names have been renamed to fit kernel conventions.
> - Labels have been renamed to fit kernel conventions.
> - The neon code can jump to the scalar code when it makes sense to do
>   so.
>
> After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual
> instructions from the original.
>

As I asked in response to v3, could we please have this as a separate
patch on top? The diff below is corrupted.

Also, both Andy and Eric have offered to get involved in upstreaming
these changes to OpenSSL, so there is no delta to begin with.

> ARM:
>
> -poly1305_init:
> -.Lpoly1305_init:
> +ENTRY(poly1305_init_arm)
> stmdb   sp!,{r4-r11}
>
> eor r3,r3,r3
> @@ -18,8 +25,6 @@
> moveq   r0,#0
> beq .Lno_key
>
> -   adr r11,.Lpoly1305_init
> -   ldr r12,.LOPENSSL_armcap
> ldrbr4,[r1,#0]
> mov r10,#0x0fff
> ldrbr5,[r1,#1]
> @@ -34,8 +39,6 @@
> ldrbr7,[r1,#6]
> and r4,r4,r10
>
> -   ldr r12,[r11,r12]   @ OPENSSL_armcap_P
> -   ldr r12,[r12]
> ldrbr8,[r1,#7]
> orr r5,r5,r6,lsl#8
> ldrbr6,[r1,#8]
> @@ -45,22 +48,6 @@
> ldrbr8,[r1,#10]
> and r5,r5,r3
>
> -   tst r12,#ARMV7_NEON @ check for NEON
> -   adr r9,poly1305_blocks_neon
> -   adr r11,poly1305_blocks
> -   it  ne
> -   movne   r11,r9
> -   adr r12,poly1305_emit
> -   adr r10,poly1305_emit_neon
> -   it  ne
> -   movne   r12,r10
> -   itete   eq
> -   addeq   r12,r11,#(poly1305_emit-.Lpoly1305_init)
> -   addne   r12,r11,#(poly1305_emit_neon-.Lpoly1305_init)
> -   addeq   r11,r11,#(poly1305_blocks-.Lpoly1305_init)
> -   addne   r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init)
> -   orr r12,r12,#1  @ thumb-ify address
> -   orr r11,r11,#1
> ldrbr9,[r1,#11]
> orr r6,r6,r7,lsl#8
> ldrbr7,[r1,#12]
> @@ -79,17 +66,16 @@
> str r6,[r0,#8]
> and r7,r7,r3
> str r7,[r0,#12]
> -   stmia   r2,{r11,r12}@ fill functions table
> -   mov r0,#1
> -   mov r0,#0
>  .Lno_key:
> ldmia   sp!,{r4-r11}
> bx  lr  @ bxlr
> tst lr,#1
> moveq   pc,lr   @ be binary compatible with V4, yet
> .word   0xe12fff1e  @ interoperable with Thumb 
> ISA:-)
> -poly1305_blocks:
> -.Lpoly1305_blocks:
> +ENDPROC(poly1305_init_arm)
> +
> +ENTRY(poly1305_blocks_arm)
> +.Lpoly1305_blocks_arm:
> stmdb   sp!,{r3-r11,lr}
>
> andsr2,r2,#-16
> @@ -231,10 +217,11 @@
> tst lr,#1
> moveq   pc,lr   @ be binary compatible with V4, yet
> .word   0xe12fff1e  @ interoperable with Thumb 
> ISA:-)
> -poly1305_emit:
> +ENDPROC(poly1305_blocks_arm)
> +
> +ENTRY(poly1305_emit_arm)
> stmdb   sp!,{r4-r11}
>  .Lpoly1305_emit_enter:
> -
> ldmia   r0,{r3-r7}
> addsr8,r3,#5@ compare to modulus
> adcsr9,r4,#0
> @@ -305,8 +292,12 @@
> tst lr,#1
> moveq   pc,lr   @ be binary compatible with V4, yet
> .word   0xe12fff1e  @ interoperable with Thumb 
> ISA:-)
> +ENDPROC(poly1305_emit_arm)
> +
> +
>
> -poly1305_init_neon:
> +ENTRY(poly1305_init_neon)
> +.Lpoly1305_init_neon:
> ldr r4,[r0,#20] @ load key base 2^32
> ldr r5,[r0,#24]
> ldr r6,[r0,#28]
> @@ -515,8 +506,9 @@
> vst1.32 {d8[1]},[r7]
>
> bx  lr  @ bxlr
> +ENDPROC(poly1305_init_neon)
>
> -poly1305_blocks_neon:
> +ENTRY(poly1305_blocks_neon)
> ldr ip,[r0,#36] @ is_base2_26
> andsr2,r2,#-16
> beq .Lno_data_neon
> @@ -524,7 +516,7 @@
> cmp r2,#64
> bhs .Lenter_neon
> tst ip,ip   @ is_base2_26?
> -   beq .Lpoly1305_blocks
> +   beq .Lpoly1305_blocks_arm
>
>  .Lenter_neon:
> stmdb   sp!,{r4-r7}
> @@ -534,7 +526,7 @@
> bne .Lbase2_26_neon
>
> stmdb   sp!,{r1-r3,lr}
> -   bl  poly1305_init_neon
> +   bl  .Lpoly1305_init_neon
>
> ldr r4,[r0,#0]  @ load hash value base 2^32
> ldr r5,[r0,#4]
> @@ -989,8 +981,9 @@
> ldmia   sp!,{r4-r7}
>  .Lno_data_ne

[PATCH net-next v4 12/20] zinc: BLAKE2s generic C implementation and selftest

2018-09-14 Thread Jason A. Donenfeld
The C implementation was originally based on Samuel Neves' public
domain reference implementation but has since been heavily modified
for the kernel. We're able to do compile-time optimizations by moving
some scaffolding around the final function into the header file.

Information: https://blake2.net/
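
Based purely on the prototypes added in this header, a caller would use the
one-shot helper along these lines (sketch; example_hash_message() is not part
of the patch):

static void example_hash_message(u8 digest[BLAKE2S_OUTBYTES],
                                 const u8 *msg, size_t msg_len,
                                 const u8 key[BLAKE2S_KEYBYTES])
{
        /* Keyed hash: out, in, key, outlen, inlen, keylen. */
        blake2s(digest, msg, key, BLAKE2S_OUTBYTES, msg_len, BLAKE2S_KEYBYTES);

        /* Unkeyed hash: a NULL key with keylen == 0 is also valid. */
        blake2s(digest, msg, NULL, BLAKE2S_OUTBYTES, msg_len, 0);
}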

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
---
 include/zinc/blake2s.h  |  101 ++
 lib/zinc/Kconfig|4 +
 lib/zinc/Makefile   |4 +
 lib/zinc/blake2s/blake2s.c  |  274 +
 lib/zinc/main.c |5 +
 lib/zinc/selftest/blake2s.h | 2095 +++
 6 files changed, 2483 insertions(+)
 create mode 100644 include/zinc/blake2s.h
 create mode 100644 lib/zinc/blake2s/blake2s.c
 create mode 100644 lib/zinc/selftest/blake2s.h

diff --git a/include/zinc/blake2s.h b/include/zinc/blake2s.h
new file mode 100644
index ..5e32d576e86a
--- /dev/null
+++ b/include/zinc/blake2s.h
@@ -0,0 +1,101 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ */
+
+#ifndef _ZINC_BLAKE2S_H
+#define _ZINC_BLAKE2S_H
+
+#include 
+#include 
+#include 
+
+enum blake2s_lengths {
+   BLAKE2S_BLOCKBYTES = 64,
+   BLAKE2S_OUTBYTES = 32,
+   BLAKE2S_KEYBYTES = 32
+};
+
+struct blake2s_state {
+   u32 h[8];
+   u32 t[2];
+   u32 f[2];
+   u8 buf[BLAKE2S_BLOCKBYTES];
+   size_t buflen;
+   u8 last_node;
+};
+
+void blake2s_init(struct blake2s_state *state, const size_t outlen);
+void blake2s_init_key(struct blake2s_state *state, const size_t outlen,
+ const void *key, const size_t keylen);
+void blake2s_update(struct blake2s_state *state, const u8 *in, size_t inlen);
+void __blake2s_final(struct blake2s_state *state);
+static inline void blake2s_final(struct blake2s_state *state, u8 *out,
+const size_t outlen)
+{
+   int i;
+
+#ifdef DEBUG
+   BUG_ON(!out || !outlen || outlen > BLAKE2S_OUTBYTES);
+#endif
+   __blake2s_final(state);
+
+   if (__builtin_constant_p(outlen) && !(outlen % sizeof(u32))) {
+   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+   IS_ALIGNED((unsigned long)out, __alignof__(u32))) {
+   __le32 *outwords = (__le32 *)out;
+
+   for (i = 0; i < outlen / sizeof(u32); ++i)
+   outwords[i] = cpu_to_le32(state->h[i]);
+   } else {
+   __le32 buffer[BLAKE2S_OUTBYTES];
+
+   for (i = 0; i < outlen / sizeof(u32); ++i)
+   buffer[i] = cpu_to_le32(state->h[i]);
+   memcpy(out, buffer, outlen);
+   memzero_explicit(buffer, sizeof(buffer));
+   }
+   } else {
+   u8 buffer[BLAKE2S_OUTBYTES] __aligned(__alignof__(u32));
+   __le32 *outwords = (__le32 *)buffer;
+
+   for (i = 0; i < 8; ++i)
+   outwords[i] = cpu_to_le32(state->h[i]);
+   memcpy(out, buffer, outlen);
+   memzero_explicit(buffer, sizeof(buffer));
+   }
+
+   memzero_explicit(state, sizeof(*state));
+}
+
+static inline void blake2s(u8 *out, const u8 *in, const u8 *key,
+  const size_t outlen, const size_t inlen,
+  const size_t keylen)
+{
+   struct blake2s_state state;
+
+#ifdef DEBUG
+   BUG_ON((!in && inlen > 0) || !out || !outlen ||
+  outlen > BLAKE2S_OUTBYTES || keylen > BLAKE2S_KEYBYTES ||
+  (!key && keylen));
+#endif
+
+   if (keylen)
+   blake2s_init_key(&state, outlen, key, keylen);
+   else
+   blake2s_init(&state, outlen);
+
+   blake2s_update(&state, in, inlen);
+   blake2s_final(&state, out, outlen);
+}
+
+void blake2s_hmac(u8 *out, const u8 *in, const u8 *key, const size_t outlen,
+ const size_t inlen, const size_t keylen);
+
+void blake2s_fpu_init(void);
+
+#ifdef DEBUG
+bool blake2s_selftest(void);
+#endif
+
+#endif /* _ZINC_BLAKE2S_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 98d095ed2418..4db30012c2d6 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -17,6 +17,10 @@ config ZINC_CHACHA20POLY1305
select ZINC_POLY1305
select CRYPTO_BLKCIPHER
 
+config ZINC_BLAKE2S
+   bool
+   select ZINC
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 1664871aaf71..45817ec5539c 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -51,6 +51,10 @@ ifeq ($(CONFIG_ZINC_CHACHA20POLY1305),y)
 zinc-y += chacha20poly1305.o
 endif
 
+ifeq ($(CONFIG_ZINC_BLAKE2S),y)
+zinc-y += blake2s/blake2s.o
+endif
+
 zinc-y += main.o
 
 obj-$(CONFIG_ZINC

[PATCH net-next v4 15/20] zinc: Curve25519 ARM implementation

2018-09-14 Thread Jason A. Donenfeld
This comes from Dan Bernstein and Peter Schwabe's public domain NEON
code, and has been modified to be friendly for kernel space, as well as
removing some qhasm strangeness to be more idiomatic.

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Russell King 
Cc: linux-arm-ker...@lists.infradead.org
---
 lib/zinc/Makefile |4 +
 lib/zinc/curve25519/curve25519-arm-glue.h |   46 +
 lib/zinc/curve25519/curve25519-arm.S  | 2095 +
 3 files changed, 2145 insertions(+)
 create mode 100644 lib/zinc/curve25519/curve25519-arm-glue.h
 create mode 100644 lib/zinc/curve25519/curve25519-arm.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 25826d3eb74a..0a9d97146c70 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -53,6 +53,10 @@ endif
 
 ifeq ($(CONFIG_ZINC_CURVE25519),y)
 zinc-y += curve25519/curve25519.o
+ifeq ($(CONFIG_ZINC_ARCH_ARM)$(CONFIG_KERNEL_MODE_NEON),yy)
+zinc-y += curve25519/curve25519-arm.o
+CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-arm-glue.h
+endif
 endif
 
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
diff --git a/lib/zinc/curve25519/curve25519-arm-glue.h 
b/lib/zinc/curve25519/curve25519-arm-glue.h
new file mode 100644
index ..36e8002e2477
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-arm-glue.h
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+#if IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && __LINUX_ARM_ARCH__ == 7
+#define ARM_USE_NEON
+asmlinkage void curve25519_neon(u8 mypublic[CURVE25519_POINT_SIZE],
+   const u8 secret[CURVE25519_POINT_SIZE],
+   const u8 basepoint[CURVE25519_POINT_SIZE]);
+#endif
+
+static bool curve25519_use_neon __ro_after_init;
+
+void __init curve25519_fpu_init(void)
+{
+   curve25519_use_neon = elf_hwcap & HWCAP_NEON;
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_POINT_SIZE],
+  const u8 secret[CURVE25519_POINT_SIZE],
+  const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+#ifdef ARM_USE_NEON
+   if (curve25519_use_neon && may_use_simd()) {
+   kernel_neon_begin();
+   curve25519_neon(mypublic, secret, basepoint);
+   kernel_neon_end();
+   return true;
+   }
+#endif
+   return false;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_POINT_SIZE],
+   const u8 secret[CURVE25519_POINT_SIZE])
+{
+   return false;
+}
+
+#define HAVE_CURVE25519_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/curve25519/curve25519-arm.S 
b/lib/zinc/curve25519/curve25519-arm.S
new file mode 100644
index ..ad6690b8ffd7
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-arm.S
@@ -0,0 +1,2095 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ *
+ * Based on public domain code from Daniel J. Bernstein and Peter Schwabe. This
+ * has been built from SUPERCOP's curve25519/neon2/scalarmult.pq using qhasm,
+ * but has subsequently been manually reworked for use in kernel space.
+ */
+
+#if __LINUX_ARM_ARCH__ == 7
+#include 
+
+.text
+.fpu neon
+.align 4
+
+ENTRY(curve25519_neon)
+   push{r4-r11, lr}
+   mov ip, sp
+   sub r3, sp, #704
+   and r3, r3, #0xfff0
+   mov sp, r3
+   movwr4, #0
+   movwr5, #254
+   vmov.i32q0, #1
+   vshr.u64q1, q0, #7
+   vshr.u64q0, q0, #8
+   vmov.i32d4, #19
+   vmov.i32d5, #38
+   add r6, sp, #480
+   vst1.8  {d2-d3}, [r6, : 128]
+   add r6, sp, #496
+   vst1.8  {d0-d1}, [r6, : 128]
+   add r6, sp, #512
+   vst1.8  {d4-d5}, [r6, : 128]
+   add r6, r3, #0
+   vmov.i32q2, #0
+   vst1.8  {d4-d5}, [r6, : 128]!
+   vst1.8  {d4-d5}, [r6, : 128]!
+   vst1.8  d4, [r6, : 64]
+   add r6, r3, #0
+   movwr7, #960
+   sub r7, r7, #2
+   neg r7, r7
+   sub r7, r7, r7, LSL #7
+   str r7, [r6]
+   add r6, sp, #672
+   vld1.8  {d4-d5}, [r1]!
+   vld1.8  {d6-d7}, [r1]
+   vst1.8  {d4-d5}, [r6, : 128]!
+   vst1.8  {d6-d7}, [r6, : 128]
+   sub r1, r6, #16
+   ldrbr6, [r1]
+   and r6, r6, #248
+   strbr6, [r1]
+   ldrbr6, [r1, #31]
+   and r6, r6, #127
+   orr r6, r6, #64
+   strbr6,

[PATCH net-next v4 16/20] zinc: Curve25519 x86_64 implementation

2018-09-14 Thread Jason A. Donenfeld
This implementation is the fastest available x86_64 implementation, and
unlike Sandy2x, it doesn't require use of the floating point registers at
all. Instead it makes use of BMI2 and ADX, available on recent
microarchitectures. The implementation was written by Armando
Faz-Hernández with contributions (upstream) from Samuel Neves and me,
in addition to further changes in the kernel implementation from us.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Armando Faz-Hernández 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
---
 lib/zinc/Makefile|3 +
 lib/zinc/curve25519/curve25519-x86_64-glue.h |   49 +
 lib/zinc/curve25519/curve25519-x86_64.h  | 2333 ++
 3 files changed, 2385 insertions(+)
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64-glue.h
 create mode 100644 lib/zinc/curve25519/curve25519-x86_64.h

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 0a9d97146c70..adb3acda630a 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -57,6 +57,9 @@ ifeq ($(CONFIG_ZINC_ARCH_ARM)$(CONFIG_KERNEL_MODE_NEON),yy)
 zinc-y += curve25519/curve25519-arm.o
 CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-arm-glue.h
 endif
+ifeq ($(CONFIG_ZINC_ARCH_X86_64),y)
+CFLAGS_curve25519.o += -include $(srctree)/$(src)/curve25519/curve25519-x86_64-glue.h
+endif
 endif
 
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
diff --git a/lib/zinc/curve25519/curve25519-x86_64-glue.h 
b/lib/zinc/curve25519/curve25519-x86_64-glue.h
new file mode 100644
index ..7011ad055c33
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64-glue.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ */
+
+#include 
+#include 
+#include 
+
+#include "curve25519-x86_64.h"
+
+static bool curve25519_use_bmi2 __ro_after_init;
+static bool curve25519_use_adx __ro_after_init;
+
+void __init curve25519_fpu_init(void)
+{
+   curve25519_use_bmi2 = boot_cpu_has(X86_FEATURE_BMI2);
+   curve25519_use_adx = boot_cpu_has(X86_FEATURE_BMI2) &&
+boot_cpu_has(X86_FEATURE_ADX);
+}
+
+static inline bool curve25519_arch(u8 mypublic[CURVE25519_POINT_SIZE],
+  const u8 secret[CURVE25519_POINT_SIZE],
+  const u8 basepoint[CURVE25519_POINT_SIZE])
+{
+   if (curve25519_use_adx) {
+   curve25519_adx(mypublic, secret, basepoint);
+   return true;
+   } else if (curve25519_use_bmi2) {
+   curve25519_bmi2(mypublic, secret, basepoint);
+   return true;
+   }
+   return false;
+}
+
+static inline bool curve25519_base_arch(u8 pub[CURVE25519_POINT_SIZE],
+   const u8 secret[CURVE25519_POINT_SIZE])
+{
+   if (curve25519_use_adx) {
+   curve25519_adx_base(pub, secret);
+   return true;
+   } else if (curve25519_use_bmi2) {
+   curve25519_bmi2_base(pub, secret);
+   return true;
+   }
+   return false;
+}
+
+#define HAVE_CURVE25519_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/curve25519/curve25519-x86_64.h 
b/lib/zinc/curve25519/curve25519-x86_64.h
new file mode 100644
index ..15c2e4307583
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-x86_64.h
@@ -0,0 +1,2333 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (c) 2017 Armando Faz . All Rights Reserved.
+ * Copyright (C) 2018 Jason A. Donenfeld . All Rights 
Reserved.
+ * Copyright (C) 2018 Samuel Neves . All Rights Reserved.
+ */
+
+enum { NUM_WORDS_ELTFP25519 = 4 };
+typedef __aligned(32) u64 eltfp25519_1w[NUM_WORDS_ELTFP25519];
+typedef __aligned(32) u64 eltfp25519_1w_buffer[2 * NUM_WORDS_ELTFP25519];
+
+#define mul_eltfp25519_1w_adx(c, a, b) do { \
+   mul_256x256_integer_adx(m.buffer, a, b); \
+   red_eltfp25519_1w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_1w_bmi2(c, a, b) do { \
+   mul_256x256_integer_bmi2(m.buffer, a, b); \
+   red_eltfp25519_1w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_adx(a) do { \
+   sqr_256x256_integer_adx(m.buffer, a); \
+   red_eltfp25519_1w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_1w_bmi2(a) do { \
+   sqr_256x256_integer_bmi2(m.buffer, a); \
+   red_eltfp25519_1w_bmi2(a, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_adx(c, a, b) do { \
+   mul2_256x256_integer_adx(m.buffer, a, b); \
+   red_eltfp25519_2w_adx(c, m.buffer); \
+} while (0)
+
+#define mul_eltfp25519_2w_bmi2(c, a, b) do { \
+   mul2_256x256_integer_bmi2(m.buffer, a, b); \
+   red_eltfp25519_2w_bmi2(c, m.buffer); \
+} while (0)
+
+#define sqr_eltfp25519_2w_adx(a) do { \
+   sqr2_256x256_integer_adx(m.buffer, a); \
+   red_eltfp25519_2w_adx(a, m.buffer); \
+} while (0)
+
+#define sqr_el

[PATCH net-next v4 14/20] zinc: Curve25519 generic C implementations and selftest

2018-09-14 Thread Jason A. Donenfeld
This contains two formally verified C implementations of the Curve25519
scalar multiplication function, one for 32-bit systems, and one for
64-bit systems whose compiler supports efficient 128-bit integer types.
Not only are these implementations formally verified, but they are also
the fastest available C implementations. They have been modified to be
friendly to kernel space and to be generally less horrendous looking,
but still an effort has been made to retain their formally verified
characteristic, and so the C might look slightly unidiomatic.

The 64-bit version comes from HACL*: https://github.com/project-everest/hacl-star
The 32-bit version comes from Fiat: https://github.com/mit-plv/fiat-crypto

Information: https://cr.yp.to/ecdh.html
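
Based on the prototypes in the header added below, a complete X25519 exchange
looks roughly like this (sketch; example_x25519() is not part of the patch,
and a false return is taken to mean a degenerate all-zero result, per the
__must_check annotation):

static int example_x25519(u8 shared[CURVE25519_POINT_SIZE],
                          const u8 peer_public[CURVE25519_POINT_SIZE])
{
        u8 my_secret[CURVE25519_POINT_SIZE];
        u8 my_public[CURVE25519_POINT_SIZE];

        curve25519_generate_secret(my_secret);
        if (!curve25519_generate_public(my_public, my_secret))
                return -EINVAL;

        /* my_public would be sent to the peer; then derive the shared key. */
        if (!curve25519(shared, my_secret, peer_public))
                return -EINVAL;

        return 0;
}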

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Karthikeyan Bhargavan 
---
 include/zinc/curve25519.h   |   28 +
 lib/zinc/Kconfig|5 +
 lib/zinc/Makefile   |4 +
 lib/zinc/curve25519/curve25519-fiat32.h |  862 +++
 lib/zinc/curve25519/curve25519-hacl64.h |  785 ++
 lib/zinc/curve25519/curve25519.c|   83 ++
 lib/zinc/main.c |5 +
 lib/zinc/selftest/curve25519.h  | 1321 +++
 8 files changed, 3093 insertions(+)
 create mode 100644 include/zinc/curve25519.h
 create mode 100644 lib/zinc/curve25519/curve25519-fiat32.h
 create mode 100644 lib/zinc/curve25519/curve25519-hacl64.h
 create mode 100644 lib/zinc/curve25519/curve25519.c
 create mode 100644 lib/zinc/selftest/curve25519.h

diff --git a/include/zinc/curve25519.h b/include/zinc/curve25519.h
new file mode 100644
index ..0e1caf02f1d8
--- /dev/null
+++ b/include/zinc/curve25519.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights 
Reserved.
+ */
+
+#ifndef _ZINC_CURVE25519_H
+#define _ZINC_CURVE25519_H
+
+#include 
+
+enum curve25519_lengths {
+   CURVE25519_POINT_SIZE = 32
+};
+
+bool __must_check curve25519(u8 mypublic[CURVE25519_POINT_SIZE],
+const u8 secret[CURVE25519_POINT_SIZE],
+const u8 basepoint[CURVE25519_POINT_SIZE]);
+void curve25519_generate_secret(u8 secret[CURVE25519_POINT_SIZE]);
+bool __must_check curve25519_generate_public(
+   u8 pub[CURVE25519_POINT_SIZE], const u8 secret[CURVE25519_POINT_SIZE]);
+
+void curve25519_fpu_init(void);
+
+#ifdef DEBUG
+bool curve25519_selftest(void);
+#endif
+
+#endif /* _ZINC_CURVE25519_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 4db30012c2d6..e4e10a1f4027 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -21,6 +21,11 @@ config ZINC_BLAKE2S
bool
select ZINC
 
+config ZINC_CURVE25519
+   bool
+   select ZINC
+   select CONFIG_CRYPTO
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 49455abfdf5e..25826d3eb74a 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -51,6 +51,10 @@ ifeq ($(CONFIG_ZINC_CHACHA20POLY1305),y)
 zinc-y += chacha20poly1305.o
 endif
 
+ifeq ($(CONFIG_ZINC_CURVE25519),y)
+zinc-y += curve25519/curve25519.o
+endif
+
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
 zinc-y += blake2s/blake2s.o
 ifeq ($(CONFIG_ZINC_ARCH_X86_64),y)
diff --git a/lib/zinc/curve25519/curve25519-fiat32.h 
b/lib/zinc/curve25519/curve25519-fiat32.h
new file mode 100644
index ..8fa327cb1edc
--- /dev/null
+++ b/lib/zinc/curve25519/curve25519-fiat32.h
@@ -0,0 +1,862 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2016 The fiat-crypto Authors.
+ * Copyright (C) 2018 Jason A. Donenfeld . All Rights 
Reserved.
+ *
+ * This is a machine-generated formally verified implementation of Curve25519
+ * ECDH from: . Though originally
+ * machine generated, it has been tweaked to be suitable for use in the kernel.
+ * It is optimized for 32-bit machines and machines that cannot work 
efficiently
+ * with 128-bit integer types.
+ */
+
+/* fe means field element. Here the field is \Z/(2^255-19). An element t,
+ * entries t[0]...t[9], represents the integer t[0]+2^26 t[1]+2^51 t[2]+2^77
+ * t[3]+2^102 t[4]+...+2^230 t[9].
+ * fe limbs are bounded by 1.125*2^26,1.125*2^25,1.125*2^26,1.125*2^25,etc.
+ * Multiplication and carrying produce fe from fe_loose.
+ */
+typedef struct fe { u32 v[10]; } fe;
+
+/* fe_loose limbs are bounded by 
3.375*2^26,3.375*2^25,3.375*2^26,3.375*2^25,etc
+ * Addition and subtraction produce fe_loose from (fe, fe).
+ */
+typedef struct fe_loose { u32 v[10]; } fe_loose;
+
+static __always_inline void fe_frombytes_impl(u32 h[10], const u8 *s)
+{
+   /* Ignores top bit of s. */
+   u32 a0 = get_unaligned_le32(s);
+   u32 a1 = get_unaligned_le32(s+4);
+   u32 a2 = get_unaligned_le

Re: [PATCH] crypto: chacha20 - Fix chacha20_block() keystream alignment (again)

2018-09-14 Thread Eric Biggers
Hi Yann,

On Wed, Sep 12, 2018 at 11:50:00AM +0200, Yann Droneaud wrote:
> Hi,
> 
> Le mardi 11 septembre 2018 à 20:05 -0700, Eric Biggers a écrit :
> > From: Eric Biggers 
> > 
> > In commit 9f480faec58c ("crypto: chacha20 - Fix keystream alignment for
> > chacha20_block()"), I had missed that chacha20_block() can be called
> > directly on the buffer passed to get_random_bytes(), which can have any
> > alignment.  So, while my commit didn't break anything, it didn't fully
> > solve the alignment problems.
> > 
> > Revert my solution and just update chacha20_block() to use
> > put_unaligned_le32(), so the output buffer need not be aligned.
> > This is simpler, and on many CPUs it's the same speed.
> > 
> > But, I kept the 'tmp' buffers in extract_crng_user() and
> > _get_random_bytes() 4-byte aligned, since that alignment is actually
> > needed for _crng_backtrack_protect() too.
> > 
> > Reported-by: Stephan Müller 
> > Cc: Theodore Ts'o 
> > Signed-off-by: Eric Biggers 
> > ---
> >  crypto/chacha20_generic.c |  7 ---
> >  drivers/char/random.c | 24 
> >  include/crypto/chacha20.h |  3 +--
> >  lib/chacha20.c|  6 +++---
> >  4 files changed, 20 insertions(+), 20 deletions(-)
> > 
> > diff --git a/crypto/chacha20_generic.c b/crypto/chacha20_generic.c
> > index e451c3cb6a56..3ae96587caf9 100644
> > --- a/crypto/chacha20_generic.c
> > +++ b/crypto/chacha20_generic.c
> > @@ -18,20 +18,21 @@
> >  static void chacha20_docrypt(u32 *state, u8 *dst, const u8 *src,
> >  unsigned int bytes)
> >  {
> > -   u32 stream[CHACHA20_BLOCK_WORDS];
> > +   /* aligned to potentially speed up crypto_xor() */
> > +   u8 stream[CHACHA20_BLOCK_SIZE] __aligned(sizeof(long));
> >  
> > if (dst != src)
> > memcpy(dst, src, bytes);
> >  
> > while (bytes >= CHACHA20_BLOCK_SIZE) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, CHACHA20_BLOCK_SIZE);
> > +   crypto_xor(dst, stream, CHACHA20_BLOCK_SIZE);
> > bytes -= CHACHA20_BLOCK_SIZE;
> > dst += CHACHA20_BLOCK_SIZE;
> > }
> > if (bytes) {
> > chacha20_block(state, stream);
> > -   crypto_xor(dst, (const u8 *)stream, bytes);
> > +   crypto_xor(dst, stream, bytes);
> > }
> >  }
> >  
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index bf5f99fc36f1..d22d967c50f0 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1003,7 +1003,7 @@ static void extract_crng(__u32 
> > out[CHACHA20_BLOCK_WORDS])
> >   * enough) to mutate the CRNG key to provide backtracking protection.
> >   */
> >  static void _crng_backtrack_protect(struct crng_state *crng,
> > -   __u32 tmp[CHACHA20_BLOCK_WORDS], int used)
> > +   __u8 tmp[CHACHA20_BLOCK_SIZE], int used)
> >  {
> > unsigned long   flags;
> > __u32   *s, *d;
> > @@ -1015,14 +1015,14 @@ static void _crng_backtrack_protect(struct 
> > crng_state *crng,
> > used = 0;
> > }
> > spin_lock_irqsave(&crng->lock, flags);
> > -   s = &tmp[used / sizeof(__u32)];
> > +   s = (__u32 *) &tmp[used];
> 
> This introduces an alignment issue: tmp is not aligned for __u32, but is
> dereferenced as such later.
> 
> > d = &crng->state[4];
> > for (i=0; i < 8; i++)
> > *d++ ^= *s++;
> > spin_unlock_irqrestore(&crng->lock, flags);
> >  }
> >  
> 

I explained this in the patch; the callers ensure the buffer is aligned.

- Eric
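
For readers who have not seen the patch itself, the relevant change in
chacha20_block() is roughly the following (sketch, not the exact hunk): the
keystream words are serialized with put_unaligned_le32(), so the output buffer
passed by callers such as get_random_bytes() no longer needs u32 alignment.

/* Sketch of the output serialization after the patch. */
static void chacha20_block_output_sketch(const u32 x[16], u8 *stream)
{
        int i;

        for (i = 0; i < 16; i++)
                put_unaligned_le32(x[i], &stream[i * sizeof(u32)]);
}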


[PATCH net-next v4 18/20] crypto: port ChaCha20 to Zinc

2018-09-14 Thread Jason A. Donenfeld
Now that ChaCha20 is in Zinc, we can have the crypto API code simply
call into it. The crypto API expects to have a stored key per instance
and independent nonces, so we follow suit and store the key and
initialize the nonce independently.
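
The shape of such a shim, for readers unfamiliar with the crypto API side, is
roughly the following (hypothetical sketch; zinc_chacha20_crypt() is a
placeholder name, and the real chacha20_zinc.c in this patch differs in
detail): setkey stores the key in the tfm context, and the request handler
walks the scatterlist and hands each chunk to Zinc.

struct chacha20_key_ctx {
        u8 key[32];
};

static int example_setkey(struct crypto_skcipher *tfm, const u8 *key,
                          unsigned int keylen)
{
        struct chacha20_key_ctx *ctx = crypto_skcipher_ctx(tfm);

        if (keylen != sizeof(ctx->key))
                return -EINVAL;
        memcpy(ctx->key, key, sizeof(ctx->key));        /* per-instance key */
        return 0;
}

static int example_crypt(struct skcipher_request *req)
{
        struct chacha20_key_ctx *ctx =
                crypto_skcipher_ctx(crypto_skcipher_reqtfm(req));
        struct skcipher_walk walk;
        int err;

        err = skcipher_walk_virt(&walk, req, false);
        while (walk.nbytes) {
                /* Nonce comes from req->iv; Zinc does the actual work. */
                zinc_chacha20_crypt(walk.dst.virt.addr, walk.src.virt.addr,
                                    walk.nbytes, ctx->key, req->iv);
                err = skcipher_walk_done(&walk, 0);
        }
        return err;
}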

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Eric Biggers 
---
 arch/arm/configs/exynos_defconfig   |   1 -
 arch/arm/configs/multi_v7_defconfig |   1 -
 arch/arm/configs/omap2plus_defconfig|   1 -
 arch/arm/crypto/Kconfig |   6 -
 arch/arm/crypto/Makefile|   2 -
 arch/arm/crypto/chacha20-neon-core.S| 521 
 arch/arm/crypto/chacha20-neon-glue.c| 127 -
 arch/arm64/configs/defconfig|   1 -
 arch/arm64/crypto/Kconfig   |   6 -
 arch/arm64/crypto/Makefile  |   3 -
 arch/arm64/crypto/chacha20-neon-core.S  | 450 -
 arch/arm64/crypto/chacha20-neon-glue.c  | 133 -
 arch/x86/crypto/Makefile|   3 -
 arch/x86/crypto/chacha20-avx2-x86_64.S  | 448 -
 arch/x86/crypto/chacha20-ssse3-x86_64.S | 630 
 arch/x86/crypto/chacha20_glue.c | 146 --
 crypto/Kconfig  |  16 -
 crypto/Makefile |   2 +-
 crypto/chacha20_generic.c   | 136 -
 crypto/chacha20_zinc.c  | 100 
 crypto/chacha20poly1305.c   |   2 +-
 include/crypto/chacha20.h   |  12 -
 22 files changed, 102 insertions(+), 2645 deletions(-)
 delete mode 100644 arch/arm/crypto/chacha20-neon-core.S
 delete mode 100644 arch/arm/crypto/chacha20-neon-glue.c
 delete mode 100644 arch/arm64/crypto/chacha20-neon-core.S
 delete mode 100644 arch/arm64/crypto/chacha20-neon-glue.c
 delete mode 100644 arch/x86/crypto/chacha20-avx2-x86_64.S
 delete mode 100644 arch/x86/crypto/chacha20-ssse3-x86_64.S
 delete mode 100644 arch/x86/crypto/chacha20_glue.c
 delete mode 100644 crypto/chacha20_generic.c
 create mode 100644 crypto/chacha20_zinc.c

diff --git a/arch/arm/configs/exynos_defconfig 
b/arch/arm/configs/exynos_defconfig
index 27ea6dfcf2f2..95929b5e7b10 100644
--- a/arch/arm/configs/exynos_defconfig
+++ b/arch/arm/configs/exynos_defconfig
@@ -350,7 +350,6 @@ CONFIG_CRYPTO_SHA1_ARM_NEON=m
 CONFIG_CRYPTO_SHA256_ARM=m
 CONFIG_CRYPTO_SHA512_ARM=m
 CONFIG_CRYPTO_AES_ARM_BS=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
 CONFIG_CRC_CCITT=y
 CONFIG_FONTS=y
 CONFIG_FONT_7x14=y
diff --git a/arch/arm/configs/multi_v7_defconfig 
b/arch/arm/configs/multi_v7_defconfig
index fc33444e94f0..63be07724db3 100644
--- a/arch/arm/configs/multi_v7_defconfig
+++ b/arch/arm/configs/multi_v7_defconfig
@@ -1000,4 +1000,3 @@ CONFIG_CRYPTO_AES_ARM_BS=m
 CONFIG_CRYPTO_AES_ARM_CE=m
 CONFIG_CRYPTO_GHASH_ARM_CE=m
 CONFIG_CRYPTO_CRC32_ARM_CE=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
diff --git a/arch/arm/configs/omap2plus_defconfig 
b/arch/arm/configs/omap2plus_defconfig
index 6491419b1dad..f585a8ecc336 100644
--- a/arch/arm/configs/omap2plus_defconfig
+++ b/arch/arm/configs/omap2plus_defconfig
@@ -547,7 +547,6 @@ CONFIG_CRYPTO_SHA512_ARM=m
 CONFIG_CRYPTO_AES_ARM=m
 CONFIG_CRYPTO_AES_ARM_BS=m
 CONFIG_CRYPTO_GHASH_ARM_CE=m
-CONFIG_CRYPTO_CHACHA20_NEON=m
 CONFIG_CRC_CCITT=y
 CONFIG_CRC_T10DIF=y
 CONFIG_CRC_ITU_T=y
diff --git a/arch/arm/crypto/Kconfig b/arch/arm/crypto/Kconfig
index 925d1364727a..fb80fd89f0e7 100644
--- a/arch/arm/crypto/Kconfig
+++ b/arch/arm/crypto/Kconfig
@@ -115,12 +115,6 @@ config CRYPTO_CRC32_ARM_CE
depends on KERNEL_MODE_NEON && CRC32
select CRYPTO_HASH
 
-config CRYPTO_CHACHA20_NEON
-   tristate "NEON accelerated ChaCha20 symmetric cipher"
-   depends on KERNEL_MODE_NEON
-   select CRYPTO_BLKCIPHER
-   select CRYPTO_CHACHA20
-
 config CRYPTO_SPECK_NEON
tristate "NEON accelerated Speck cipher algorithms"
depends on KERNEL_MODE_NEON
diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 8de542c48ade..bbfa98447063 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -9,7 +9,6 @@ obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 obj-$(CONFIG_CRYPTO_SHA256_ARM) += sha256-arm.o
 obj-$(CONFIG_CRYPTO_SHA512_ARM) += sha512-arm.o
-obj-$(CONFIG_CRYPTO_CHACHA20_NEON) += chacha20-neon.o
 obj-$(CONFIG_CRYPTO_SPECK_NEON) += speck-neon.o
 
 ce-obj-$(CONFIG_CRYPTO_AES_ARM_CE) += aes-arm-ce.o
@@ -53,7 +52,6 @@ aes-arm-ce-y  := aes-ce-core.o aes-ce-glue.o
 ghash-arm-ce-y := ghash-ce-core.o ghash-ce-glue.o
 crct10dif-arm-ce-y := crct10dif-ce-core.o crct10dif-ce-glue.o
 crc32-arm-ce-y:= crc32-ce-core.o crc32-ce-glue.o
-chacha20-neon-y := chacha20-neon-core.o chacha20-neon-glue.o
 speck-neon-y := speck-neon-core.o speck-neon-glue.o
 
 ifdef REGENERATE_ARM_CRYPTO
diff --git a/arch/arm/crypto/chacha20-neon-core.S 
b/arch/arm/crypto/chacha20-neon-core.S
deleted file mode 100644
index 451a849ad518..0

[PATCH net-next v4 17/20] crypto: port Poly1305 to Zinc

2018-09-14 Thread Jason A. Donenfeld
Now that Poly1305 is in Zinc, we can have the crypto API code simply
call into it. We have to do a little bit of bookkeeping here, because
the crypto API receives the key in the first few calls to update.
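
The bookkeeping in question looks roughly like this (hypothetical sketch;
function and structure names are placeholders, not the ones in
poly1305_zinc.c): the first 32 bytes fed to ->update() are treated as the
one-time key before any message bytes reach Zinc.

struct poly1305_shim_ctx {
        struct zinc_poly1305_state state;   /* placeholder for Zinc's state */
        u8 key[32];
        unsigned int key_filled;
};

static int example_poly1305_update(struct shash_desc *desc, const u8 *src,
                                   unsigned int len)
{
        struct poly1305_shim_ctx *ctx = shash_desc_ctx(desc);

        /* Peel off up to 32 bytes of key before any real data. */
        while (ctx->key_filled < sizeof(ctx->key) && len) {
                ctx->key[ctx->key_filled++] = *src++;
                len--;
                if (ctx->key_filled == sizeof(ctx->key))
                        zinc_poly1305_init(&ctx->state, ctx->key);
        }
        if (len)
                zinc_poly1305_update(&ctx->state, src, len);
        return 0;
}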

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Eric Biggers 
---
 arch/x86/crypto/Makefile   |   3 -
 arch/x86/crypto/poly1305-avx2-x86_64.S | 388 
 arch/x86/crypto/poly1305-sse2-x86_64.S | 584 -
 arch/x86/crypto/poly1305_glue.c| 205 -
 crypto/Kconfig |  15 +-
 crypto/Makefile|   2 +-
 crypto/chacha20poly1305.c  |  12 +-
 crypto/poly1305_generic.c  | 304 -
 crypto/poly1305_zinc.c |  92 
 include/crypto/poly1305.h  |  40 --
 10 files changed, 101 insertions(+), 1544 deletions(-)
 delete mode 100644 arch/x86/crypto/poly1305-avx2-x86_64.S
 delete mode 100644 arch/x86/crypto/poly1305-sse2-x86_64.S
 delete mode 100644 arch/x86/crypto/poly1305_glue.c
 delete mode 100644 crypto/poly1305_generic.c
 create mode 100644 crypto/poly1305_zinc.c
 delete mode 100644 include/crypto/poly1305.h

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a450ad573dcb..cf830219846b 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -34,7 +34,6 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
-obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 
 obj-$(CONFIG_CRYPTO_AEGIS128_AESNI_SSE2) += aegis128-aesni.o
 obj-$(CONFIG_CRYPTO_AEGIS128L_AESNI_SSE2) += aegis128l-aesni.o
@@ -110,10 +109,8 @@ aesni-intel-y := aesni-intel_asm.o aesni-intel_glue.o fpu.o
 aesni-intel-$(CONFIG_64BIT) += aesni-intel_avx-x86_64.o aes_ctrby8_avx-x86_64.o
 ghash-clmulni-intel-y := ghash-clmulni-intel_asm.o ghash-clmulni-intel_glue.o
 sha1-ssse3-y := sha1_ssse3_asm.o sha1_ssse3_glue.o
-poly1305-x86_64-y := poly1305-sse2-x86_64.o poly1305_glue.o
 ifeq ($(avx2_supported),yes)
 sha1-ssse3-y += sha1_avx2_x86_64_asm.o
-poly1305-x86_64-y += poly1305-avx2-x86_64.o
 endif
 ifeq ($(sha1_ni_supported),yes)
 sha1-ssse3-y += sha1_ni_asm.o
diff --git a/arch/x86/crypto/poly1305-avx2-x86_64.S 
b/arch/x86/crypto/poly1305-avx2-x86_64.S
deleted file mode 100644
index 3b6e70d085da..
--- a/arch/x86/crypto/poly1305-avx2-x86_64.S
+++ /dev/null
@@ -1,388 +0,0 @@
-/*
- * Poly1305 authenticator algorithm, RFC7539, x64 AVX2 functions
- *
- * Copyright (C) 2015 Martin Willi
- *
- * This program is free software; you can redistribute it and/or modify
- * it under the terms of the GNU General Public License as published by
- * the Free Software Foundation; either version 2 of the License, or
- * (at your option) any later version.
- */
-
-#include 
-
-.section   .rodata.cst32.ANMASK, "aM", @progbits, 32
-.align 32
-ANMASK:.octa 0x03ff03ff
-   .octa 0x03ff03ff
-
-.section   .rodata.cst32.ORMASK, "aM", @progbits, 32
-.align 32
-ORMASK:.octa 0x01000100
-   .octa 0x01000100
-
-.text
-
-#define h0 0x00(%rdi)
-#define h1 0x04(%rdi)
-#define h2 0x08(%rdi)
-#define h3 0x0c(%rdi)
-#define h4 0x10(%rdi)
-#define r0 0x00(%rdx)
-#define r1 0x04(%rdx)
-#define r2 0x08(%rdx)
-#define r3 0x0c(%rdx)
-#define r4 0x10(%rdx)
-#define u0 0x00(%r8)
-#define u1 0x04(%r8)
-#define u2 0x08(%r8)
-#define u3 0x0c(%r8)
-#define u4 0x10(%r8)
-#define w0 0x14(%r8)
-#define w1 0x18(%r8)
-#define w2 0x1c(%r8)
-#define w3 0x20(%r8)
-#define w4 0x24(%r8)
-#define y0 0x28(%r8)
-#define y1 0x2c(%r8)
-#define y2 0x30(%r8)
-#define y3 0x34(%r8)
-#define y4 0x38(%r8)
-#define m %rsi
-#define hc0 %ymm0
-#define hc1 %ymm1
-#define hc2 %ymm2
-#define hc3 %ymm3
-#define hc4 %ymm4
-#define hc0x %xmm0
-#define hc1x %xmm1
-#define hc2x %xmm2
-#define hc3x %xmm3
-#define hc4x %xmm4
-#define t1 %ymm5
-#define t2 %ymm6
-#define t1x %xmm5
-#define t2x %xmm6
-#define ruwy0 %ymm7
-#define ruwy1 %ymm8
-#define ruwy2 %ymm9
-#define ruwy3 %ymm10
-#define ruwy4 %ymm11
-#define ruwy0x %xmm7
-#define ruwy1x %xmm8
-#define ruwy2x %xmm9
-#define ruwy3x %xmm10
-#define ruwy4x %xmm11
-#define svxz1 %ymm12
-#define svxz2 %ymm13
-#define svxz3 %ymm14
-#define svxz4 %ymm15
-#define d0 %r9
-#define d1 %r10
-#define d2 %r11
-#define d3 %r12
-#define d4 %r13
-
-ENTRY(poly1305_4block_avx2)
-   # %rdi: Accumulator h[5]
-   # %rsi: 64 byte input block m
-   # %rdx: Poly1305 key r[5]
-   # %rcx: Quadblock count
-   # %r8:  Poly1305 derived key r^2 u[5], r^3 w[5], r^4 y[5],
-
-   # This four-block variant uses loop unrolled block processing. It
-   # requires 4 Poly1305 keys: r, r^2, r^3 and r^4:
-   # h = (h + m) * r  => 

[PATCH net-next v4 19/20] security/keys: rewrite big_key crypto to use Zinc

2018-09-14 Thread Jason A. Donenfeld
A while back, I noticed that the crypto and crypto API usage in big_key
was entirely broken in multiple ways, so I rewrote it. Now, I'm
rewriting it again, but this time using Zinc's ChaCha20Poly1305
function. This makes the file considerably simpler; the diffstat
alone should justify this commit. It also should be faster, since it no
longer requires a mutex around the "aead api object" (nor allocations),
allowing us to encrypt multiple items in parallel. We also benefit from
being able to pass any type of pointer to Zinc, so we can get rid of the
ridiculously complex custom page allocator that big_key really doesn't
need.
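
To see why the diffstat shrinks so dramatically, the encryption path collapses
to roughly the following (sketch only; the constant and the exact
chacha20poly1305_encrypt() signature are assumed from the rest of this series,
and the real changes are in the diff below):

/* Sketch: one-shot AEAD with a fresh random key and a zero nonce, which is
 * safe because each key is used for exactly one encryption and big_key has
 * no ->update() that could reuse it.  'data' must have room for the tag.
 */
static void example_big_key_encrypt(u8 *data, size_t datalen,
                                    u8 key[CHACHA20POLY1305_KEYLEN])
{
        get_random_bytes(key, CHACHA20POLY1305_KEYLEN);
        chacha20poly1305_encrypt(data, data, datalen, NULL, 0, 0, key);
}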

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Eric Biggers 
Cc: David Howells 
---
 security/keys/Kconfig   |   4 +-
 security/keys/big_key.c | 230 +---
 2 files changed, 28 insertions(+), 206 deletions(-)

diff --git a/security/keys/Kconfig b/security/keys/Kconfig
index 6462e6654ccf..66ff26298fb3 100644
--- a/security/keys/Kconfig
+++ b/security/keys/Kconfig
@@ -45,9 +45,7 @@ config BIG_KEYS
bool "Large payload keys"
depends on KEYS
depends on TMPFS
-   select CRYPTO
-   select CRYPTO_AES
-   select CRYPTO_GCM
+   select ZINC_CHACHA20POLY1305
help
  This option provides support for holding large keys within the kernel
  (for example Kerberos ticket caches).  The data may be stored out to
diff --git a/security/keys/big_key.c b/security/keys/big_key.c
index 2806e70d7f8f..934497ecbf65 100644
--- a/security/keys/big_key.c
+++ b/security/keys/big_key.c
@@ -1,6 +1,6 @@
 /* Large capacity key type
  *
- * Copyright (C) 2017 Jason A. Donenfeld . All Rights Reserved.
+ * Copyright (C) 2017-2018 Jason A. Donenfeld . All Rights Reserved.
  * Copyright (C) 2013 Red Hat, Inc. All Rights Reserved.
  * Written by David Howells (dhowe...@redhat.com)
  *
@@ -16,20 +16,10 @@
 #include 
 #include 
 #include 
-#include 
 #include 
-#include 
 #include 
 #include 
-#include 
-#include 
-
-struct big_key_buf {
-   unsigned intnr_pages;
-   void*virt;
-   struct scatterlist  *sg;
-   struct page *pages[];
-};
+#include 
 
 /*
  * Layout of key payload words.
@@ -41,14 +31,6 @@ enum {
big_key_len,
 };
 
-/*
- * Crypto operation with big_key data
- */
-enum big_key_op {
-   BIG_KEY_ENC,
-   BIG_KEY_DEC,
-};
-
 /*
  * If the data is under this limit, there's no point creating a shm file to
  * hold it as the permanently resident metadata for the shmem fs will be at
@@ -56,16 +38,6 @@ enum big_key_op {
  */
 #define BIG_KEY_FILE_THRESHOLD (sizeof(struct inode) + sizeof(struct dentry))
 
-/*
- * Key size for big_key data encryption
- */
-#define ENC_KEY_SIZE 32
-
-/*
- * Authentication tag length
- */
-#define ENC_AUTHTAG_SIZE 16
-
 /*
  * big_key defined keys take an arbitrary string as the description and an
  * arbitrary blob of data as the payload
@@ -79,136 +51,20 @@ struct key_type key_type_big_key = {
.destroy= big_key_destroy,
.describe   = big_key_describe,
.read   = big_key_read,
-   /* no ->update(); don't add it without changing big_key_crypt() nonce */
+   /* no ->update(); don't add it without changing chacha20poly1305's nonce */
 };
 
-/*
- * Crypto names for big_key data authenticated encryption
- */
-static const char big_key_alg_name[] = "gcm(aes)";
-#define BIG_KEY_IV_SIZEGCM_AES_IV_SIZE
-
-/*
- * Crypto algorithms for big_key data authenticated encryption
- */
-static struct crypto_aead *big_key_aead;
-
-/*
- * Since changing the key affects the entire object, we need a mutex.
- */
-static DEFINE_MUTEX(big_key_aead_lock);
-
-/*
- * Encrypt/decrypt big_key data
- */
-static int big_key_crypt(enum big_key_op op, struct big_key_buf *buf, size_t datalen, u8 *key)
-{
-   int ret;
-   struct aead_request *aead_req;
-   /* We always use a zero nonce. The reason we can get away with this is
-* because we're using a different randomly generated key for every
-* different encryption. Notably, too, key_type_big_key doesn't define
-* an .update function, so there's no chance we'll wind up reusing the
-* key to encrypt updated data. Simply put: one key, one encryption.
-*/
-   u8 zero_nonce[BIG_KEY_IV_SIZE];
-
-   aead_req = aead_request_alloc(big_key_aead, GFP_KERNEL);
-   if (!aead_req)
-   return -ENOMEM;
-
-   memset(zero_nonce, 0, sizeof(zero_nonce));
-   aead_request_set_crypt(aead_req, buf->sg, buf->sg, datalen, zero_nonce);
-   aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_SLEEP, NULL, NULL);
-   aead_request_set_ad(aead_req, 0);
-
-   mutex_lock(&big_key_aead_lock);
-   if (crypto_aead_setkey(big_key_aead, key, ENC_KEY_SIZE)) {
-

[PATCH net-next v4 09/20] zinc: Poly1305 x86_64 implementation

2018-09-14 Thread Jason A. Donenfeld
This provides AVX, AVX-2, and AVX-512F implementations for Poly1305.
The AVX-512F implementation is disabled on Skylake, due to throttling.
These come from Andy Polyakov's implementation, with the following
modifications from Samuel Neves:

  - Some cosmetic changes, like renaming labels to .Lname, constants,
and other Linux conventions.

  - CPU feature checking is done in C by the glue code, so that has been
removed from the assembly.

  - poly1305_blocks_avx512 jumped to the middle of the poly1305_blocks_avx2
for the final blocks. To appease objtool, the relevant tail avx2 code
was duplicated for the avx512 function.

  - The original uses %rbp as a scratch register. However, the kernel
expects %rbp to be a valid frame pointer at any given time in order
to do proper unwinding. Thus we need to alter the code in order to
preserve it. The most straightforward manner in which this was
accomplished was by replacing $d3, formerly %r10, by %rdi, and
replacing %rbp by %r10. Because %rdi, a pointer to the context
structure, does not change and is not used by poly1305_iteration,
it is safe to use it here, and the overhead of saving and restoring
it should be minimal.

  - The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret".
We replace this by "ret". "rep ret" was meant to help with AMD K8
chips, cf. http://repzret.org/p/repzret. It makes no sense to
continue to use this kludge for code that won't even run on ancient
AMD chips.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Andy Polyakov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
---
 lib/zinc/Makefile|4 +
 lib/zinc/poly1305/poly1305-x86_64-glue.h |  109 +
 lib/zinc/poly1305/poly1305-x86_64.S  | 2792 ++
 3 files changed, 2905 insertions(+)
 create mode 100644 lib/zinc/poly1305/poly1305-x86_64-glue.h
 create mode 100644 lib/zinc/poly1305/poly1305-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index f37df89a3f87..72112f8ffba1 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -25,6 +25,10 @@ endif
 
 ifeq ($(CONFIG_ZINC_POLY1305),y)
 zinc-y += poly1305/poly1305.o
+ifeq ($(CONFIG_ZINC_ARCH_X86_64),y)
+zinc-y += poly1305/poly1305-x86_64.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-x86_64-glue.h
+endif
 ifeq ($(CONFIG_ZINC_ARCH_ARM),y)
 zinc-y += poly1305/poly1305-arm.o
 CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
diff --git a/lib/zinc/poly1305/poly1305-x86_64-glue.h b/lib/zinc/poly1305/poly1305-x86_64-glue.h
new file mode 100644
index ..4ae028101e7c
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-x86_64-glue.h
@@ -0,0 +1,109 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+
+asmlinkage void poly1305_init_x86_64(void *ctx,
+const u8 key[POLY1305_KEY_SIZE]);
+asmlinkage void poly1305_blocks_x86_64(void *ctx, const u8 *inp,
+  const size_t len, const u32 padbit);
+asmlinkage void poly1305_emit_x86_64(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+const u32 nonce[4]);
+#ifdef CONFIG_AS_AVX
+asmlinkage void poly1305_emit_avx(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+ const u32 nonce[4]);
+asmlinkage void poly1305_blocks_avx(void *ctx, const u8 *inp, const size_t len,
+   const u32 padbit);
+#endif
+#ifdef CONFIG_AS_AVX2
+asmlinkage void poly1305_blocks_avx2(void *ctx, const u8 *inp, const size_t len,
+const u32 padbit);
+#endif
+#ifdef CONFIG_AS_AVX512
+asmlinkage void poly1305_blocks_avx512(void *ctx, const u8 *inp,
+  const size_t len, const u32 padbit);
+#endif
+
+static bool poly1305_use_avx __ro_after_init;
+static bool poly1305_use_avx2 __ro_after_init;
+static bool poly1305_use_avx512 __ro_after_init;
+
+void __init poly1305_fpu_init(void)
+{
+   poly1305_use_avx =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+   poly1305_use_avx2 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+   poly1305_use_avx512 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   boot_cpu_has(X86_FEATURE_AVX512F) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL) &&
+   /* Skylake downclocks unacceptably much when using zmm. */
+   boot_

[PATCH net-next v4 07/20] zinc: Poly1305 generic C implementations and selftest

2018-09-14 Thread Jason A. Donenfeld
These two C implementations -- a 32x32 one and a 64x64 one, depending on
the platform -- come from Andrew Moon's public domain poly1305-donna
portable code, modified for usage in the kernel and for usage with
accelerated primitives.

Information: https://cr.yp.to/mac.html
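
As a usage illustration, a one-shot MAC over a buffer with the API
declared in include/zinc/poly1305.h below, combined with the simd helpers
from patch 01, looks roughly like this (compute_tag() is only an
illustrative name, not part of this patch):

    #include <zinc/poly1305.h>
    #include <linux/simd.h>

    static void compute_tag(u8 tag[POLY1305_MAC_SIZE], const u8 *msg,
                            size_t len, const u8 key[POLY1305_KEY_SIZE])
    {
            struct poly1305_ctx ctx;
            simd_context_t simd = simd_get();

            poly1305_init(&ctx, key, simd);
            poly1305_update(&ctx, msg, len, simd);
            poly1305_final(&ctx, tag, simd);
            simd_put(simd);
    }

The simd context is passed down so that a caller looping over many
messages can amortize the FPU save/restore across calls.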

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
---
 include/zinc/poly1305.h  |  38 ++
 lib/zinc/Kconfig |   4 +
 lib/zinc/Makefile|   4 +
 lib/zinc/main.c  |   5 +
 lib/zinc/poly1305/poly1305-donna32.h | 205 +++
 lib/zinc/poly1305/poly1305-donna64.h | 182 ++
 lib/zinc/poly1305/poly1305.c | 131 
 lib/zinc/selftest/poly1305.h | 876 +++
 8 files changed, 1445 insertions(+)
 create mode 100644 include/zinc/poly1305.h
 create mode 100644 lib/zinc/poly1305/poly1305-donna32.h
 create mode 100644 lib/zinc/poly1305/poly1305-donna64.h
 create mode 100644 lib/zinc/poly1305/poly1305.c
 create mode 100644 lib/zinc/selftest/poly1305.h

diff --git a/include/zinc/poly1305.h b/include/zinc/poly1305.h
new file mode 100644
index ..338430c8477a
--- /dev/null
+++ b/include/zinc/poly1305.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#ifndef _ZINC_POLY1305_H
+#define _ZINC_POLY1305_H
+
+#include 
+#include 
+
+enum poly1305_lengths {
+   POLY1305_BLOCK_SIZE = 16,
+   POLY1305_KEY_SIZE = 32,
+   POLY1305_MAC_SIZE = 16
+};
+
+struct poly1305_ctx {
+   u8 opaque[24 * sizeof(u64)];
+   u32 nonce[4];
+   u8 data[POLY1305_BLOCK_SIZE];
+   size_t num;
+} __aligned(8);
+
+void poly1305_fpu_init(void);
+
+void poly1305_init(struct poly1305_ctx *ctx, const u8 key[POLY1305_KEY_SIZE],
+  simd_context_t simd_context);
+void poly1305_update(struct poly1305_ctx *ctx, const u8 *input, size_t len,
+simd_context_t simd_context);
+void poly1305_final(struct poly1305_ctx *ctx, u8 mac[POLY1305_MAC_SIZE],
+   simd_context_t simd_context);
+
+#ifdef DEBUG
+bool poly1305_selftest(void);
+#endif
+
+#endif /* _ZINC_POLY1305_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index e7d396d61607..bc8c61334362 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -6,6 +6,10 @@ config ZINC_CHACHA20
select ZINC
select CRYPTO_ALGAPI
 
+config ZINC_POLY1305
+   bool
+   select ZINC
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 9f6a5e65d729..d1e3892e06d9 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -23,6 +23,10 @@ CFLAGS_chacha20.o += -include 
$(srctree)/$(src)/chacha20/chacha20-mips-glue.h
 endif
 endif
 
+ifeq ($(CONFIG_ZINC_POLY1305),y)
+zinc-y += poly1305/poly1305.o
+endif
+
 zinc-y += main.o
 
 obj-$(CONFIG_ZINC) := zinc.o
diff --git a/lib/zinc/main.c b/lib/zinc/main.c
index 7e8e84b706b7..d871dd406a5c 100644
--- a/lib/zinc/main.c
+++ b/lib/zinc/main.c
@@ -4,6 +4,7 @@
  */
 
 #include 
+#include 
 
 #include 
 #include 
@@ -21,6 +22,10 @@ static int __init mod_init(void)
 {
 #ifdef CONFIG_ZINC_CHACHA20
chacha20_fpu_init();
+#endif
+#ifdef CONFIG_ZINC_POLY1305
+   poly1305_fpu_init();
+   selftest(poly1305);
 #endif
return 0;
 }
diff --git a/lib/zinc/poly1305/poly1305-donna32.h b/lib/zinc/poly1305/poly1305-donna32.h
new file mode 100644
index ..dc32123210f9
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-donna32.h
@@ -0,0 +1,205 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ *
+ * This is based in part on Andrew Moon's poly1305-donna, which is in the
+ * public domain.
+ */
+
+struct poly1305_internal {
+   u32 h[5];
+   u32 r[5];
+   u32 s[4];
+};
+
+static void poly1305_init_generic(void *ctx, const u8 key[16])
+{
+   struct poly1305_internal *st = (struct poly1305_internal *)ctx;
+
+   /* r &= 0xffc0ffc0ffc0fff */
+   st->r[0] = (get_unaligned_le32(&key[0])) & 0x3ff;
+   st->r[1] = (get_unaligned_le32(&key[3]) >> 2) & 0x303;
+   st->r[2] = (get_unaligned_le32(&key[6]) >> 4) & 0x3ffc0ff;
+   st->r[3] = (get_unaligned_le32(&key[9]) >> 6) & 0x3f03fff;
+   st->r[4] = (get_unaligned_le32(&key[12]) >> 8) & 0x00f;
+
+   /* s = 5*r */
+   st->s[0] = st->r[1] * 5;
+   st->s[1] = st->r[2] * 5;
+   st->s[2] = st->r[3] * 5;
+   st->s[3] = st->r[4] * 5;
+
+   /* h = 0 */
+   st->h[0] = 0;
+   st->h[1] = 0;
+   st->h[2] = 0;
+   st->h[3] = 0;
+   st->h[4] = 0;
+}
+
+static void poly1305_blocks_generic(void *ctx, const u8 *input, size_t len,
+   const u32 padbit)
+{
+   struct poly1305_internal *st = (struct poly1305_inter

[PATCH net-next v4 13/20] zinc: BLAKE2s x86_64 implementation

2018-09-14 Thread Jason A. Donenfeld
These implementations from Samuel Neves support AVX and AVX-512VL.
Originally this used AVX-512F, but Skylake thermal throttling made
AVX-512VL more attractive and possible to do with negligible difference.
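
The glue below exposes blake2s_arch(), which returns false when the vector
unit cannot be used, so the portable C compression routine runs instead.
Roughly, the dispatch amounts to the following sketch, in which
blake2s_compress_generic() stands in for the C fallback and is only an
illustrative name:

    static void blake2s_compress(struct blake2s_state *state, const u8 *block,
                                 size_t nblocks, const u32 inc)
    {
            /* Prefer the AVX/AVX-512VL path; fall back to portable C. */
            if (blake2s_arch(state, block, nblocks, inc))
                    return;
            blake2s_compress_generic(state, block, nblocks, inc);
    }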

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
---
 lib/zinc/Makefile  |   4 +
 lib/zinc/blake2s/blake2s-x86_64-glue.h |  62 +++
 lib/zinc/blake2s/blake2s-x86_64.S  | 685 +
 3 files changed, 751 insertions(+)
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64-glue.h
 create mode 100644 lib/zinc/blake2s/blake2s-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 45817ec5539c..49455abfdf5e 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -53,6 +53,10 @@ endif
 
 ifeq ($(CONFIG_ZINC_BLAKE2S),y)
 zinc-y += blake2s/blake2s.o
+ifeq ($(CONFIG_ZINC_ARCH_X86_64),y)
+zinc-y += blake2s/blake2s-x86_64.o
+CFLAGS_blake2s.o += -include $(srctree)/$(src)/blake2s/blake2s-x86_64-glue.h
+endif
 endif
 
 zinc-y += main.o
diff --git a/lib/zinc/blake2s/blake2s-x86_64-glue.h b/lib/zinc/blake2s/blake2s-x86_64-glue.h
new file mode 100644
index ..2f5b74bd9117
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64-glue.h
@@ -0,0 +1,62 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_AS_AVX
+asmlinkage void blake2s_compress_avx(struct blake2s_state *state,
+const u8 *block, const size_t nblocks,
+const u32 inc);
+#endif
+#ifdef CONFIG_AS_AVX512
+asmlinkage void blake2s_compress_avx512(struct blake2s_state *state,
+   const u8 *block, const size_t nblocks,
+   const u32 inc);
+#endif
+
+static bool blake2s_use_avx __ro_after_init;
+static bool blake2s_use_avx512 __ro_after_init;
+
+void __init blake2s_fpu_init(void)
+{
+   blake2s_use_avx =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+   blake2s_use_avx512 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   boot_cpu_has(X86_FEATURE_AVX512F) &&
+   boot_cpu_has(X86_FEATURE_AVX512VL) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL);
+}
+
+static inline bool blake2s_arch(struct blake2s_state *state, const u8 *block,
+   size_t nblocks, const u32 inc)
+{
+#ifdef CONFIG_AS_AVX512
+   if (blake2s_use_avx512 && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   blake2s_compress_avx512(state, block, nblocks, inc);
+   kernel_fpu_end();
+   return true;
+   }
+#endif
+#ifdef CONFIG_AS_AVX
+   if (blake2s_use_avx && irq_fpu_usable()) {
+   kernel_fpu_begin();
+   blake2s_compress_avx(state, block, nblocks, inc);
+   kernel_fpu_end();
+   return true;
+   }
+#endif
+   return false;
+}
+
+#define HAVE_BLAKE2S_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/blake2s/blake2s-x86_64.S b/lib/zinc/blake2s/blake2s-x86_64.S
new file mode 100644
index ..6a84b9f1f2c4
--- /dev/null
+++ b/lib/zinc/blake2s/blake2s-x86_64.S
@@ -0,0 +1,685 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ * Copyright (C) 2017 Samuel Neves . All Rights Reserved.
+ */
+
+#include 
+
+.section .rodata.cst32.BLAKE2S_IV, "aM", @progbits, 32
+.align 32
+IV:.octa 0xA54FF53A3C6EF372BB67AE856A09E667
+   .octa 0x5BE0CD191F83D9AB9B05688C510E527F
+.section .rodata.cst16.ROT16, "aM", @progbits, 16
+.align 16
+ROT16: .octa 0x0D0C0F0E09080B0A0504070601000302
+.section .rodata.cst16.ROR328, "aM", @progbits, 16
+.align 16
+ROR328:.octa 0x0C0F0E0D080B0A090407060500030201
+#ifdef CONFIG_AS_AVX512
+.section .rodata.cst64.BLAKE2S_SIGMA, "aM", @progbits, 640
+.align 64
+SIGMA:
+.long 0, 2, 4, 6, 1, 3, 5, 7, 8, 10, 12, 14, 9, 11, 13, 15
+.long 11, 2, 12, 14, 9, 8, 15, 3, 4, 0, 13, 6, 10, 1, 7, 5
+.long 10, 12, 11, 6, 5, 9, 13, 3, 4, 15, 14, 2, 0, 7, 8, 1
+.long 10, 9, 7, 0, 11, 14, 1, 12, 6, 2, 15, 3, 13, 8, 5, 4
+.long 4, 9, 8, 13, 14, 0, 10, 11, 7, 3, 12, 1, 5, 6, 15, 2
+.long 2, 10, 4, 14, 13, 3, 9, 11, 6, 5, 7, 12, 15, 1, 8, 0
+.long 4, 11, 14, 8, 13, 10, 12, 5, 2, 1, 15, 3, 9, 7, 0, 6
+.long 6, 12, 0, 13, 15, 2, 1, 10, 4, 5, 11, 14, 8, 3, 9, 7
+.long 14, 5, 4, 12, 9, 7, 3, 10, 2, 0, 6, 15, 11, 1, 13, 8
+.long 11, 7, 13, 10, 12, 14, 0, 15, 4, 5, 6, 9, 2, 1, 8, 3
+#endif /* CONFIG_AS_AVX512 */
+
+.text
+#ifdef CONFIG_AS_AVX
+ENTRY(blake2s_compress_avx)
+   m

[PATCH net-next v4 10/20] zinc: Poly1305 MIPS32r2 and MIPS64 implementations

2018-09-14 Thread Jason A. Donenfeld
This MIPS32r2 implementation comes from René van Dorst and me and
results in a nice speedup on the usual OpenWRT targets. The MIPS64
implementation comes from Andy Polyakov and results in a nice speedup on
commodity Octeon hardware, and has been modified slightly from the
original:

- The function names have been renamed to fit kernel conventions.
- A comment has been added.

No changes have been made to the actual instructions.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: René van Dorst 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Andy Polyakov 
Cc: Ralf Baechle 
Cc: Paul Burton 
Cc: James Hogan 
Cc: linux-m...@linux-mips.org
---
 lib/zinc/Makefile  |   8 +
 lib/zinc/poly1305/poly1305-mips-glue.h |  40 +++
 lib/zinc/poly1305/poly1305-mips.S  | 417 +
 lib/zinc/poly1305/poly1305-mips64.S| 359 +
 4 files changed, 824 insertions(+)
 create mode 100644 lib/zinc/poly1305/poly1305-mips-glue.h
 create mode 100644 lib/zinc/poly1305/poly1305-mips.S
 create mode 100644 lib/zinc/poly1305/poly1305-mips64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 72112f8ffba1..c8c49d59794b 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -37,6 +37,14 @@ ifeq ($(CONFIG_ZINC_ARCH_ARM64),y)
 zinc-y += poly1305/poly1305-arm64.o
 CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-arm-glue.h
 endif
+ifeq ($(CONFIG_ZINC_ARCH_MIPS)$(CONFIG_CPU_MIPS32_R2),yy)
+zinc-y += poly1305/poly1305-mips.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-mips-glue.h
+endif
+ifeq ($(CONFIG_ZINC_ARCH_MIPS64),y)
+zinc-y += poly1305/poly1305-mips64.o
+CFLAGS_poly1305.o += -include $(srctree)/$(src)/poly1305/poly1305-mips-glue.h
+endif
 endif
 
 zinc-y += main.o
diff --git a/lib/zinc/poly1305/poly1305-mips-glue.h b/lib/zinc/poly1305/poly1305-mips-glue.h
new file mode 100644
index ..e29f85915eec
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-mips-glue.h
@@ -0,0 +1,40 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+
+asmlinkage void poly1305_init_mips(void *ctx, const u8 key[16]);
+asmlinkage void poly1305_blocks_mips(void *ctx, const u8 *inp, const size_t len,
+const u32 padbit);
+asmlinkage void poly1305_emit_mips(void *ctx, u8 mac[16], const u32 nonce[4]);
+void __init poly1305_fpu_init(void)
+{
+}
+
+static inline bool poly1305_init_arch(void *ctx,
+ const u8 key[POLY1305_KEY_SIZE],
+ simd_context_t simd_context)
+{
+   poly1305_init_mips(ctx, key);
+   return true;
+}
+
+static inline bool poly1305_blocks_arch(void *ctx, const u8 *inp,
+   const size_t len, const u32 padbit,
+   simd_context_t simd_context)
+{
+   poly1305_blocks_mips(ctx, inp, len, padbit);
+   return true;
+}
+
+static inline bool poly1305_emit_arch(void *ctx, u8 mac[POLY1305_MAC_SIZE],
+ const u32 nonce[4],
+ simd_context_t simd_context)
+{
+   poly1305_emit_mips(ctx, mac, nonce);
+   return true;
+}
+
+#define HAVE_POLY1305_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/poly1305/poly1305-mips.S b/lib/zinc/poly1305/poly1305-mips.S
new file mode 100644
index ..32d8558d8601
--- /dev/null
+++ b/lib/zinc/poly1305/poly1305-mips.S
@@ -0,0 +1,417 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2016-2018 René van Dorst  All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#else
+#define MSB 3
+#define LSB 0
+#endif
+
+#define POLY1305_BLOCK_SIZE 16
+.text
+#define H0 $t0
+#define H1 $t1
+#define H2 $t2
+#define H3 $t3
+#define H4 $t4
+
+#define R0 $t5
+#define R1 $t6
+#define R2 $t7
+#define R3 $t8
+
+#define O0 $s0
+#define O1 $s4
+#define O2 $v1
+#define O3 $t9
+#define O4 $s5
+
+#define S1 $s1
+#define S2 $s2
+#define S3 $s3
+
+#define SC $at
+#define CA $v0
+
+/* Input arguments */
+#define poly   $a0
+#define src$a1
+#define srclen $a2
+#define hibit  $a3
+
+/* Location in the opaque buffer
+ * R[0..3], CA, H[0..4]
+ */
+#define PTR_POLY1305_R(n) ( 0 + (n*4)) ## ($a0)
+#define PTR_POLY1305_CA   (16) ## ($a0)
+#define PTR_POLY1305_H(n) (20 + (n*4)) ## ($a0)
+
+#define POLY1305_BLOCK_SIZE 16
+#define POLY1305_STACK_SIZE 8 * 4
+
+.set reorder
+.set noat
+.align 4
+.globl poly1305_blocks_mips
+.ent poly1305_blocks_mips
+poly1305_blocks_mips:
+   .frame  $sp,POLY1305_STACK_SIZE,$31
+   /* srclen &= 0xFFF0 */
+   ins srclen, $zero, 0, 4
+
+   .set noreorder
+   /* check srclen >= 16 bytes */
+   beqzsrclen, .Lpoly1305_blocks_mips_end
+   addiu   $sp, -(POLY1305_STAC

[PATCH net-next v4 08/20] zinc: Poly1305 ARM and ARM64 implementations

2018-09-14 Thread Jason A. Donenfeld
These NEON and non-NEON implementations come from Andy Polyakov's
implementation. They are exactly the same as Andy Polyakov's original,
with the following exceptions:

- Entries and exits use the proper kernel convention macro.
- CPU feature checking is done in C by the glue code, so that has been
  removed from the assembly.
- The function names have been renamed to fit kernel conventions.
- Labels have been renamed to fit kernel conventions.
- The neon code can jump to the scalar code when it makes sense to do
  so.

After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual
instructions from the original.

ARM:

-poly1305_init:
-.Lpoly1305_init:
+ENTRY(poly1305_init_arm)
stmdb   sp!,{r4-r11}

eor r3,r3,r3
@@ -18,8 +25,6 @@
moveq   r0,#0
beq .Lno_key

-   adr r11,.Lpoly1305_init
-   ldr r12,.LOPENSSL_armcap
ldrbr4,[r1,#0]
mov r10,#0x0fff
ldrbr5,[r1,#1]
@@ -34,8 +39,6 @@
ldrbr7,[r1,#6]
and r4,r4,r10

-   ldr r12,[r11,r12]   @ OPENSSL_armcap_P
-   ldr r12,[r12]
ldrbr8,[r1,#7]
orr r5,r5,r6,lsl#8
ldrbr6,[r1,#8]
@@ -45,22 +48,6 @@
ldrbr8,[r1,#10]
and r5,r5,r3

-   tst r12,#ARMV7_NEON @ check for NEON
-   adr r9,poly1305_blocks_neon
-   adr r11,poly1305_blocks
-   it  ne
-   movne   r11,r9
-   adr r12,poly1305_emit
-   adr r10,poly1305_emit_neon
-   it  ne
-   movne   r12,r10
-   itete   eq
-   addeq   r12,r11,#(poly1305_emit-.Lpoly1305_init)
-   addne   r12,r11,#(poly1305_emit_neon-.Lpoly1305_init)
-   addeq   r11,r11,#(poly1305_blocks-.Lpoly1305_init)
-   addne   r11,r11,#(poly1305_blocks_neon-.Lpoly1305_init)
-   orr r12,r12,#1  @ thumb-ify address
-   orr r11,r11,#1
ldrbr9,[r1,#11]
orr r6,r6,r7,lsl#8
ldrbr7,[r1,#12]
@@ -79,17 +66,16 @@
str r6,[r0,#8]
and r7,r7,r3
str r7,[r0,#12]
-   stmia   r2,{r11,r12}@ fill functions table
-   mov r0,#1
-   mov r0,#0
 .Lno_key:
ldmia   sp!,{r4-r11}
bx  lr  @ bxlr
tst lr,#1
moveq   pc,lr   @ be binary compatible with V4, yet
.word   0xe12fff1e  @ interoperable with Thumb ISA:-)
-poly1305_blocks:
-.Lpoly1305_blocks:
+ENDPROC(poly1305_init_arm)
+
+ENTRY(poly1305_blocks_arm)
+.Lpoly1305_blocks_arm:
stmdb   sp!,{r3-r11,lr}

andsr2,r2,#-16
@@ -231,10 +217,11 @@
tst lr,#1
moveq   pc,lr   @ be binary compatible with V4, yet
.word   0xe12fff1e  @ interoperable with Thumb ISA:-)
-poly1305_emit:
+ENDPROC(poly1305_blocks_arm)
+
+ENTRY(poly1305_emit_arm)
stmdb   sp!,{r4-r11}
 .Lpoly1305_emit_enter:
-
ldmia   r0,{r3-r7}
addsr8,r3,#5@ compare to modulus
adcsr9,r4,#0
@@ -305,8 +292,12 @@
tst lr,#1
moveq   pc,lr   @ be binary compatible with V4, yet
.word   0xe12fff1e  @ interoperable with Thumb ISA:-)
+ENDPROC(poly1305_emit_arm)
+
+

-poly1305_init_neon:
+ENTRY(poly1305_init_neon)
+.Lpoly1305_init_neon:
ldr r4,[r0,#20] @ load key base 2^32
ldr r5,[r0,#24]
ldr r6,[r0,#28]
@@ -515,8 +506,9 @@
vst1.32 {d8[1]},[r7]

bx  lr  @ bxlr
+ENDPROC(poly1305_init_neon)

-poly1305_blocks_neon:
+ENTRY(poly1305_blocks_neon)
ldr ip,[r0,#36] @ is_base2_26
andsr2,r2,#-16
beq .Lno_data_neon
@@ -524,7 +516,7 @@
cmp r2,#64
bhs .Lenter_neon
tst ip,ip   @ is_base2_26?
-   beq .Lpoly1305_blocks
+   beq .Lpoly1305_blocks_arm

 .Lenter_neon:
stmdb   sp!,{r4-r7}
@@ -534,7 +526,7 @@
bne .Lbase2_26_neon

stmdb   sp!,{r1-r3,lr}
-   bl  poly1305_init_neon
+   bl  .Lpoly1305_init_neon

ldr r4,[r0,#0]  @ load hash value base 2^32
ldr r5,[r0,#4]
@@ -989,8 +981,9 @@
ldmia   sp!,{r4-r7}
 .Lno_data_neon:
bx  lr  @ bxlr
+ENDPROC(poly1305_blocks_neon)

-poly1305_emit_neon:
+ENTRY(poly1305_emit_neon)
ldr ip,[r0,#36] @ is_base2_26

stmdb   sp!,{r4-r11}
@@ -1055,6 +1048,6 @@

ldmia   sp!,{r4-r11}
bx  lr  @ bxlr
+ENDPROC(poly1305_emit_neon)

ARM64:

-poly1305_init:
+ENTRY(poly1305_init_arm)
cmp x1,xzr
stp xzr,xzr,[x0]// zero hash value
stp xzr,xzr,[x0,#16]// [along with is_base2_26]
@@ -11,14 +1

[PATCH net-next v4 05/20] zinc: ChaCha20 x86_64 implementation

2018-09-14 Thread Jason A. Donenfeld
This provides SSSE3, AVX-2, AVX-512F, and AVX-512VL implementations for
ChaCha20. The AVX-512F implementation is disabled on Skylake, due to
throttling, and the VL ymm implementation is used instead. These come
from Andy Polyakov's implementation, with the following modifications
from Samuel Neves:

  - Some cosmetic changes, like renaming labels to .Lname, constants,
and other Linux conventions.

  - CPU feature checking is done in C by the glue code, so that has been
removed from the assembly.

  - Eliminate translating certain instructions, such as pshufb, palignr,
vprotd, etc, to .byte directives. This is meant for compatibility
with ancient toolchains, but presumably it is unnecessary here,
since the build system already does checks on what GNU as can
assemble.

  - When aligning the stack, the original code was saving %rsp to %r9.
To keep objtool happy, we use instead the DRAP idiom to save %rsp
to %r10:

  leaq8(%rsp),%r10
  ... code here ...
  leaq-8(%r10),%rsp

  - The original code assumes the stack comes aligned to 16 bytes. This
is not necessarily the case, and to avoid crashes,
`andq $-alignment, %rsp` was added in the prolog of a few functions.

  - The original hardcodes returns as .byte 0xf3,0xc3, aka "rep ret".
We replace this by "ret". "rep ret" was meant to help with AMD K8
chips, cf. http://repzret.org/p/repzret. It makes no sense to
continue to use this kludge for code that won't even run on ancient
AMD chips.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Andy Polyakov 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: x...@kernel.org
---
 lib/zinc/Makefile|4 +
 lib/zinc/chacha20/chacha20-x86_64-glue.h |  102 +
 lib/zinc/chacha20/chacha20-x86_64.S  | 2632 ++
 3 files changed, 2738 insertions(+)
 create mode 100644 lib/zinc/chacha20/chacha20-x86_64-glue.h
 create mode 100644 lib/zinc/chacha20/chacha20-x86_64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 8d14cb13349a..32e4bd94ea0b 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -5,6 +5,10 @@ ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
 
 ifeq ($(CONFIG_ZINC_CHACHA20),y)
 zinc-y += chacha20/chacha20.o
+ifeq ($(CONFIG_ZINC_ARCH_X86_64),y)
+zinc-y += chacha20/chacha20-x86_64.o
+CFLAGS_chacha20.o += -include $(srctree)/$(src)/chacha20/chacha20-x86_64-glue.h
+endif
 ifeq ($(CONFIG_ZINC_ARCH_ARM),y)
 zinc-y += chacha20/chacha20-arm.o
 CFLAGS_chacha20.o += -include $(srctree)/$(src)/chacha20/chacha20-arm-glue.h
diff --git a/lib/zinc/chacha20/chacha20-x86_64-glue.h b/lib/zinc/chacha20/chacha20-x86_64-glue.h
new file mode 100644
index ..e4f6c3162d3f
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-x86_64-glue.h
@@ -0,0 +1,102 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#ifdef CONFIG_AS_SSSE3
+asmlinkage void hchacha20_ssse3(u8 *derived_key, const u8 *nonce,
+   const u8 *key);
+asmlinkage void chacha20_ssse3(u8 *out, const u8 *in, const size_t len,
+  const u32 key[8], const u32 counter[4]);
+#endif
+#ifdef CONFIG_AS_AVX2
+asmlinkage void chacha20_avx2(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+#endif
+#ifdef CONFIG_AS_AVX512
+asmlinkage void chacha20_avx512(u8 *out, const u8 *in, const size_t len,
+   const u32 key[8], const u32 counter[4]);
+asmlinkage void chacha20_avx512vl(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+#endif
+
+static bool chacha20_use_ssse3 __ro_after_init;
+static bool chacha20_use_avx2 __ro_after_init;
+static bool chacha20_use_avx512 __ro_after_init;
+static bool chacha20_use_avx512vl __ro_after_init;
+
+void __init chacha20_fpu_init(void)
+{
+   chacha20_use_ssse3 = boot_cpu_has(X86_FEATURE_SSSE3);
+   chacha20_use_avx2 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM, NULL);
+   chacha20_use_avx512 =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   boot_cpu_has(X86_FEATURE_AVX512F) &&
+   cpu_has_xfeatures(XFEATURE_MASK_SSE | XFEATURE_MASK_YMM |
+ XFEATURE_MASK_AVX512, NULL) &&
+   /* Skylake downclocks unacceptably much when using zmm. */
+   boot_cpu_data.x86_model != INTEL_FAM6_SKYLAKE_X;
+   chacha20_use_avx512vl =
+   boot_cpu_has(X86_FEATURE_AVX) &&
+   boot_cpu_has(X86_FEATURE_AVX2) &&
+   boot_cpu_has(X86_FEATURE_AVX512F) 

[PATCH net-next v4 06/20] zinc: ChaCha20 MIPS32r2 implementation

2018-09-14 Thread Jason A. Donenfeld
This MIPS32r2 implementation comes from René van Dorst and me and
results in a nice speedup on the usual OpenWRT targets.

Signed-off-by: Jason A. Donenfeld 
Signed-off-by: René van Dorst 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Ralf Baechle 
Cc: Paul Burton 
Cc: James Hogan 
Cc: linux-m...@linux-mips.org
---
 lib/zinc/Makefile  |   4 +
 lib/zinc/chacha20/chacha20-mips-glue.h |  28 ++
 lib/zinc/chacha20/chacha20-mips.S  | 474 +
 3 files changed, 506 insertions(+)
 create mode 100644 lib/zinc/chacha20/chacha20-mips-glue.h
 create mode 100644 lib/zinc/chacha20/chacha20-mips.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 32e4bd94ea0b..9f6a5e65d729 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -17,6 +17,10 @@ ifeq ($(CONFIG_ZINC_ARCH_ARM64),y)
 zinc-y += chacha20/chacha20-arm64.o
 CFLAGS_chacha20.o += -include $(srctree)/$(src)/chacha20/chacha20-arm-glue.h
 endif
+ifeq ($(CONFIG_ZINC_ARCH_MIPS)$(CONFIG_CPU_MIPS32_R2),yy)
+zinc-y += chacha20/chacha20-mips.o
+CFLAGS_chacha20.o += -include $(srctree)/$(src)/chacha20/chacha20-mips-glue.h
+endif
 endif
 
 zinc-y += main.o
diff --git a/lib/zinc/chacha20/chacha20-mips-glue.h b/lib/zinc/chacha20/chacha20-mips-glue.h
new file mode 100644
index ..5b2c8cec36c8
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-mips-glue.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+
+asmlinkage void chacha20_mips(u8 *out, const u8 *in, const size_t len,
+ const u32 key[8], const u32 counter[4]);
+void __init chacha20_fpu_init(void)
+{
+}
+
+static inline bool chacha20_arch(u8 *dst, const u8 *src, const size_t len,
+const u32 key[8], const u32 counter[4],
+simd_context_t simd_context)
+{
+   chacha20_mips(dst, src, len, key, counter);
+   return true;
+}
+
+static inline bool hchacha20_arch(u8 *derived_key, const u8 *nonce,
+ const u8 *key, simd_context_t simd_context)
+{
+   return false;
+}
+
+#define HAVE_CHACHA20_ARCH_IMPLEMENTATION
diff --git a/lib/zinc/chacha20/chacha20-mips.S b/lib/zinc/chacha20/chacha20-mips.S
new file mode 100644
index ..77da2c2fb240
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20-mips.S
@@ -0,0 +1,474 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2016-2018 René van Dorst . All Rights Reserved.
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#define MASK_U32   0x3c
+#define MASK_BYTES 0x03
+#define CHACHA20_BLOCK_SIZE 64
+#define STACK_SIZE 4*16
+
+#define X0  $t0
+#define X1  $t1
+#define X2  $t2
+#define X3  $t3
+#define X4  $t4
+#define X5  $t5
+#define X6  $t6
+#define X7  $t7
+#define X8  $v1
+#define X9  $fp
+#define X10 $s7
+#define X11 $s6
+#define X12 $s5
+#define X13 $s4
+#define X14 $s3
+#define X15 $s2
+/* Use regs which are overwritten on exit for Tx so we don't leak clear data. */
+#define T0  $s1
+#define T1  $s0
+#define T(n) T ## n
+#define X(n) X ## n
+
+/* Input arguments */
+#define OUT$a0
+#define IN $a1
+#define BYTES  $a2
+/* KEY and NONCE argument must be u32 aligned */
+#define KEY$a3
+/* NONCE pointer is given via stack */
+#define NONCE  $t9
+
+/* Output argument */
+/* NONCE[0] is kept in a register and not in memory.
+ * We don't want to touch original value in memory.
+ * Must be incremented every loop iteration.
+ */
+#define NONCE_0$v0
+
+/* SAVED_X and SAVED_CA are set in the jump table.
+ * Use regs which are overwritten on exit else we don't leak clear data.
+ * They are used to handling the last bytes which are not multiple of 4.
+ */
+#define SAVED_XX15
+#define SAVED_CA   $ra
+
+#define PTR_LAST_ROUND $t8
+
+/* ChaCha20 constants and stack location */
+#define CONSTANT_OFS_SP48
+#define UNALIGNED_OFS_SP 40
+
+#define CONSTANT_1 0x61707865
+#define CONSTANT_2 0x3320646e
+#define CONSTANT_3 0x79622d32
+#define CONSTANT_4 0x6b206574
+
+#if __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
+#define MSB 0
+#define LSB 3
+#define ROTx rotl
+#define ROTR(n) rotr n, 24
+#defineCPU_TO_LE32(n) \
+   wsbhn; \
+   rotrn, 16;
+#else
+#define MSB 3
+#define LSB 0
+#define ROTx rotr
+#define CPU_TO_LE32(n)
+#define ROTR(n)
+#endif
+
+#define STORE_UNALIGNED(x, a, s, o) \
+.Lchacha20_mips_xor_unaligned_ ## x ## _b: ; \
+   .if ((s != NONCE) || (o != 0)); \
+   lw  T0, o(s); \
+   .endif; \
+   lwl T1, x-4+MSB ## (IN); \
+   lwr T1, x-4+LSB ## (IN); \
+   .if ((s == NONCE) && (o == 0)); \
+   adduX ## a, NONCE_0; \
+   .else; \
+   adduX ## a, T0; \
+   .endif; \
+   CPU_TO_LE32(X ## a); \
+   xor  

[PATCH net-next v4 04/20] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-14 Thread Jason A. Donenfeld
These NEON and non-NEON implementations come from Andy Polyakov's
implementation. They are exactly the same as Andy Polyakov's original,
with the following exceptions:

- Entries and exits use the proper kernel convention macro.
- CPU feature checking is done in C by the glue code, so that has been
  removed from the assembly.
- The function names have been renamed to fit kernel conventions.
- Labels have been renamed (prefixed with .L) to fit kernel conventions.
- Constants have been rearranged so that they are closer to the code
  that is using them. [ARM only]
- The neon code can jump to the scalar code when it makes sense to do
  so.
- The neon_512 function as a separate function has been removed, leaving
  the decision up to the main neon entry point. [ARM64 only]

After '/^#/d;/^\..*[^:]$/d', the code has the following diff in actual
instructions from the original.

ARM:

-ChaCha20_ctr32:
-.LChaCha20_ctr32:
+ENTRY(chacha20_arm)
ldr r12,[sp,#0] @ pull pointer to counter and nonce
stmdb   sp!,{r0-r2,r4-r11,lr}
-   sub r14,pc,#16  @ ChaCha20_ctr32
-   adr r14,.LChaCha20_ctr32
cmp r2,#0   @ len==0?
itt eq
addeq   sp,sp,#4*3
-   beq .Lno_data
-   cmp r2,#192 @ test len
-   bls .Lshort
-   ldr r4,[r14,#-32]
-   ldr r4,[r14,r4]
-   ldr r4,[r4]
-   tst r4,#ARMV7_NEON
-   bne .LChaCha20_neon
+   beq .Lno_data_arm
 .Lshort:
ldmia   r12,{r4-r7} @ load counter and nonce
sub sp,sp,#4*(16)   @ off-load area
-   sub r14,r14,#64 @ .Lsigma
+   sub r14,pc,#100 @ .Lsigma
+   adr r14,.Lsigma @ .Lsigma
stmdb   sp!,{r4-r7} @ copy counter and nonce
ldmia   r3,{r4-r11} @ load key
ldmia   r14,{r0-r3} @ load sigma
@@ -617,14 +615,25 @@

 .Ldone:
add sp,sp,#4*(32+3)
-.Lno_data:
+.Lno_data_arm:
ldmia   sp!,{r4-r11,pc}
+ENDPROC(chacha20_arm)

-ChaCha20_neon:
+ENTRY(chacha20_neon)
ldr r12,[sp,#0] @ pull pointer to counter and nonce
stmdb   sp!,{r0-r2,r4-r11,lr}
-.LChaCha20_neon:
-   adr r14,.Lsigma
+   cmp r2,#0   @ len==0?
+   itt eq
+   addeq   sp,sp,#4*3
+   beq .Lno_data_neon
+   cmp r2,#192 @ test len
+   bls .Lshort
+.Lchacha20_neon_begin:
+   adr r14,.Lsigma2
vstmdb  sp!,{d8-d15}@ ABI spec says so
stmdb   sp!,{r0-r3}

@@ -1265,4 +1274,6 @@
add sp,sp,#4*(32+4)
vldmia  sp,{d8-d15}
add sp,sp,#4*(16+3)
+.Lno_data_neon:
ldmia   sp!,{r4-r11,pc}
+ENDPROC(chacha20_neon)

ARM64:

-ChaCha20_ctr32:
+ENTRY(chacha20_arm)
cbz x2,.Labort
-   adr x5,.LOPENSSL_armcap_P
-   cmp x2,#192
-   b.lo.Lshort
-   ldrsw   x6,[x5]
-   ldr x6,[x5]
-   ldr w17,[x6,x5]
-   tst w17,#ARMV7_NEON
-   b.neChaCha20_neon
-
 .Lshort:
stp x29,x30,[sp,#-96]!
add x29,sp,#0
@@ -279,8 +274,13 @@
ldp x27,x28,[x29,#80]
ldp x29,x30,[sp],#96
ret
+ENDPROC(chacha20_arm)
+
+ENTRY(chacha20_neon)
+   cbz x2,.Labort_neon
+   cmp x2,#192
+   b.lo.Lshort

-ChaCha20_neon:
stp x29,x30,[sp,#-96]!
add x29,sp,#0

@@ -763,16 +763,6 @@
ldp x27,x28,[x29,#80]
ldp x29,x30,[sp],#96
ret
-ChaCha20_512_neon:
-   stp x29,x30,[sp,#-96]!
-   add x29,sp,#0
-
-   adr x5,.Lsigma
-   stp x19,x20,[sp,#16]
-   stp x21,x22,[sp,#32]
-   stp x23,x24,[sp,#48]
-   stp x25,x26,[sp,#64]
-   stp x27,x28,[sp,#80]

 .L512_or_more_neon:
sub sp,sp,#128+64
@@ -1920,4 +1910,6 @@
ldp x25,x26,[x29,#64]
ldp x27,x28,[x29,#80]
ldp x29,x30,[sp],#96
+.Labort_neon:
ret
+ENDPROC(chacha20_neon)

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
Cc: Andy Polyakov 
Cc: Russell King 
Cc: linux-arm-ker...@lists.infradead.org
---
 lib/zinc/Makefile |8 +
 lib/zinc/chacha20/chacha20-arm-glue.h |   50 +
 lib/zinc/chacha20/chacha20-arm.S  | 1473 +++
 lib/zinc/chacha20/chacha20-arm64.S| 1942 +
 4 files changed, 3473 insertions(+)
 create mode 100644 lib/zinc/chacha20/chacha20-arm-glue.h
 create mode 100644 lib/zinc/chacha20/chacha20-arm.S
 create mode 100644 lib/zinc/chacha20/chacha20-arm64.S

diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index 0b5a964bfba6..8d14cb13349a 100644
--- a/lib/z

[PATCH net-next v4 02/20] zinc: introduce minimal cryptography library

2018-09-14 Thread Jason A. Donenfeld
Zinc stands for "Zinc Is Neat Crypto" or "Zinc as IN Crypto" or maybe
just "Zx2c4's INsane Cryptolib." It's also short, easy to type, and
plays nicely with the recent trend of naming crypto libraries after
elements. The guiding principle is "don't overdo it". It's less of a
library and more of a directory tree for organizing well-curated direct
implementations of cryptography primitives.

Zinc is a new cryptography API that is much more minimal and lower-level
than the current one. It intends to complement it and provide a basis
upon which the current crypto API might build, as the provider of
software implementations of cryptographic primitives. It is motivated by
three primary observations in crypto API design:

  * Highly composable "cipher modes" and related abstractions from
90s cryptographers did not turn out to be as terrific an idea as
hoped, leading to a host of API misuse problems.

  * Most programmers are afraid of crypto code, and so prefer to
integrate it into libraries in a highly abstracted manner, so as to
shield themselves from implementation details. Cryptographers, on
the other hand, prefer simple direct implementations, which they're
able to verify for high assurance and optimize in accordance with
their expertise.

  * Overly abstracted and flexible cryptography APIs lead to a host of
dangerous problems and performance issues. The kernel is usually not in
the business of coming up with new uses of crypto, but rather of
implementing various constructions, which means it essentially
needs a library of primitives, not a highly abstracted enterprise-ready
pluggable system, with a few particular exceptions.

This last observation has played out over and over again within the
kernel:

  * The perennial move of actual primitives away from crypto/ and into
lib/, so that users can actually call these functions directly with
no overhead and without lots of allocations, function pointers,
string specifier parsing, and general clunkiness. For example:
sha256, chacha20, siphash, sha1, and so forth live in lib/ rather
than in crypto/. Zinc intends to stop the cluttering of lib/ and
introduce these direct primitives into their proper place, lib/zinc/.

  * An abundance of misuse bugs with the present crypto API that have
been very unpleasant to clean up.

  * A hesitance to even use cryptography, because of the overhead and
headaches involved in accessing the routines.

Zinc goes in a rather different direction. Rather than providing a
thoroughly designed and abstracted API, Zinc gives you simple functions,
which implement some primitive, or some particular and specific
construction of primitives. It is not dynamic in the least, though one
could imagine implementing a complex dynamic dispatch mechanism (such as
the current crypto API) on top of these basic functions. After all,
dynamic dispatch is usually needed for applications with cipher agility,
such as IPsec, dm-crypt, AF_ALG, and so forth, and the existing crypto
API will continue to play that role. However, Zinc will provide a non-
haphazard way of directly utilizing crypto routines in applications
that have neither the need nor the desire for abstraction and dynamic
dispatch.

It also organizes the implementations in a simple, straightforward,
and direct manner, making it enjoyable and intuitive to work on.
Rather than moving optimized assembly implementations into arch/, it
keeps them all together in lib/zinc/, making it simple and obvious to
compare and contrast what's happening. This is, notably, exactly what
the lib/raid6/ tree does, and that seems to work out rather well. It's
also the pattern of most successful crypto libraries. The architecture-
specific glue-code is made a part of each translation unit, rather than
being in a separate one, so that generic and architecture-optimized code
are combined at compile-time, and incompatibility branches compiled out by
the optimizer.

All implementations have been extensively tested and fuzzed, and are
selected for their quality, trustworthiness, and performance. Wherever
possible and performant, formally verified implementations are used,
such as those from HACL* [1] and Fiat-Crypto [2]. The routines also take
special care to zero out secrets using memzero_explicit (and future work
is planned to have gcc do this more reliably and performantly with
compiler plugins). The performance of the selected implementations is
state-of-the-art and unrivaled on a broad array of hardware, though of
course we will continue to fine tune these to the hardware demands
needed by kernel contributors. Each implementation also comes with
extensive self-tests and crafted test vectors, pulled from various
places such as Wycheproof [9].

Regularity of function signatures is important, so that users can easily
"guess" the name of the function they want. Though, individual
primitives are oftentimes not trivially inter

[PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel

2018-09-14 Thread Jason A. Donenfeld
Changes v3->v4:
  - Remove mistaken double 07/17 patch.
  - Fix whitespace issues in blake2s assembly.
  - It's not possible to put compound literals into __initconst, so
we now instead just use boring fixed size struct members.
  - Move away from makefile ifdef maze and instead prefer kconfig values,
which also makes the design a bit more modular, which could help
in the future.
  - Port old crypto API implementations (ChaCha20 and Poly1305) to Zinc.
  - Port security/keys/big_key to Zinc as second example of a good usage of
Zinc.
  - Document precisely what is different between the kernel code and
CRYPTOGAMS code when the CRYPTOGAMS code is used.
  - Move changelog to top of 00/20 message so that people can
actually find it.

---

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * 
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

---

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and likely
already today you've consumed bytes that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. A decision is still waiting from the
Linux Plumbers Conference, but an update on these topics may be presented
in Vancouver in a few months. Prior presentations:

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into four segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third shows usage of Zinc within the existing crypto API and as a replacement
for the existing crypto API. The last is WireGuard itself, presented as an
unintrusive and self-contained virtual network driver.

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree. Subsequently, WireGuard patches will go through
DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.

Enjoy,
Jason


[PATCH net-next v4 01/20] asm: simd context helper API

2018-09-14 Thread Jason A. Donenfeld
Sometimes it's useful to amortize calls to XSAVE/XRSTOR and the related
FPU/SIMD functions over a number of calls, because FPU restoration is
quite expensive. This adds a simple header for carrying out this pattern:

simd_context_t simd_context = simd_get();
while ((item = get_item_from_queue()) != NULL) {
encrypt_item(item, simd_context);
simd_context = simd_relax(simd_context);
}
simd_put(simd_context);

The relaxation step ensures that we don't trample over preemption, and
the get/put API should be a familiar paradigm in the kernel.
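
Given those semantics, one plausible shape for the relaxation step is the
sketch below; it is an illustration of the idea rather than a quote of the
code added here:

    static inline simd_context_t simd_relax(simd_context_t prior_context)
    {
    #ifdef CONFIG_PREEMPT
            /* Briefly drop the FPU so a pending reschedule can happen. */
            if (prior_context != HAVE_NO_SIMD && need_resched()) {
                    simd_put(prior_context);
                    return simd_get();
            }
    #endif
            return prior_context;
    }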

Signed-off-by: Jason A. Donenfeld 
Reviewed-by: Palmer Dabbelt 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Thomas Gleixner 
Cc: Greg KH 
Cc: linux-a...@vger.kernel.org
---
 arch/alpha/include/asm/Kbuild  |  5 ++--
 arch/arc/include/asm/Kbuild|  1 +
 arch/arm/include/asm/simd.h| 42 ++
 arch/arm64/include/asm/simd.h  | 37 +-
 arch/c6x/include/asm/Kbuild|  3 ++-
 arch/h8300/include/asm/Kbuild  |  3 ++-
 arch/hexagon/include/asm/Kbuild|  1 +
 arch/ia64/include/asm/Kbuild   |  1 +
 arch/m68k/include/asm/Kbuild   |  1 +
 arch/microblaze/include/asm/Kbuild |  1 +
 arch/mips/include/asm/Kbuild   |  1 +
 arch/nds32/include/asm/Kbuild  |  7 ++---
 arch/nios2/include/asm/Kbuild  |  1 +
 arch/openrisc/include/asm/Kbuild   |  7 ++---
 arch/parisc/include/asm/Kbuild |  1 +
 arch/powerpc/include/asm/Kbuild|  3 ++-
 arch/riscv/include/asm/Kbuild  |  3 ++-
 arch/s390/include/asm/Kbuild   |  3 ++-
 arch/sh/include/asm/Kbuild |  1 +
 arch/sparc/include/asm/Kbuild  |  1 +
 arch/um/include/asm/Kbuild |  3 ++-
 arch/unicore32/include/asm/Kbuild  |  1 +
 arch/x86/include/asm/simd.h| 30 -
 arch/xtensa/include/asm/Kbuild |  1 +
 include/asm-generic/simd.h | 15 +++
 include/linux/simd.h   | 28 
 26 files changed, 180 insertions(+), 21 deletions(-)
 create mode 100644 arch/arm/include/asm/simd.h
 create mode 100644 include/linux/simd.h

diff --git a/arch/alpha/include/asm/Kbuild b/arch/alpha/include/asm/Kbuild
index 0580cb8c84b2..07b2c1025d34 100644
--- a/arch/alpha/include/asm/Kbuild
+++ b/arch/alpha/include/asm/Kbuild
@@ -2,14 +2,15 @@
 
 
 generic-y += compat.h
+generic-y += current.h
 generic-y += exec.h
 generic-y += export.h
 generic-y += fb.h
 generic-y += irq_work.h
+generic-y += kprobes.h
 generic-y += mcs_spinlock.h
 generic-y += mm-arch-hooks.h
 generic-y += preempt.h
 generic-y += sections.h
+generic-y += simd.h
 generic-y += trace_clock.h
-generic-y += current.h
-generic-y += kprobes.h
diff --git a/arch/arc/include/asm/Kbuild b/arch/arc/include/asm/Kbuild
index feed50ce89fa..a7f4255f1649 100644
--- a/arch/arc/include/asm/Kbuild
+++ b/arch/arc/include/asm/Kbuild
@@ -22,6 +22,7 @@ generic-y += parport.h
 generic-y += pci.h
 generic-y += percpu.h
 generic-y += preempt.h
+generic-y += simd.h
 generic-y += topology.h
 generic-y += trace_clock.h
 generic-y += user.h
diff --git a/arch/arm/include/asm/simd.h b/arch/arm/include/asm/simd.h
new file mode 100644
index ..bf468993bbef
--- /dev/null
+++ b/arch/arm/include/asm/simd.h
@@ -0,0 +1,42 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#include 
+#ifndef _ASM_SIMD_H
+#define _ASM_SIMD_H
+
+static __must_check inline bool may_use_simd(void)
+{
+   return !in_interrupt();
+}
+
+#ifdef CONFIG_KERNEL_MODE_NEON
+#include 
+
+static inline simd_context_t simd_get(void)
+{
+   bool have_simd = may_use_simd();
+   if (have_simd)
+   kernel_neon_begin();
+   return have_simd ? HAVE_FULL_SIMD : HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+   if (prior_context != HAVE_NO_SIMD)
+   kernel_neon_end();
+}
+#else
+static inline simd_context_t simd_get(void)
+{
+   return HAVE_NO_SIMD;
+}
+
+static inline void simd_put(simd_context_t prior_context)
+{
+}
+#endif
+
+#endif /* _ASM_SIMD_H */
diff --git a/arch/arm64/include/asm/simd.h b/arch/arm64/include/asm/simd.h
index 6495cc51246f..058c336de38d 100644
--- a/arch/arm64/include/asm/simd.h
+++ b/arch/arm64/include/asm/simd.h
@@ -1,11 +1,10 @@
-/*
- * Copyright (C) 2017 Linaro Ltd. 
+/* SPDX-License-Identifier: GPL-2.0
  *
- * This program is free software; you can redistribute it and/or modify it
- * under the terms of the GNU General Public License version 2 as published
- * by the Free Software Foundation.
+ * Copyright (C) 2017 Linaro Ltd. 
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
  */
 
+#include 
 #ifndef __ASM_SIMD_H
 #define __ASM_SIMD_H
 
@@ -16,6 +15,8 @@
 #include 
 
 #ifdef CONFIG_KERNEL_MODE_NEON
+#include 
+#include 
 
 DECLARE_PER_CPU(bool, kernel_neon_busy);
 
@@ -40,12 +41,36 @@ static __must_check inline bool may_use_simd(void)
  

[PATCH net-next v4 03/20] zinc: ChaCha20 generic C implementation

2018-09-14 Thread Jason A. Donenfeld
This implements the ChaCha20 permutation as a single C statement, by way
of the comma operator, which the compiler is able to simplify
terrifically.

Information: https://cr.yp.to/chacha.html
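
As a usage illustration, encrypting a buffer with the API added below,
together with the simd helpers from patch 01, looks roughly like this
(encrypt_buffer() is only an illustrative name, not part of this patch):

    #include <zinc/chacha20.h>
    #include <linux/simd.h>

    static void encrypt_buffer(u8 *dst, const u8 *src, u32 len,
                               const u8 key[CHACHA20_KEY_SIZE], u64 nonce)
    {
            struct chacha20_ctx state;
            simd_context_t simd = simd_get();

            chacha20_init(&state, key, nonce);
            chacha20(&state, dst, src, len, simd);
            simd_put(simd);
    }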

Signed-off-by: Jason A. Donenfeld 
Cc: Samuel Neves 
Cc: Andy Lutomirski 
Cc: Greg KH 
Cc: Jean-Philippe Aumasson 
---
 include/zinc/chacha20.h  |  54 +++
 lib/zinc/Kconfig |   5 ++
 lib/zinc/Makefile|   4 +
 lib/zinc/chacha20/chacha20.c | 168 +++
 lib/zinc/main.c  |   5 ++
 5 files changed, 236 insertions(+)
 create mode 100644 include/zinc/chacha20.h
 create mode 100644 lib/zinc/chacha20/chacha20.c

diff --git a/include/zinc/chacha20.h b/include/zinc/chacha20.h
new file mode 100644
index ..3c2c2f72d88a
--- /dev/null
+++ b/include/zinc/chacha20.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ */
+
+#ifndef _ZINC_CHACHA20_H
+#define _ZINC_CHACHA20_H
+
+#include 
+#include 
+#include 
+#include 
+
+enum {
+   CHACHA20_IV_SIZE = 16,
+   CHACHA20_KEY_SIZE = 32,
+   CHACHA20_BLOCK_SIZE = 64,
+   CHACHA20_BLOCK_WORDS = CHACHA20_BLOCK_SIZE / sizeof(u32),
+   HCHACHA20_KEY_SIZE = 32,
+   HCHACHA20_NONCE_SIZE = 16
+};
+
+struct chacha20_ctx {
+   u32 key[8];
+   u32 counter[4];
+} __aligned(32);
+
+void chacha20_fpu_init(void);
+
+static inline void chacha20_init(struct chacha20_ctx *state,
+const u8 key[CHACHA20_KEY_SIZE],
+const u64 nonce)
+{
+   state->key[0] = get_unaligned_le32(key + 0);
+   state->key[1] = get_unaligned_le32(key + 4);
+   state->key[2] = get_unaligned_le32(key + 8);
+   state->key[3] = get_unaligned_le32(key + 12);
+   state->key[4] = get_unaligned_le32(key + 16);
+   state->key[5] = get_unaligned_le32(key + 20);
+   state->key[6] = get_unaligned_le32(key + 24);
+   state->key[7] = get_unaligned_le32(key + 28);
+   state->counter[0] = state->counter[1] = 0;
+   state->counter[2] = nonce & U32_MAX;
+   state->counter[3] = nonce >> 32;
+}
+void chacha20(struct chacha20_ctx *state, u8 *dst, const u8 *src, u32 len,
+ simd_context_t simd_context);
+
+/* Derived key should be 32-bit aligned */
+void hchacha20(u8 derived_key[CHACHA20_KEY_SIZE],
+  const u8 nonce[HCHACHA20_NONCE_SIZE],
+  const u8 key[HCHACHA20_KEY_SIZE], simd_context_t simd_context);
+
+#endif /* _ZINC_CHACHA20_H */
diff --git a/lib/zinc/Kconfig b/lib/zinc/Kconfig
index 5980c411af0d..e7d396d61607 100644
--- a/lib/zinc/Kconfig
+++ b/lib/zinc/Kconfig
@@ -1,6 +1,11 @@
 config ZINC
tristate
 
+config ZINC_CHACHA20
+   bool
+   select ZINC
+   select CRYPTO_ALGAPI
+
 config ZINC_DEBUG
bool "Zinc cryptography library debugging and self-tests"
depends on ZINC
diff --git a/lib/zinc/Makefile b/lib/zinc/Makefile
index dad47573de42..0b5a964bfba6 100644
--- a/lib/zinc/Makefile
+++ b/lib/zinc/Makefile
@@ -3,6 +3,10 @@ ccflags-y += -Wframe-larger-than=8192
 ccflags-y += -D'pr_fmt(fmt)=KBUILD_MODNAME ": " fmt'
 ccflags-$(CONFIG_ZINC_DEBUG) += -DDEBUG
 
+ifeq ($(CONFIG_ZINC_CHACHA20),y)
+zinc-y += chacha20/chacha20.o
+endif
+
 zinc-y += main.o
 
 obj-$(CONFIG_ZINC) := zinc.o
diff --git a/lib/zinc/chacha20/chacha20.c b/lib/zinc/chacha20/chacha20.c
new file mode 100644
index ..1d9168e6c142
--- /dev/null
+++ b/lib/zinc/chacha20/chacha20.c
@@ -0,0 +1,168 @@
+/* SPDX-License-Identifier: GPL-2.0
+ *
+ * Copyright (C) 2015-2018 Jason A. Donenfeld . All Rights Reserved.
+ *
+ * Implementation of the ChaCha20 stream cipher.
+ *
+ * Information: https://cr.yp.to/chacha.html
+ */
+
+#include 
+
+#include 
+#include 
+
+#ifndef HAVE_CHACHA20_ARCH_IMPLEMENTATION
+void __init chacha20_fpu_init(void)
+{
+}
+static inline bool chacha20_arch(u8 *out, const u8 *in, const size_t len,
+const u32 key[8], const u32 counter[4],
+simd_context_t simd_context)
+{
+   return false;
+}
+static inline bool hchacha20_arch(u8 *derived_key, const u8 *nonce,
+ const u8 *key, simd_context_t simd_context)
+{
+   return false;
+}
+#endif
+
+#define EXPAND_32_BYTE_K 0x61707865U, 0x3320646eU, 0x79622d32U, 0x6b206574U
+
+#define QUARTER_ROUND(x, a, b, c, d) ( \
+   x[a] += x[b], \
+   x[d] = rol32((x[d] ^ x[a]), 16), \
+   x[c] += x[d], \
+   x[b] = rol32((x[b] ^ x[c]), 12), \
+   x[a] += x[b], \
+   x[d] = rol32((x[d] ^ x[a]), 8), \
+   x[c] += x[d], \
+   x[b] = rol32((x[b] ^ x[c]), 7) \
+)
+
+#define C(i, j) (i * 4 + j)
+
+#define DOUBLE_ROUND(x) ( \
+   /* Column Round */ \
+   QUARTER_ROUND(x, C(0, 0), C(1, 0), C(2, 0), C(3, 0)), \
+   QUARTER_ROUND(x, C(0, 1), C(1, 1), C(2, 1), C(3, 1)), \
+   QUARTER_ROUND(x, C(0, 2), C(1, 2), C(2, 2), C(3, 2)), \

[PATCH net-next v4 00/20] WireGuard: Secure Network Tunnel

2018-09-14 Thread Jason A. Donenfeld
Changes v3->v4:
  - Remove mistaken double 07/17 patch.
  - Fix whitespace issues in blake2s assembly.
  - It's not possible to put compound literals into __initconst, so
we now instead just use boring fixed size struct members.
  - Move away from makefile ifdef maze and instead prefer kconfig values,
which also makes the design a bit more modular too, which could help
in the future.
  - Port old crypto API implementations (ChaCha20 and Poly1305) to Zinc.
  - Port security/keys/big_key to Zinc as second example of a good usage of
Zinc.
  - Document precisely what is different between the kernel code and
CRYPTOGAMS code when the CRYPTOGAMS code is used.
  - Move changelog to top of 00/20 message so that people can
actually find it.

---

This patchset is available on git.kernel.org in this branch, where it may be
pulled directly for inclusion into net-next:

  * https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/linux.git/log/?h=jd/wireguard

---

WireGuard is a secure network tunnel written especially for Linux, which
has faced around three years of serious development, deployment, and
scrutiny. It delivers excellent performance and is extremely easy to
use and configure. It has been designed with the primary goal of being
both easy to audit by virtue of being small and highly secure from a
cryptography and systems security perspective. WireGuard is used by some
massive companies pushing enormous amounts of traffic, and you have
likely already consumed bytes today that at some point transited through
a WireGuard tunnel. Even as an out-of-tree module, WireGuard has been
integrated into various userspace tools, Linux distributions, mobile
phones, and data centers. There are ports in several languages to
several operating systems, and even commercial hardware and services
sold integrating WireGuard. It is time, therefore, for WireGuard to be
properly integrated into Linux.

Ample information, including documentation, installation instructions,
and project details, is available at:

  * https://www.wireguard.com/
  * https://www.wireguard.com/papers/wireguard.pdf

As it is currently an out-of-tree module, it lives in its own git repo
and has its own mailing list, and every commit for the module is tested
against every stable kernel since 3.10 on a variety of architectures
using an extensive test suite:

  * https://git.zx2c4.com/WireGuard
https://git.kernel.org/pub/scm/linux/kernel/git/zx2c4/WireGuard.git/
  * https://lists.zx2c4.com/mailman/listinfo/wireguard
  * https://www.wireguard.com/build-status/

The project has been broadly discussed at conferences, and was presented
to the Netdev developers in Seoul last November, where a paper was
released detailing some interesting aspects of the project. Dave asked
me after the talk if I would consider sending in a v1 "sooner rather
than later", hence this patchset. A decision is still waiting from the
Linux Plumbers Conference, but an update on these topics may be presented
in Vancouver in a few months. Prior presentations:

  * https://www.wireguard.com/presentations/
  * https://www.wireguard.com/papers/wireguard-netdev22.pdf

The cryptography in the protocol itself has been formally verified by
several independent academic teams with positive results, and I know of
two additional efforts on their way to further corroborate those
findings. The version 1 protocol is "complete", and so the purpose of
this review is to assess the implementation of the protocol. However, it
still may be of interest to know that the thing you're reviewing uses a
protocol with various nice security properties:

  * https://www.wireguard.com/formal-verification/

This patchset is divided into four segments. The first introduces a very
simple helper for working with the FPU state for the purposes of amortizing
SIMD operations. The second segment is a small collection of cryptographic
primitives, split up into several commits by primitive and by hardware. The
third shows usage of Zinc within the existing crypto API and as a replacement
to the existing crypto API. The last is WireGuard itself, presented as an
unintrusive and self-contained virtual network driver.
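
To make that first segment concrete, the intended usage pattern is to acquire
the FPU once, run many primitive calls against the same simd_context_t, and
release it at the end, rather than paying the save/restore cost per call. A
rough sketch follows, where simd_get()/simd_put() are stand-ins for the
helpers that segment introduces (the final names and signatures may differ):

static void encrypt_many(struct chacha20_ctx *state,
			 u8 **bufs, u32 *lens, size_t nr_bufs)
{
	simd_context_t simd_context = simd_get();
	size_t i;

	/* All iterations share one FPU-enabled section instead of toggling
	 * SIMD state per buffer. */
	for (i = 0; i < nr_bufs; ++i)
		chacha20(state, bufs[i], bufs[i], lens[i], simd_context);

	simd_put(simd_context);
}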

It is intended that this entire patch series enter the kernel through
DaveM's net-next tree. Subsequently, WireGuard patches will go through
DaveM's net-next tree, while Zinc patches will go through Greg KH's tree.

Enjoy,
Jason


[PATCH] crypto: caam/jr - fix ablkcipher_edesc pointer arithmetic

2018-09-14 Thread Horia Geantă
In some cases the zero-length hw_desc array at the end of
the ablkcipher_edesc struct requires 4 bytes of tail padding.

Due to tail padding and the way pointers to S/G table and IV
are computed:
edesc->sec4_sg = (void *)edesc + sizeof(struct ablkcipher_edesc) +
 desc_bytes;
iv = (u8 *)edesc->hw_desc + desc_bytes + sec4_sg_bytes;
first 4 bytes of IV are overwritten by S/G table.

Update computation of pointer to S/G table to rely on offset of hw_desc
member and not on sizeof() operator.

Cc:  # 4.13+
Fixes: 115957bb3e59 ("crypto: caam - fix IV DMA mapping and updating")
Signed-off-by: Horia Geantă 
---

This is for crypto-2.6 tree / current v4.19 release cycle.

Note that it will create merge conflicts later in v4.20 due to commits
cf5448b5c3d8 ("crypto: caam/jr - remove ablkcipher IV generation")
5ca7badb1f62 ("crypto: caam/jr - ablkcipher -> skcipher conversion")
from cryptodev-2.6 tree.

Should I send a similar fix for skcipher-based caam/jr driver
on cryptodev-2.6 tree, or will this be handled while solving the conflicts?
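
For readers less familiar with the layout issue: sizeof() on a struct ending
in a zero-length array includes any tail padding the compiler adds to satisfy
the struct's alignment, while offsetof() on that array does not. A
self-contained user-space sketch (hypothetical fields, not the real caam
structure) showing the 4-byte difference on a typical 64-bit ABI:

#include <stdio.h>
#include <stddef.h>

/* Stand-in for ablkcipher_edesc: the 8-byte member forces 8-byte struct
 * alignment, so 4 bytes of tail padding follow hw_desc[]. */
struct edesc_like {
	int src_nents;
	int dst_nents;
	unsigned long long iv_dma;	/* 8-byte aligned member */
	int sec4_sg_bytes;
	unsigned int hw_desc[];		/* descriptor starts here */
};

int main(void)
{
	/* Typically prints 24 vs. 20: computing "base + sizeof(...)" lands
	 * 4 bytes past where the hw_desc-relative math (used for the IV)
	 * expects the S/G table, so the table overlaps the first 4 bytes
	 * of the IV. */
	printf("sizeof   = %zu\n", sizeof(struct edesc_like));
	printf("offsetof = %zu\n", offsetof(struct edesc_like, hw_desc));
	return 0;
}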

 drivers/crypto/caam/caamalg.c | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/crypto/caam/caamalg.c b/drivers/crypto/caam/caamalg.c
index d67667970f7e..ec40f991e6c6 100644
--- a/drivers/crypto/caam/caamalg.c
+++ b/drivers/crypto/caam/caamalg.c
@@ -1553,8 +1553,8 @@ static struct ablkcipher_edesc *ablkcipher_edesc_alloc(struct ablkcipher_request
edesc->src_nents = src_nents;
edesc->dst_nents = dst_nents;
edesc->sec4_sg_bytes = sec4_sg_bytes;
-   edesc->sec4_sg = (void *)edesc + sizeof(struct ablkcipher_edesc) +
-desc_bytes;
+   edesc->sec4_sg = (struct sec4_sg_entry *)((u8 *)edesc->hw_desc +
+ desc_bytes);
edesc->iv_dir = DMA_TO_DEVICE;
 
/* Make sure IV is located in a DMAable area */
@@ -1757,8 +1757,8 @@ static struct ablkcipher_edesc *ablkcipher_giv_edesc_alloc(
edesc->src_nents = src_nents;
edesc->dst_nents = dst_nents;
edesc->sec4_sg_bytes = sec4_sg_bytes;
-   edesc->sec4_sg = (void *)edesc + sizeof(struct ablkcipher_edesc) +
-desc_bytes;
+   edesc->sec4_sg = (struct sec4_sg_entry *)((u8 *)edesc->hw_desc +
+ desc_bytes);
edesc->iv_dir = DMA_FROM_DEVICE;
 
/* Make sure IV is located in a DMAable area */
-- 
2.16.2



Re: [PATCH v2 05/17] compat_ioctl: move more drivers to generic_compat_ioctl_ptrarg

2018-09-14 Thread David Sterba
On Wed, Sep 12, 2018 at 05:08:52PM +0200, Arnd Bergmann wrote:
> The .ioctl and .compat_ioctl file operations have the same prototype so
> they can both point to the same function, which works great almost all
> the time when all the commands are compatible.
> 
> One exception is the s390 architecture, where a compat pointer is only
> 31 bits wide, and converting it into a 64-bit pointer requires calling
> compat_ptr(). Most drivers here will never run on s390, but since we now
> have a generic helper for it, it's easy enough to use it consistently.
> 
> I double-checked all these drivers to ensure that all ioctl arguments
> are used as pointers or are ignored, but are not interpreted as integer
> values.
> 
> Signed-off-by: Arnd Bergmann 
> ---

>  fs/btrfs/super.c| 2 +-

Acked-by: David Sterba 
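
For context, the helper being adopted in this series essentially forwards to
the native handler after widening the compat pointer; a sketch of what such a
wrapper typically looks like (the actual implementation in the series may
differ) is:

#include <linux/fs.h>
#include <linux/compat.h>

/* Sketch, not the exact helper from the series: forward a compat ioctl to
 * the native .unlocked_ioctl after converting the argument with compat_ptr(),
 * which matters on s390 where compat pointers are only 31 bits wide. */
long generic_compat_ioctl_ptrarg(struct file *file, unsigned int cmd,
				 unsigned long arg)
{
	if (!file->f_op->unlocked_ioctl)
		return -ENOIOCTLCMD;

	return file->f_op->unlocked_ioctl(file, cmd,
					  (unsigned long)compat_ptr(arg));
}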


Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-09-14 Thread Jerome Glisse
On Fri, Sep 14, 2018 at 06:50:55AM +, Tian, Kevin wrote:
> > From: Jerome Glisse
> > Sent: Thursday, September 13, 2018 10:52 PM
> >
> [...]
>  > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> > by default at boot or at least all devices behind the same bridge.
> 
> the group thing reflects a physical hierarchy limitation; it does not
> change across boots. Please note an iommu group defines the minimal
> isolation boundary - all devices within the same group must be attached
> to the same iommu domain or address space, because physically the IOMMU
> cannot differentiate DMAs coming out of those devices. Devices behind a
> legacy PCI-X bridge are one example. Other examples include devices
> behind a PCIe switch port which doesn't support ACS and thus cannot
> route p2p transactions to the IOMMU. If talking about a typical PCIe
> endpoint (with upstream ports all supporting ACS), you'll get
> one device per group.
> 
> One iommu group today is attached to only one iommu domain.
> In the future one group may attach to multiple domains, as the
> aux domain concept being discussed in another thread.

Thanks for the info.

> 
> > 
> > Maybe there are kernel options to avoid that and the userspace init program
> > can definitely re-arrange that based on sysadmin policy).
> 
> I don't think there is such an option, as it may break the isolation
> model enabled by the IOMMU.
> 
> [...]
> > > > That is why i am being pedantic :) on making sure there is good reasons
> > > > to do what you do inside VFIO. I do believe that we want a common
> > frame-
> > > > work like the one you are proposing but i do not believe it should be
> > > > part of VFIO given the baggages it comes with and that are not relevant
> > > > to the use cases for this kind of devices.
> > >
> 
> The purpose of VFIO is clear - it is the kernel portal for granting generic
> device resources (mmio, irq, etc.) to user space. VFIO doesn't care
> what exactly a resource is used for (queue, cmd reg, etc.). If really
> pursuing the VFIO path is necessary, maybe such a common framework
> should live in user space, which gets all granted resources from the
> kernel driver through VFIO and then provides accelerator services to
> other processes?

Except that many existing device drivers fall under that description
(i.e. exposing mmio, command queues, ...) and are not under VFIO.

Up to mdev, VFIO was all about handing a full device to userspace AFAIK.
With the introduction of mdev, a host kernel driver can "slice" its
device and share it through VFIO to userspace. Note that in that case
it might never hand over any mmio, irq, ...; the host driver might just
be handing over memory and polling from it to schedule work on the
real hardware.


The question I am asking about WarpDrive is whether being in VFIO is
necessary, as I do not see the requirement myself.

Cheers,
Jérôme
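
As a reference point for the discussion above, the classic (non-mdev) VFIO
flow — user space claiming a whole IOMMU group and then reaching device
resources through a device fd — looks roughly like the sketch below; the
group number and PCI address are placeholders and error handling is omitted:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/vfio.h>

int main(void)
{
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR);	/* hypothetical group number */
	struct vfio_group_status status = { .argsz = sizeof(status) };
	struct vfio_device_info info = { .argsz = sizeof(info) };
	int device;

	ioctl(group, VFIO_GROUP_GET_STATUS, &status);	/* expect the VIABLE flag */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	/* The device fd is what exposes MMIO regions and interrupts. */
	device = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");
	ioctl(device, VFIO_DEVICE_GET_INFO, &info);
	return 0;
}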


Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-09-14 Thread Jerome Glisse
On Fri, Sep 14, 2018 at 11:12:01AM +0800, Kenneth Lee wrote:
> On Thu, Sep 13, 2018 at 10:51:50AM -0400, Jerome Glisse wrote:
> > On Thu, Sep 13, 2018 at 04:32:32PM +0800, Kenneth Lee wrote:
> > > On Tue, Sep 11, 2018 at 09:40:14AM -0400, Jerome Glisse wrote:
> > > > On Tue, Sep 11, 2018 at 02:40:43PM +0800, Kenneth Lee wrote:
> > > > > On Mon, Sep 10, 2018 at 11:33:59PM -0400, Jerome Glisse wrote:
> > > > > > On Tue, Sep 11, 2018 at 10:42:09AM +0800, Kenneth Lee wrote:
> > > > > > > On Mon, Sep 10, 2018 at 10:54:23AM -0400, Jerome Glisse wrote:
> > > > > > > > On Mon, Sep 10, 2018 at 11:28:09AM +0800, Kenneth Lee wrote:
> > > > > > > > > On Fri, Sep 07, 2018 at 12:53:06PM -0400, Jerome Glisse wrote:
> > > > > > > > > > On Fri, Sep 07, 2018 at 12:01:38PM +0800, Kenneth Lee wrote:
> > > > > > > > > > > On Thu, Sep 06, 2018 at 09:31:33AM -0400, Jerome Glisse 
> > > > > > > > > > > wrote:
> > > > > > > > > > > > On Thu, Sep 06, 2018 at 05:45:32PM +0800, Kenneth Lee 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Tue, Sep 04, 2018 at 10:15:09AM -0600, Alex 
> > > > > > > > > > > > > Williamson wrote:
> > > > > > > > > > > > > > On Tue, 4 Sep 2018 11:00:19 -0400 Jerome Glisse 
> > > > > > > > > > > > > >  wrote:
> > > > > > > > > > > > > > > On Mon, Sep 03, 2018 at 08:51:57AM +0800, Kenneth 
> > > > > > > > > > > > > > > Lee wrote:
> > > > > > > > 
> > > > > > > > [...]
> > > > > > > > 
> > > > > > > > > > > I took a look at i915_gem_execbuffer_ioctl(). It seems it 
> > > > > > > > > > > "copy_from_user" the
> > > > > > > > > > > user memory to the kernel. That is not what we need. What 
> > > > > > > > > > > we try to get is: the
> > > > > > > > > > > user application does something on its data, and pushes it 
> > > > > > > > > > > away to the accelerator,
> > > > > > > > > > > and says: "I'm tied, it is your turn to do the job...". 
> > > > > > > > > > > Then the accelerator has
> > > > > > > > > > > the memory, referring any portion of it with the same VAs 
> > > > > > > > > > > of the application,
> > > > > > > > > > > even the VAs are stored inside the memory itself.
> > > > > > > > > > 
> > > > > > > > > > You were not looking at right place see 
> > > > > > > > > > drivers/gpu/drm/i915/i915_gem_userptr.c
> > > > > > > > > > It does GUP and create GEM object AFAICR you can wrap that 
> > > > > > > > > > GEM object into a
> > > > > > > > > > dma buffer object.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > Thank you for directing me to this implementation. It is 
> > > > > > > > > interesting:).
> > > > > > > > > 
> > > > > > > > > But it does not yet solve my problem. If I understand it right, 
> > > > > > > > > the userptr in
> > > > > > > > > i915 do the following:
> > > > > > > > > 
> > > > > > > > > 1. The user process sets a user pointer with size to the 
> > > > > > > > > kernel via ioctl.
> > > > > > > > > 2. The kernel wraps it as a dma-buf and keeps the process's 
> > > > > > > > > mm for further
> > > > > > > > >reference.
> > > > > > > > > 3. The user pages are allocated, GUPed or DMA mapped to the 
> > > > > > > > > device. So the data
> > > > > > > > >can be shared between the user space and the hardware.
> > > > > > > > > 
> > > > > > > > > But my scenario is: 
> > > > > > > > > 
> > > > > > > > > 1. The user process has some data in the user space, pointed 
> > > > > > > > > by a pointer, say
> > > > > > > > >ptr1. And within the memory, there may be some other 
> > > > > > > > > pointers, let's say one
> > > > > > > > >of them is ptr2.
> > > > > > > > > 2. Now I need to assign ptr1 *directly* to the hardware MMIO 
> > > > > > > > > space. And the
> > > > > > > > >hardware must refer ptr1 and ptr2 *directly* for data.
> > > > > > > > > 
> > > > > > > > > Userptr lets the hardware and process share the same memory 
> > > > > > > > > space. But I need
> > > > > > > > > them to share the same *address space*. So IOMMU is a MUST 
> > > > > > > > > for WarpDrive,
> > > > > > > > > NOIOMMU mode, as Jean said, is just for verifying some of the 
> > > > > > > > > procedure is OK.
> > > > > > > > 
> > > > > > > > So to be 100% clear should we _ignore_ the non SVA/SVM case ?
> > > > > > > > If so then wait for necessary SVA/SVM to land and do warp drive
> > > > > > > > without non SVA/SVM path.
> > > > > > > > 
> > > > > > > 
> > > > > > > I think we should clarify the concept of SVA/SVM here. As I
> > > > > > > understand it, Shared
> > > > > > > Virtual Address/Memory means: any virtual address in a process 
> > > > > > > can be used by
> > > > > > > device at the same time. This requires IOMMU device to support 
> > > > > > > PASID. And
> > > > > > > optionally, it requires the feature of page-fault-from-device.
> > > > > > 
> > > > > > Yes we agree on what SVA/SVM is. There is one gotcha though:
> > > > > > access
> > > > > > to ranges that are MMIO mapped, i.e. CPU page tables pointing to IO memory, 
> > > > > > IIRC
> > > > > > it is undefined what happens on so

Re: [RFCv2 PATCH 0/7] A General Accelerator Framework, WarpDrive

2018-09-14 Thread Kenneth Lee
On Fri, Sep 14, 2018 at 06:50:55AM +, Tian, Kevin wrote:
> > From: Jerome Glisse
> > Sent: Thursday, September 13, 2018 10:52 PM
> >
> [...]
>  > AFAIK, on x86 and PPC at least, all PCIE devices are in the same group
> > by default at boot or at least all devices behind the same bridge.
> 
> the group thing reflects a physical hierarchy limitation; it does not
> change across boots. Please note an iommu group defines the minimal
> isolation boundary - all devices within the same group must be attached
> to the same iommu domain or address space, because physically the IOMMU
> cannot differentiate DMAs coming out of those devices. Devices behind a
> legacy PCI-X bridge are one example. Other examples include devices
> behind a PCIe switch port which doesn't support ACS and thus cannot
> route p2p transactions to the IOMMU. If talking about a typical PCIe
> endpoint (with upstream ports all supporting ACS), you'll get
> one device per group.
> 
> One iommu group today is attached to only one iommu domain.
> In the future one group may attach to multiple domains, as the
> aux domain concept being discussed in another thread.
> 
> > 
> > Maybe there are kernel options to avoid that and the userspace init program
> > can definitely re-arrange that based on sysadmin policy).
> 
> I don't think there is such an option, as it may break the isolation
> model enabled by the IOMMU.
> 
> [...]
> > > > That is why i am being pedantic :) on making sure there is good reasons
> > > > to do what you do inside VFIO. I do believe that we want a common
> > frame-
> > > > work like the one you are proposing but i do not believe it should be
> > > > part of VFIO given the baggages it comes with and that are not relevant
> > > > to the use cases for this kind of devices.
> > >
> 
> The purpose of VFIO is clear - it is the kernel portal for granting generic
> device resources (mmio, irq, etc.) to user space. VFIO doesn't care
> what exactly a resource is used for (queue, cmd reg, etc.). If really
> pursuing the VFIO path is necessary, maybe such a common framework
> should live in user space, which gets all granted resources from the
> kernel driver through VFIO and then provides accelerator services to
> other processes?

Yes. I think this is exactly what WarpDrive is now doing. This patch just lets
the type1 driver use the parent IOMMU for mdev.

> 
> Thanks
> Kevin

-- 
-Kenneth(Hisilicon)





Re: [PATCH net-next v3 02/17] zinc: introduce minimal cryptography library

2018-09-14 Thread Jason A. Donenfeld
On Fri, Sep 14, 2018 at 8:15 AM Ard Biesheuvel
 wrote:
> OK, so given random.c's future dependency on Zinc (for ChaCha20), and
> the fact that Zinc is one monolithic piece of code, all versions of
> all algorithms will always be statically linked into the kernel
> proper. I'm not sure that is acceptable.

v4 already addresses that issue, actually. I'll post it shortly.

> BTW you haven't answered my question yet about what happens when the
> WireGuard protocol version changes: will we need a flag day and switch
> all deployments over at the same time?

No, that won't be necessary, necessarily. Peers are individually
versioned and the protocol is fairly flexible in this regard.