[PATCH] crypto: n2 - Use platform_register/unregister_drivers()

2015-12-02 Thread Thierry Reding
From: Thierry Reding 

These new helpers simplify implementing multi-driver modules and
properly handle failure to register one driver by unregistering all
previously registered drivers.
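
For reference, the unwind-on-failure behaviour these helpers provide is
roughly the following (an illustrative sketch only; the function name is
made up and this is not the actual kernel implementation):

    #include <linux/platform_device.h>

    static int example_register_drivers(struct platform_driver * const *drv,
                                        unsigned int count)
    {
        unsigned int i;
        int err;

        for (i = 0; i < count; i++) {
            err = platform_driver_register(drv[i]);
            if (err)
                goto unwind;
        }
        return 0;

    unwind:
        /* unregister everything registered so far, in reverse order */
        while (i--)
            platform_driver_unregister(drv[i]);
        return err;
    }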

Signed-off-by: Thierry Reding 
---
 drivers/crypto/n2_core.c | 17 +++--
 1 file changed, 7 insertions(+), 10 deletions(-)

diff --git a/drivers/crypto/n2_core.c b/drivers/crypto/n2_core.c
index 5450880abb7b..739a786b9f08 100644
--- a/drivers/crypto/n2_core.c
+++ b/drivers/crypto/n2_core.c
@@ -2243,22 +2243,19 @@ static struct platform_driver n2_mau_driver = {
.remove =   n2_mau_remove,
 };
 
+static struct platform_driver * const drivers[] = {
+   &n2_crypto_driver,
+   &n2_mau_driver,
+};
+
 static int __init n2_init(void)
 {
-   int err = platform_driver_register(&n2_crypto_driver);
-
-   if (!err) {
-   err = platform_driver_register(&n2_mau_driver);
-   if (err)
-   platform_driver_unregister(&n2_crypto_driver);
-   }
-   return err;
+   return platform_register_drivers(drivers, ARRAY_SIZE(drivers));
 }
 
 static void __exit n2_exit(void)
 {
-   platform_driver_unregister(&n2_mau_driver);
-   platform_driver_unregister(&n2_crypto_driver);
+   platform_unregister_drivers(drivers, ARRAY_SIZE(drivers));
 }
 
 module_init(n2_init);
-- 
2.5.0



Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 12:41), David Laight wrote:
> You are getting 0.7 Gbps with aes-ccm-a-128, scale the esp-null back to
> that and it would use 7/18*71 = 27% of the cpu.
> So 69% of the cpu in the a-128 case is probably caused by the
> encryption itself.
> Even if the rest of the code cost nothing you'd not increase
> above 1Gbps.

Fortunately, the situation is not quite hopeless yet.

Thanks to Rick Jones for supplying the hints for this, but with
some careful manual pinning of irqs and iperf processes to cpus,
I can get to 4.5 Gbps for the esp-null case.

Given that the [clear traffic + GSO without GRO] gets me about 5-7 Gbps,
the 4.5 Gbps is not that far off (and at that point, the nickel-and-dime
tweaks may help even more).

For AES-GCM, I'm able to go from 1.8 Gbps (no GSO) to 2.8 Gbps.
Still not great, but it proves that we haven't hit any upper bounds yet.

I think a lot of the manual tweaking of irq/process placement
is needed because the existing rps/rfs flow steering is looking
for TCP/UDP flow numbers to do the steering. It can just as easily
use the IPsec SPI numbers to do this, and that's another place where
we can make this more ipsec-friendly.
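
Something along these lines is what I have in mind (purely illustrative
and not the actual flow-dissector code; the struct and helper names below
are made up):

    #include <linux/types.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <linux/jhash.h>

    /* Per RFC 4303 the SPI is the first 32-bit word of the ESP header. */
    struct esp_hdr_min {
        __be32 spi;
        __be32 seq_no;
    };

    static u32 esp_flow_entropy(const struct iphdr *iph)
    {
        const struct esp_hdr_min *esph;

        if (iph->protocol != IPPROTO_ESP)
            return 0;

        /* the ESP header starts right after the IP header */
        esph = (const struct esp_hdr_min *)((const u8 *)iph + iph->ihl * 4);

        /* mix the SPI with the addresses, as rps/rfs does with ports */
        return jhash_3words((__force u32)esph->spi,
                            (__force u32)iph->saddr,
                            (__force u32)iph->daddr, 0);
    }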

--Sowmini



[PATCH v4 3/5] crypto: AES CBC multi-buffer scheduler

2015-12-02 Thread Tim Chen

This patch implements an in-order scheduler for encrypting multiple
buffers in parallel, supporting AES CBC encryption with key sizes of
128, 192 and 256 bits. It uses 8 data lanes by taking advantage of the
SIMD instructions with XMM registers.

The multibuffer manager and scheduler are mostly written in assembly and
the initialization support is written in C. The AES CBC multibuffer crypto
driver interfaces with the multibuffer manager and scheduler
to support AES CBC encryption in parallel. The scheduler supports
job submission, job flushing and job retrieval after completion.

The basic flow of usage of the CBC multibuffer scheduler is as follows:

- The caller allocates an aes_cbc_mb_mgr_inorder_x8 object
and initializes it once by calling aes_cbc_init_mb_mgr_inorder_x8().

- The aes_cbc_mb_mgr_inorder_x8 structure has an array of JOB_AES
objects. Allocation and scheduling of JOB_AES objects are managed
by the multibuffer scheduler support routines. The caller allocates
a JOB_AES using aes_cbc_get_next_job_inorder_x8().

- The returned JOB_AES must be filled in with parameters for CBC
encryption (eg: plaintext buffer, ciphertext buffer, key, iv, etc) and
submitted to the manager object using aes_cbc_submit_job_inorder_xx().

- If the oldest JOB_AES is completed during a call to
aes_cbc_submit_job_inorder_x8(), it is returned. Otherwise,
NULL is returned.

- A call to aes_cbc_flush_job_inorder_x8() always returns the
oldest job, unless the multibuffer manager is empty of jobs.

- A call to aes_cbc_get_completed_job_inorder_x8() returns
a completed job. This routine is useful to process completed
jobs instead of waiting for the flusher to engage.

- When a job is returned from submit or flush, the caller extracts
the useful data and returns it to the multibuffer manager implicitly
by the next call to aes_cbc_get_next_job_xx().

Jobs are always returned from submit or flush routines in the order they
were submitted (hence "inorder"). A job allocated using
aes_cbc_get_next_job_inorder_x8() must be filled in and submitted before
another call. A job returned by aes_cbc_submit_job_inorder_x8() or
aes_cbc_flush_job_inorder_x8() is 'deallocated' upon the next call to
get a job structure. Calls to get_next_job() cannot fail. If all jobs are
allocated after a call to get_next_job(), the subsequent call to submit
always returns the oldest job in a completed state.
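
A minimal usage sketch of this flow is shown below (the JOB_AES field
names and the exact argument/return conventions are illustrative
assumptions, not taken from the actual definitions):

    struct aes_cbc_mb_mgr_inorder_x8 mgr;
    struct JOB_AES *job, *done;

    aes_cbc_init_mb_mgr_inorder_x8(&mgr);

    /* allocate a job slot; this cannot fail */
    job = aes_cbc_get_next_job_inorder_x8(&mgr);

    /* fill in the CBC parameters (field names invented for illustration) */
    job->plaintext  = src;
    job->ciphertext = dst;
    job->len        = len;
    job->keys       = expanded_key;
    job->iv         = iv;

    /* submit; the oldest job is returned only once it has completed */
    done = aes_cbc_submit_job_inorder_x8(&mgr);
    while (done) {
        /* consume done->ciphertext, done->iv, ... */
        done = aes_cbc_get_completed_job_inorder_x8(&mgr);
    }

    /* when no further submissions are coming, drain the remaining jobs */
    while ((done = aes_cbc_flush_job_inorder_x8(&mgr)) != NULL)
        ; /* consume the flushed job */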

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c   | 145 +++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S | 222 +++
 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S | 416 +
 3 files changed, 783 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_inorder_x8_asm.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_ooo_x8_asm.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c b/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
new file mode 100644
index 000..7a7f8a1
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_mb_mgr_init.c
@@ -0,0 +1,145 @@
+/*
+ * Initialization code for multi buffer AES CBC algorithm
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED 

Re: [PATCH v3 5/5] crypto: AES CBC multi-buffer glue code

2015-12-02 Thread Tim Chen
On Tue, 2015-12-01 at 09:19 -0800, Tim Chen wrote:
> On Thu, 2015-11-26 at 16:49 +0800, Herbert Xu wrote:
> > On Tue, Nov 24, 2015 at 10:30:06AM -0800, Tim Chen wrote:
> > >
> > > On the decrypt path, we don't need to use multi-buffer algorithm
> > > as aes-cbc decrypt can be parallelized inherently on a single
> > > request.  So most of the time the outer layer algorithm
> > > cbc_mb_async_ablk_decrypt can bypass mcryptd and
> > > invoke mb_aes_cbc_decrypt synchronously
> > > to do aes_cbc_dec when fpu is available.
> > > This avoids the overhead of going through mcryptd.  Hence
> > > the use of blkcipher on the inner layer.  For the mcryptd
> > > path, we will complete a decrypt request in one shot so
> > > blkcipher usage should be fine.
> > 
> > I think there is a misunderstanding here.  Just because you're
> > using/exporting through the ablkcipher interface doesn't mean
> > that you are asynchronous.  For example, all blkcipher algorithms
> > can be accessed through the ablkcipher interface and they of course
> > remain synchronous.
> > 
> > So I don't see how using an ablkcipher in the inner layer changes
> > anything at all.  You can still return immediately and not bother
> > with completion functions when you are synchronous.
> > 
> > Cheers,
> 
> OK, I'll try to see if I can cast things back to the original ablkcipher
> request and use that to walk the sg list.
> 

Herbert,

I've sent out a new version of this series to use ablkcipher on the
inner layer of decrypt.  Thanks.

Tim



Re: [RFC] KEYS: Exposing {a,}symmetric key ops to userspace and other bits

2015-12-02 Thread Mimi Zohar
On Sun, 2015-11-22 at 09:41 -0500, Mimi Zohar wrote:
> On Fri, 2015-11-20 at 11:07 +, David Howells wrote:
> > 
> >  (*) Add Mimi's patches to allow keys/keyrings to be marked undeletable.
> >  This is for the purpose of creating blacklists and to prevent people
> >  from removing entries in the blacklist.  Note that only the kernel can
> >  create a blacklist - we don't want userspace generating them as a way
> >  to take up kernel space.
> > 
> >  I think the right way to do this is to not allow marked keys to be
> >  unlinked from marked keyrings, but to allow marked keys to be unlinked
> >  from ordinary keyrings.
> > 
> >  The reason the 'keep' mark is required on individual keys is to prevent
> >  the keys from being directly revoked, expired or invalidated by keyctl
> >  without reference to the keyring.  Marked keys that are set expirable
> >  when they're created will still expire and be subsequently removed and if
> >  a marked key or marked keyring loses all its references it still gets
> >  gc'd.
> 
> Agreed.  I'll fix and re-post soon.

In addition to Petko's 3 patches, the ima-keyrings branch
(git://git.kernel.org/pub/scm/linux/kernel/git/zohar/linux-integrity.git) 
contains these two patches.

d939a88 IMA: prevent keys on the .ima_blacklist from being removed
77f33b5 KEYS: prevent keys from being removed from specified keyrings

As the IMA patch is dependent on the KEYS patch, do you mind if the KEYS
patch would be upstreamed together with this patch set?

Mimi



[PATCH v4 4/5] crypto: AES CBC by8 encryption

2015-12-02 Thread Tim Chen

This patch introduces the assembly routine to do a by8 AES CBC encryption
in support of the AES CBC multi-buffer implementation.

Encryption of 8 data streams of a given key size is done simultaneously.
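
Conceptually, the data flow implemented by the assembly is modeled by the
C sketch below (illustrative only: aes_encrypt_block() stands in for the
AES-NI round sequence, and the real code interleaves the eight lanes
across XMM registers instead of looping over them):

    #include <linux/types.h>
    #include <linux/string.h>

    #define NLANES     8
    #define AES_BLOCK  16

    /* stand-in for one AES encryption of a single 16-byte block */
    void aes_encrypt_block(const u8 *key_sched, const u8 *in, u8 *out);

    static void cbc_enc_by8_model(const u8 *in[NLANES], u8 *out[NLANES],
                                  const u8 *keys[NLANES],
                                  u8 iv[NLANES][AES_BLOCK],
                                  unsigned int nblocks)
    {
        unsigned int blk, lane, i;
        u8 x[AES_BLOCK];

        for (blk = 0; blk < nblocks; blk++) {
            for (lane = 0; lane < NLANES; lane++) {
                /* XOR the plaintext block with the running IV/ciphertext */
                for (i = 0; i < AES_BLOCK; i++)
                    x[i] = in[lane][blk * AES_BLOCK + i] ^ iv[lane][i];

                aes_encrypt_block(keys[lane], x,
                                  &out[lane][blk * AES_BLOCK]);

                /* CBC chaining: ciphertext becomes the next IV */
                memcpy(iv[lane], &out[lane][blk * AES_BLOCK], AES_BLOCK);
            }
        }
    }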

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S | 774 
 1 file changed, 774 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S b/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S
new file mode 100644
index 000..eaffc28
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_enc_x8.S
@@ -0,0 +1,774 @@
+/*
+ * AES CBC by8 multibuffer optimization (x86_64)
+ * This file implements 128/192/256 bit AES CBC encryption
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#include 
+
+/* stack size needs to be an odd multiple of 8 for alignment */
+
+#define AES_KEYSIZE_128  16
+#define AES_KEYSIZE_192  24
+#define AES_KEYSIZE_256  32
+
+#define XMM_SAVE_SIZE  16*10
+#define GPR_SAVE_SIZE  8*9
+#define STACK_SIZE (XMM_SAVE_SIZE + GPR_SAVE_SIZE)
+
+#define GPR_SAVE_REG   %rsp
+#define GPR_SAVE_AREA  %rsp + XMM_SAVE_SIZE
+#define LEN_AREA_OFFSET  XMM_SAVE_SIZE + 8*8
+#define LEN_AREA_REG   %rsp
+#define LEN_AREA   %rsp + XMM_SAVE_SIZE + 8*8
+
+#define IN_OFFSET  0
+#define OUT_OFFSET 8*8
+#define KEYS_OFFSET  16*8
+#define IV_OFFSET  24*8
+
+
+#define IDX  %rax
+#define TMP  %rbx
+#define ARG  %rdi
+#define LEN  %rsi
+
+#define KEYS0  %r14
+#define KEYS1  %r15
+#define KEYS2  %rbp
+#define KEYS3  %rdx
+#define KEYS4  %rcx
+#define KEYS5  %r8
+#define KEYS6  %r9
+#define KEYS7  %r10
+
+#define IN0  %r11
+#define IN2  %r12
+#define IN4  %r13
+#define IN6  LEN
+
+#define XDATA0 %xmm0
+#define XDATA1 %xmm1
+#define XDATA2 %xmm2
+#define XDATA3 %xmm3
+#define XDATA4 %xmm4
+#define XDATA5 %xmm5
+#define XDATA6 %xmm6
+#define XDATA7 %xmm7
+
+#define XKEY0_3  %xmm8
+#define XKEY1_4  %xmm9
+#define XKEY2_5  %xmm10
+#define XKEY3_6  %xmm11
+#define XKEY4_7  %xmm12
+#define XKEY5_8  %xmm13
+#define XKEY6_9  %xmm14
+#define XTMP   %xmm15
+
+#define MOVDQ movdqu /* assume buffers not aligned */
+#define CONCAT(a, b)   a##b
+#define INPUT_REG_SUFX 1   /* IN */
+#define XDATA_REG_SUFX 2   /* XDAT */
+#define KEY_REG_SUFX   3   /* KEY */
+#define XMM_REG_SUFX   4   /* XMM */
+
+/*
+ * To avoid positional parameter errors while compiling
+ * three registers 

[PATCH v4 2/5] crypto: AES CBC multi-buffer data structures

2015-12-02 Thread Tim Chen

This patch introduces the data structures and prototypes of functions
needed for doing AES CBC encryption using multi-buffer. Included are
the structures of the multi-buffer AES CBC job, job scheduler in C and
data structure defines in x86 assembly code.

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h|  96 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h| 131 
 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S | 270 +
 arch/x86/crypto/aes-cbc-mb/reg_sizes.S | 125 
 4 files changed, 622 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
 create mode 100644 arch/x86/crypto/aes-cbc-mb/mb_mgr_datastruct.S
 create mode 100644 arch/x86/crypto/aes-cbc-mb/reg_sizes.S

diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
new file mode 100644
index 000..5493f83
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_ctx.h
@@ -0,0 +1,96 @@
+/*
+ * Header file for multi buffer AES CBC algorithm manager
+ * that deals with 8 buffers at a time
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ * Sean Gulley 
+ * Tim Chen 
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ * Redistributions of source code must retain the above copyright
+ * notice, this list of conditions and the following disclaimer.
+ * Redistributions in binary form must reproduce the above copyright
+ * notice, this list of conditions and the following disclaimer in
+ * the documentation and/or other materials provided with the
+ * distribution.
+ * Neither the name of Intel Corporation nor the names of its
+ * contributors may be used to endorse or promote products derived
+ * from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#ifndef __AES_CBC_MB_CTX_H
+#define __AES_CBC_MB_CTX_H
+
+
+#include 
+
+#include "aes_cbc_mb_mgr.h"
+
+#define CBC_ENCRYPT  0x01
+#define CBC_DECRYPT  0x02
+#define CBC_START  0x04
+#define CBC_DONE   0x08
+
+#define CBC_CTX_STS_IDLE   0x00
+#define CBC_CTX_STS_PROCESSING 0x01
+#define CBC_CTX_STS_LAST   0x02
+#define CBC_CTX_STS_COMPLETE   0x04
+
+enum cbc_ctx_error {
+   CBC_CTX_ERROR_NONE   =  0,
+   CBC_CTX_ERROR_INVALID_FLAGS  = -1,
+   CBC_CTX_ERROR_ALREADY_PROCESSING = -2,
+   CBC_CTX_ERROR_ALREADY_COMPLETED  = -3,
+};
+
+#define cbc_ctx_init(ctx, nbytes, op) \
+   do { \
+   (ctx)->flag = (op) | CBC_START; \
+   (ctx)->nbytes = nbytes; \
+   } while (0)
+
+/* AESNI routines to perform cbc decrypt and key expansion */
+
+asmlinkage void aesni_cbc_dec(struct crypto_aes_ctx *ctx, u8 *out,
+ const u8 *in, unsigned int len, u8 *iv);
+asmlinkage int aesni_set_key(struct crypto_aes_ctx *ctx, const u8 *in_key,
+unsigned int key_len);
+
+#endif /* __AES_CBC_MB_CTX_H */
diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb_mgr.h
new file mode 100644
index 000..0def82e
--- /dev/null
+++ 

[PATCH v4 5/5] crypto: AES CBC multi-buffer glue code

2015-12-02 Thread Tim Chen

This patch introduces the multi-buffer job manager which is responsible
for submitting scatter-gather buffers from several AES CBC jobs
to the multi-buffer algorithm. The glue code interfaces with the
underlying algorithm that handles 8 data streams of AES CBC encryption
in parallel. AES key expansion and CBC decryption requests are performed
in a manner similar to the existing AESNI Intel glue driver.

The outline of the algorithm for AES CBC encryption requests is
sketched below:

Any driver requesting the crypto service will place an async crypto
request on the workqueue.  The multi-buffer crypto daemon will pull an
AES CBC encryption request from the work queue and put each request in an
empty data lane for multi-buffer crypto computation.  When all the empty
lanes are filled, computation will commence on the jobs in parallel and
the job with the shortest remaining buffer will get completed and be
returned. To prevent a prolonged stall when no new jobs arrive, we will
flush the workqueue of jobs after a maximum allowable delay has elapsed.

To accommodate the fragmented nature of scatter-gather, we will keep
submitting the next scatter-buffer fragment for a job for multi-buffer
computation until a job is completed and no more buffer fragments remain.
At that time we will pull a new job to fill the now empty data slot.
We check with the multibuffer scheduler to see if there are other
completed jobs to prevent extraneous delay in returning any completed
jobs.

This multi-buffer algorithm should be used for cases where we get at
least 8 streams of crypto jobs submitted at a reasonably high rate.
For a low crypto job submission rate and a low number of data streams,
this algorithm will not be beneficial. The reason is that at a low rate
we do not fill the data lanes before flushing the jobs, so they are
processed without all the data lanes full.  We miss the benefit of
parallel computation and add delay to the processing of each crypto job
at the same time.  Some tuning of the maximum latency parameter may be
needed to get the best performance.
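
In rough pseudocode, the flow described above looks like this (all names
here are illustrative only, not the actual identifiers in the driver):

    for (;;) {
        req = pull_next_request(workqueue);        /* async CBC encrypt request */
        if (req)
            fill_empty_lane(mgr, next_sg_fragment(req));

        if (all_lanes_full(mgr))
            done = submit_jobs(mgr);               /* job with the shortest
                                                      remaining buffer completes */
        else if (idle_longer_than(max_delay))
            done = flush_oldest_job(mgr);          /* bound the added latency */
        else
            done = NULL;

        while (done) {
            if (more_fragments(done->req))
                submit_next_fragment(mgr, done->req);   /* reuse the empty slot */
            else
                complete_request(done->req);

            /* pick up any other jobs that completed meanwhile */
            done = get_completed_job(mgr);
        }
    }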

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 arch/x86/crypto/Makefile|   1 +
 arch/x86/crypto/aes-cbc-mb/Makefile |  22 +
 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c | 835 
 3 files changed, 858 insertions(+)
 create mode 100644 arch/x86/crypto/aes-cbc-mb/Makefile
 create mode 100644 arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c

diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index b9b912a..000db49 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -33,6 +33,7 @@ obj-$(CONFIG_CRYPTO_CRC32_PCLMUL) += crc32-pclmul.o
 obj-$(CONFIG_CRYPTO_SHA256_SSSE3) += sha256-ssse3.o
 obj-$(CONFIG_CRYPTO_SHA512_SSSE3) += sha512-ssse3.o
 obj-$(CONFIG_CRYPTO_CRCT10DIF_PCLMUL) += crct10dif-pclmul.o
+obj-$(CONFIG_CRYPTO_AES_CBC_MB) += aes-cbc-mb/
 obj-$(CONFIG_CRYPTO_POLY1305_X86_64) += poly1305-x86_64.o
 
 # These modules require assembler to support AVX.
diff --git a/arch/x86/crypto/aes-cbc-mb/Makefile b/arch/x86/crypto/aes-cbc-mb/Makefile
new file mode 100644
index 000..b642bd8
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/Makefile
@@ -0,0 +1,22 @@
+#
+# Arch-specific CryptoAPI modules.
+#
+
+avx_supported := $(call as-instr,vpxor %xmm0$(comma)%xmm0$(comma)%xmm0,yes,no)
+
+# we need decryption and key expansion routine symbols
+# if either AESNI_NI_INTEL or AES_CBC_MB is a module
+
+ifeq ($(CONFIG_CRYPTO_AES_NI_INTEL),m)
+   dec_support := ../aesni-intel_asm.o
+endif
+ifeq ($(CONFIG_CRYPTO_AES_CBC_MB),m)
+   dec_support := ../aesni-intel_asm.o
+endif
+
+ifeq ($(avx_supported),yes)
+   obj-$(CONFIG_CRYPTO_AES_CBC_MB) += aes-cbc-mb.o
+   aes-cbc-mb-y := $(dec_support) aes_cbc_mb.o aes_mb_mgr_init.o \
+   mb_mgr_inorder_x8_asm.o mb_mgr_ooo_x8_asm.o \
+   aes_cbc_enc_x8.o
+endif
diff --git a/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c
new file mode 100644
index 000..4d16a5d
--- /dev/null
+++ b/arch/x86/crypto/aes-cbc-mb/aes_cbc_mb.c
@@ -0,0 +1,835 @@
+/*
+ * Multi buffer AES CBC algorithm glue code
+ *
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Contact Information:
+ * James Guilford 
+ 

[PATCH v4 1/5] crypto: Multi-buffer encryption infrastructure support

2015-12-02 Thread Tim Chen

In this patch, the infrastructure needed to support a multibuffer
encryption implementation is added:

a) Enhance the mcryptd daemon to support blkcipher requests.

b) Update the configuration to include multi-buffer encryption build support.

For an introduction to the multi-buffer implementation, please see
http://www.intel.com/content/www/us/en/communications/communications-ia-multi-buffer-paper.html

Originally-by: Chandramouli Narayanan 
Signed-off-by: Tim Chen 
---
 crypto/Kconfig   |  16 +++
 crypto/mcryptd.c | 256 ++-
 include/crypto/algapi.h  |   1 +
 include/crypto/mcryptd.h |  36 +++
 4 files changed, 308 insertions(+), 1 deletion(-)

diff --git a/crypto/Kconfig b/crypto/Kconfig
index 7240821..6b51084 100644
--- a/crypto/Kconfig
+++ b/crypto/Kconfig
@@ -888,6 +888,22 @@ config CRYPTO_AES_NI_INTEL
  ECB, CBC, LRW, PCBC, XTS. The 64 bit version has additional
  acceleration for CTR.
 
+config CRYPTO_AES_CBC_MB
+   tristate "AES CBC algorithm (x86_64 Multi-Buffer, Experimental)"
+   depends on X86 && 64BIT
+   select CRYPTO_ABLK_HELPER
+   select CRYPTO_MCRYPTD
+   help
+ AES CBC encryption implemented using multi-buffer technique.
+ This algorithm computes on multiple data lanes concurrently with
+ SIMD instructions for better throughput.  It should only be
+ used when there is significant work to generate many separate
+ crypto requests that keep all the data lanes filled to get
+ the performance benefit.  If the data lanes are unfilled, a
+ flush operation will be initiated after some delay to process
+ the existing crypto jobs, adding some extra latency in the low
+ load case.
+
 config CRYPTO_AES_SPARC64
tristate "AES cipher algorithms (SPARC64)"
depends on SPARC64
diff --git a/crypto/mcryptd.c b/crypto/mcryptd.c
index fe5b495a..01f747c 100644
--- a/crypto/mcryptd.c
+++ b/crypto/mcryptd.c
@@ -116,8 +116,28 @@ static int mcryptd_enqueue_request(struct mcryptd_queue *queue,
return err;
 }
 
+static int mcryptd_enqueue_blkcipher_request(struct mcryptd_queue *queue,
+ struct crypto_async_request *request,
+ struct mcryptd_blkcipher_request_ctx *rctx)
+{
+   int cpu, err;
+   struct mcryptd_cpu_queue *cpu_queue;
+
+   cpu = get_cpu();
+   cpu_queue = this_cpu_ptr(queue->cpu_queue);
+   rctx->tag.cpu = cpu;
+
+   err = crypto_enqueue_request(&cpu_queue->queue, request);
+   pr_debug("enqueue request: cpu %d cpu_queue %p request %p\n",
+            cpu, cpu_queue, request);
+   queue_work_on(cpu, kcrypto_wq, &cpu_queue->work);
+   put_cpu();
+
+   return err;
+}
+
 /*
- * Try to opportunisticlly flush the partially completed jobs if
+ * Try to opportunistically flush the partially completed jobs if
  * crypto daemon is the only task running.
  */
 static void mcryptd_opportunistic_flush(void)
@@ -225,6 +245,130 @@ static inline struct mcryptd_queue *mcryptd_get_queue(struct crypto_tfm *tfm)
return ictx->queue;
 }
 
+static int mcryptd_blkcipher_setkey(struct crypto_ablkcipher *parent,
+  const u8 *key, unsigned int keylen)
+{
+   struct mcryptd_blkcipher_ctx *ctx = crypto_ablkcipher_ctx(parent);
+   struct crypto_blkcipher *child = ctx->child;
+   int err;
+
+   crypto_blkcipher_clear_flags(child, CRYPTO_TFM_REQ_MASK);
+   crypto_blkcipher_set_flags(child, crypto_ablkcipher_get_flags(parent) &
+ CRYPTO_TFM_REQ_MASK);
+   err = crypto_blkcipher_setkey(child, key, keylen);
+   crypto_ablkcipher_set_flags(parent, crypto_blkcipher_get_flags(child) &
+   CRYPTO_TFM_RES_MASK);
+   return err;
+}
+
+static void mcryptd_blkcipher_crypt(struct ablkcipher_request *req,
+  struct crypto_blkcipher *child,
+  int err,
+  int (*crypt)(struct blkcipher_desc *desc,
+   struct scatterlist *dst,
+   struct scatterlist *src,
+   unsigned int len))
+{
+   struct mcryptd_blkcipher_request_ctx *rctx;
+   struct blkcipher_desc desc;
+
+   rctx = ablkcipher_request_ctx(req);
+
+   if (unlikely(err == -EINPROGRESS))
+   goto out;
+
+   /* set up the blkcipher request to work on */
+   desc.tfm = child;
+   desc.info = req->info;
+   desc.flags = CRYPTO_TFM_REQ_MAY_SLEEP;
+   rctx->desc = desc;
+
+   /*
+* pass addr of descriptor stored in the request context
+* so that the callee can get to the request context
+*/
+   err = crypt(&rctx->desc, 

Re: ipsec impact on performance

2015-12-02 Thread Tom Herbert
On Wed, Dec 2, 2015 at 1:12 PM, Sowmini Varadhan wrote:
> On (12/02/15 13:07), Tom Herbert wrote:
>> That's easy enough to add to flow dissector, but is SPI really
>> intended to be used an L4 entropy value? We would need to consider the
>
> yes. To quote https://en.wikipedia.org/wiki/Security_Parameter_Index
> "This works like port numbers in TCP and UDP connections. What it means
>  is that there could be different SAs used to provide security to one
>  connection. An SA could therefore act as a set of rules."
>
>> effects of running multiple TCP connections over an IPsec. Also, you
>> might want to try IPv6, the flow label should provide a good L4 hash
>> for RPS/RFS, it would be interesting to see what the effects are with
>> IPsec processing. (ESP/UDP could also if RSS/ECMP is critical)
>
> IPv6 would be an interesting academic exercise, but it's going
> to be a while before we get RDS-TCP to go over IPv6.
>
Huh? Who said anything about RDS-TCP? I thought you were trying to
improve IPsec performance...


Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 13:07), Tom Herbert wrote:
> That's easy enough to add to flow dissector, but is SPI really
> intended to be used an L4 entropy value? We would need to consider the

yes. To quote https://en.wikipedia.org/wiki/Security_Parameter_Index
"This works like port numbers in TCP and UDP connections. What it means
 is that there could be different SAs used to provide security to one
 connection. An SA could therefore act as a set of rules."

> effects of running multiple TCP connections over an IPsec. Also, you
> might want to try IPv6, the flow label should provide a good L4 hash
> for RPS/RFS, it would be interesting to see what the effects are with
> IPsec processing. (ESP/UDP could also if RSS/ECMP is critical)

IPv6 would be an interesting academic exercise, but it's going
to be a while before we get RDS-TCP to go over IPv6.

--Sowmini



Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 14:01), Tom Herbert wrote:
> No, please don't persist is this myopic "we'll get to IPv6 later"
> model! IPv6 is a real protocol, it has significant deployment of the
> Internet, and there are now whole data centers that are IPv6 only
> (e.g. FB), and there are plenty of use cases of IPSEC/IPv6 that could
> benefit from performance improvements just as much as IPv4. This vendor
> mentality that IPv6 is still not important simply doesn't help
> matters. :-(

Ok, I'll get you the numbers for this later, and sure, if we do
this, we should solve the ipv6 problem too.

BTW, the ipv6 nov3 paths have severe alignment issues. I flagged
this a long time ago http://www.spinics.net/lists/netdev/msg336257.html

I think all of it is triggered by mld. Someone needs to do
something about that too. I don't think those paths are using
NET_ALIGN very well, and I don't think this is the most wholesome
thing for perf.

--Sowmini


Re: [PATCH] crypto: n2 - Use platform_register/unregister_drivers()

2015-12-02 Thread David Miller
From: Thierry Reding 
Date: Wed,  2 Dec 2015 17:16:36 +0100

> From: Thierry Reding 
> 
> These new helpers simplify implementing multi-driver modules and
> properly handle failure to register one driver by unregistering all
> previously registered drivers.
> 
> Signed-off-by: Thierry Reding 

Acked-by: David S. Miller 


Re: ipsec impact on performance

2015-12-02 Thread Tom Herbert
On Wed, Dec 2, 2015 at 12:50 PM, Sowmini Varadhan wrote:
> On (12/02/15 12:41), David Laight wrote:
>> You are getting 0.7 Gbps with aes-ccm-a-128, scale the esp-null back to
>> that and it would use 7/18*71 = 27% of the cpu.
>> So 69% of the cpu in the a-128 case is probably caused by the
>> encryption itself.
>> Even if the rest of the code cost nothing you'd not increase
>> above 1Gbps.
>
> Fortunately, the situation is not quite hopeless yet.
>
> Thanks to Rick Jones for supplying the hints for this, but with
> some careful manual pinning of irqs and iperf processes to cpus,
> I can get to 4.5 Gbps for the esp-null case.
>
> Given that the [clear traffic + GSO without GRO] gets me about 5-7 Gbps,
> the 4.5 Gbps is not that far off (and at that point, the nickel-and-dime
> tweaks may help even more).
>
> For AES-GCM, I'm able to go from 1.8 Gbps (no GSO) to 2.8 Gbps.
> Still not great, but proves that we haven't yet hit any upper bounds
> yet.
>
> I think a lot of the manual tweaking of irq/process placement
> is needed because the existing rps/rfs flow steering is looking
> for TCP/UDP flow numbers to do the steering. It can just as easily
> use the IPsec SPI numbers to do this, and that's another place where
> we can make this more ipsec-friendly.
>
That's easy enough to add to flow dissector, but is SPI really
intended to be used as an L4 entropy value? We would need to consider the
effects of running multiple TCP connections over an IPsec. Also, you
might want to try IPv6, the flow label should provide a good L4 hash
for RPS/RFS, it would be interesting to see what the effects are with
IPsec processing. (ESP/UDP could also if RSS/ECMP is critical)

Tom


Re: ipsec impact on performance

2015-12-02 Thread Eric Dumazet
On Wed, 2015-12-02 at 16:12 -0500, Sowmini Varadhan wrote:

> IPv6 would be an interesting academic exercise

Really, you made my day !




Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 07:53), Steffen Klassert wrote:
> 
> I'm currently working on a GRO/GSO codepath for IPsec too. The GRO part
> works already. I decapsulate/decrypt the packets on layer2 with a esp GRO
> callback function and reinject them into napi_gro_receive(). So in case
> the decapsulated packet is TCP, GRO can aggregate big packets.

Would you be able to share your patch with me? I'd like to give that a try
just to get preliminary numbers (and I could massage it as needed
for transport mode too).

> My approach to GSO is a bit different to yours. I focused on tunnel mode,
> but transport mode should work too. I encapsulate the big GSO packets
> but don't do the encryption. Then I've added a esp_gso_segment() function,
> so the (still not encrypted ESP packets) get segmented with GSO. Finally I
> do encryption for all segments. This works well as long as I do sync crypto.
> The hard part is when crypto returns async. This is what I'm working on now.
> I hope to get this ready during the next weeks that I can post a RFC version
> and some numbers.

I see. My thought for attacking tunnel mode would have been to 
call out the esp code at the tail of gre_gso_segment, but I did not
yet consider this carefully - clearly you've spent more time on it,
and know more about all the gotchas there.
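
To make sure I understand, the flow you describe, in rough pseudocode
(my paraphrase with made-up names, not your actual code):

    esp_output(big_gso_skb):
        add the ESP header/trailer (encapsulate), but do not encrypt yet

    esp_gso_segment(big_gso_skb):
        segs = segment the still-cleartext payload as usual
        for each seg:
            fix up the per-segment ESP fields
            encrypt(seg)      # straightforward when sync, hard when async
        return segs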

> Also I tried to consider the IPsec GRO/GSO codepath as a software fallback.
> So I added hooks for the encapsulation, encryption etc. If a NIC can do
> IPsec, it can use this hooks to prepare the packets the way it needs it.
> There are NICs that can do IPsec, it's just that our stack does not support
> it.

yes, this is one of the things I wanted to bring up at netdev 1.1.
Evidently many of the 10G NICs (Niantic, Twinville, Sageville) already
support ipsec offload but that feature is not enabled for BSD or linux
because the stack does not support it (though Microsoft does. The intel
folks pointed me at this doc:
https://msdn.microsoft.com/en-us/library/windows/hardware/ff556996%28v=vs.85%29.aspx)

But quite independent of h/w offload, the s/w stack can already do
a very good job for 10G with just GSO and GRO, so being able to extend
that path to do encryption after segmentation should at least bridge
the huge gap between the ipsec and non-ipsec mech.

And that gap should be as small as possible for esp-null, so that
the only big hit we take is for the complexity of encryption itself!

> Another thing, I thought about setting up an IPsec BoF/workshop at
> netdev1.1. My main topic is GRO/GSO for IPsec. I'll send out a mail
> to the list later this week to see if there is enough interest and
> maybe some additional topics.

Sounds like an excellent idea. I'm certainly interested.

--Sowmini
> 


Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 11:56), David Laight wrote:
> >                Gbps  peak cpu util
> > esp-null       1.8   71%
> > aes-gcm-c-256  1.6   79%
> > aes-ccm-a-128  0.7   96%
> > 
> > That trend made me think that if we can get esp-null to be as close
> > as possible to GSO/GRO, the rest will follow closely behind.
> 
> That's not how I read those figures.
> They imply to me that there is a massive cost for the actual encryption
> (particularly for aes-ccm-a-128) - so whatever you do to the esp-null
> case won't help.

I'm not a crypto expert, but my understanding is that the CCM mode
is the "older" encryption algorithm, and GCM is the way of the future.
Plus, I think the GCM mode has some type of h/w support (hence the
lower cpu util)

I'm sure that crypto has a cost, not disputing that, but my point
was that 1.8 -> 1.6 -> 0.7 is a curve with a much gentler slope than
the 9 Gbps (clear traffic, GSO, GRO) 
-> 4 Gbps (clear, no gro, gso) 
   -> 1.8 (esp-null)
That steeper slope smells of s/w perf that we need to resolve first,
before getting into the work of faster crypto?

> One way to get a view of the cost of the encryption (and copies)
> is to do the operation twice.

I could also just instrument it with perf tracepoints, if that 
data is interesting

--Sowmini




RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 01 December 2015 18:37
...
> I was using esp-null merely to not have the crypto itself perturb
> the numbers (i.e., just focus on the s/w overhead for now), but here
> are the numbers for the stock linux kernel stack
>                Gbps  peak cpu util
> esp-null       1.8   71%
> aes-gcm-c-256  1.6   79%
> aes-ccm-a-128  0.7   96%
> 
> That trend made me think that if we can get esp-null to be as close
> as possible to GSO/GRO, the rest will follow closely behind.

That's not how I read those figures.
They imply to me that there is a massive cost for the actual encryption
(particularly for aes-ccm-a-128) - so whatever you do to the esp-null
case won't help.

One way to get a view of the cost of the encryption (and copies)
is to do the operation twice.

David



RE: ipsec impact on performance

2015-12-02 Thread David Laight
From: Sowmini Varadhan
> Sent: 02 December 2015 12:12
> On (12/02/15 11:56), David Laight wrote:
> > >                Gbps  peak cpu util
> > > esp-null       1.8   71%
> > > aes-gcm-c-256  1.6   79%
> > > aes-ccm-a-128  0.7   96%
> > >
> > > That trend made me think that if we can get esp-null to be as close
> > > as possible to GSO/GRO, the rest will follow closely behind.
> >
> > That's not how I read those figures.
> > They imply to me that there is a massive cost for the actual encryption
> > (particularly for aes-ccm-a-128) - so whatever you do to the esp-null
> > case won't help.
> 
> I'm not a crypto expert, but my understanding is that the CCM mode
> is the "older" encryption algorithm, and GCM is the way of the future.
> Plus, I think the GCM mode has some type of h/w support (hence the
> lower cpu util)
> 
> I'm sure that crypto has a cost, not disputing that, but my point
> was that 1.8 -> 1.6 -> 0.7 is a curve with a much gentler slope than
> the 9 Gbps (clear traffic, GSO, GRO)
> -> 4 Gbps (clear, no gro, gso)
>-> 1.8 (esp-null)
> That steeper slope smells of s/w perf that we need to resolve first,
> before getting into the work of faster crypto?

That isn't the way cpu cost works.
You are getting 0.7 Gbps with aes-ccm-a-128, scale the esp-null back to
that and it would use 7/18*71 = 27% of the cpu.
So 69% of the cpu in the a-128 case is probably caused by the
encryption itself.
Even if the rest of the code cost nothing you'd not increase
above 1Gbps.

The sums for aes-gcm-c-256 are slightly better, about 15%.
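
Working that through with the figures quoted above:

  esp-null:            1.8 Gbps at 71% cpu
  scaled to 0.7 Gbps:  0.7/1.8 * 71 ~= 27% of the cpu
  aes-ccm-a-128:       0.7 Gbps at 96% cpu
  encryption cost:     96 - 27 ~= 69% of the cpu

  aes-gcm-c-256:       1.6 Gbps at 79% cpu
  esp-null scaled:     1.6/1.8 * 71 ~= 63%
  encryption cost:     79 - 63 ~= 16%, roughly the "about 15%" above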

Ok, things aren't quite that simple since you are probably changing
the way data flows through the system as well.

Also what/how are you measuring cpu use.
I'm not sure anything on Linux gives you a truly accurate value
when processes are running for very short periods.

On an SMP system you also get big effects when work is switched
between cpus. I've got some tests that run a lot faster if I
put all but one of the cpus into a busy-loop in userspace
(eg: while :; do :; done)!

David




Re: ipsec impact on performance

2015-12-02 Thread Sowmini Varadhan
On (12/02/15 12:41), David Laight wrote:
> 
> Also what/how are you measuring cpu use.
> I'm not sure anything on Linux gives you a truly accurate value
> when processes are running for very short periods.

I was using mpstat while running iperf. Should I be using
something else, or running it for longer intervals?

but I hope we are not doomed at 1 Gbps, or else security itself would
come at a very unattractive cost. Anyway, even aside from crypto,
we need to have some way to add TCP options (that depend on the
contents of the tcp header) etc. post-GSO, in the interest of not
ossifying the stack.

> On an SMP system you also get big effects when work is switched
> between cpus. I've got some tests that run a lot faster if I
> put all but one of the cpus into a busy-loop in userspace
> (eg: while :; do :; done)!

yes Rick Jones also pointed the same thing to me, and one of the
things I was going to try out later today is to instrument the
effects of pinning irqs and iperf threads to a specific cpu.

--Sowmini




Re: ipsec impact on performance

2015-12-02 Thread Rick Jones

On 12/02/2015 03:56 AM, David Laight wrote:

From: Sowmini Varadhan

Sent: 01 December 2015 18:37

...

I was using esp-null merely to not have the crypto itself perturb
the numbers (i.e., just focus on the s/w overhead for now), but here
are the numbers for the stock linux kernel stack
               Gbps  peak cpu util
esp-null       1.8   71%
aes-gcm-c-256  1.6   79%
aes-ccm-a-128  0.7   96%

That trend made me think that if we can get esp-null to be as close
as possible to GSO/GRO, the rest will follow closely behind.


That's not how I read those figures.
They imply to me that there is a massive cost for the actual encryption
(particularly for aes-ccm-a-128) - so whatever you do to the esp-null
case won't help.



To build on the whole "importance of normalizing throughput and CPU 
utilization in some way" theme, the following are some non-IPSec netperf 
TCP_STREAM runs between a pair of 2xIntel E5-2603 v3 systems using 
Broadcom BCM57810-based NICs, 4.2.0-19 kernel, 7.10.72 firmware and 
bnx2x driver version 1.710.51-0:



root@htx-scale300-258:~# ./take_numbers.sh
Baseline
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
10.12.49.1 () port 0 AF_INET : +/-2.500% @ 99% conf.  : demo : cpu bind

Throughput Local  Local   Local    Remote Remote  Remote   Throughput Local      Remote
           CPU    Service Peak     CPU    Service Peak     Confidence CPU        CPU
           Util   Demand  Per CPU  Util   Demand  Per CPU  Width (%)  Confidence Confidence
           %              Util %   %              Util %              Width (%)  Width (%)
9414.11    1.87   0.195   26.54    3.70   0.387   45.42    0.002      7.073      1.276

Disable TSO/GSO
5651.25    8.36   1.454   100.00   2.46   0.428   30.35    1.093      1.101      4.889

Disable tx CKO
5287.69    8.46   1.573   100.00   2.34   0.435   29.66    0.428      7.710      3.518

Disable remote LRO/GRO
4148.76    8.32   1.971   99.97    5.95   1.409   71.98    3.656      0.735      3.491

Disable remote rx CKO
4204.49    8.31   1.942   100.00   6.68   1.563   82.05    2.015      0.437      4.921


You can see that as the offloads are disabled, the service demands (usec 
of CPU time consumed systemwide per KB of data transferred) go up, and 
until one hits a bottleneck (eg one of the CPUs pegs at 100%), go up 
faster than the throughputs go down.
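
For those following along, the service demand arithmetic for the baseline
row works out as follows (the two E5-2603 v3s present 12 CPUs; I'm taking
netperf's KB as 1024 bytes):

  9414.11 Mbit/s / 8 / 1024      ~= 1.149e6 KB/s transferred
  1.87% CPU util * 12 CPUs       ~= 0.224 CPU-seconds per second ~= 224,400 usec/s
  224,400 usec/s / 1.149e6 KB/s  ~= 0.195 usec/KB

which matches the reported local service demand of 0.195.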


To aid in reproducibility those tests were with irqbalance disabled, all 
the IRQs for the NICs pointed at CPU 0, netperf/netserver bound to CPU 
0, and the power management set to static high performance.


Assuming I've created a "matching" ipsec.conf, here is what I see with 
esp=null-null on the TCP_STREAM test - again, keeping all the binding in 
place etc:


3077.37    8.01   2.560   97.78    8.21   2.625   99.41    4.869      1.876      0.955


You can see that even with the null-null, there is a rather large 
increase in service demand.


And this is what I see when I run netperf TCP_RR (first is without 
ipsec, second is with. I didn't ask for confidence intervals this time 
around and I didn't try to tweak interrupt coalescing settings)


# HDR="-P 1";for i in 10.12.49.1 192.168.0.2; do ./netperf -H $i -t 
TCP_RR -c -C -l 30 -T 0 $HDR; HDR="-P 0"; done
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET 
to 10.12.49.1 () port 0 AF_INET : demo : first burst 0 : cpu bind

Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPUCPUS.dem   S.dem
Send   Recv   SizeSize   TimeRate local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  % S% Sus/Tr   us/Tr

16384  87380  1   1  30.00   30419.75  1.72   1.68   6.783   6.617
16384  87380
16384  87380  1   1  30.00   20711.39  2.15   2.05   12.450  11.882
16384  87380

The service demand increases ~83% on the netperf side and almost 80% on 
the netserver side.  That is pure "effective" path-length increase.


happy benchmarking,

rick jones

PS - the netperf commands were varations on this theme:
./netperf -P 0 -T 0 -H 10.12.49.1 -c -C -l 30 -i 30,3 -- -O throughput,local_cpu_util,local_sd,local_cpu_peak_util,remote_cpu_util,remote_sd,remote_cpu_peak_util,throughput_confid,local_cpu_confid,remote_cpu_confid
altering IP address or test as appropriate.  -P 0 disables printing the 
test banner/headers.  -T 0 binds netperf and netserver to CPU0 on their 
respective systems.  -H sets the destination, -c and -C ask for local 
and remote CPU measurements respectively.  -l 30 says each test 
iteration should be 30 seconds long and -i 30,3 says to run at least 
three iterations and no more than 30 when trying to hit the confidence 
interval - by default 99% confident the average reported is within +/- 
2.5% of the "actual" average.  The -O stuff is selecting specific values 
to be emitted.
