Re: [PATCH 1/6] mailbox: Add new API mbox_channel_device() for clients

2017-02-05 Thread Anup Patel
On Fri, Feb 3, 2017 at 5:35 PM, Jassi Brar  wrote:
> On Thu, Feb 2, 2017 at 10:17 AM, Anup Patel  wrote:
>> The remote processor can have DMAENGINE capabilities, and the client
>> can pass data to be processed via main memory. In such cases, the
>> client will require DMA-able memory for the remote processor.
>>
>> This patch adds a new API, mbox_channel_device(), which clients can
>> use to get the struct device pointer of the underlying mailbox
>> controller. Clients can then use this pointer to allocate DMA-able
>> memory for the remote processor.
>>
> IIUC, DT already provides a way for what you need.

Thanks for the suggestion. I will explore this direction and try to
drop this patch in the next revision.
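
For concreteness, here is the direction I understand you to mean (a
minimal sketch, assuming the standard reserved-memory / "memory-region"
binding; the probe function and all names below are made up for
illustration):

#include <linux/dma-mapping.h>
#include <linux/of_reserved_mem.h>
#include <linux/platform_device.h>
#include <linux/sizes.h>

/*
 * Hypothetical client probe: bind the reserved-memory region named
 * by this device's "memory-region" DT property, then allocate
 * DMA-able memory against the client's own device instead of
 * reaching into the mailbox controller.
 */
static int client_probe(struct platform_device *pdev)
{
        dma_addr_t dma_handle;
        void *vaddr;
        int ret;

        ret = of_reserved_mem_device_init(&pdev->dev);
        if (ret)
                return ret;

        vaddr = dma_alloc_coherent(&pdev->dev, SZ_4K, &dma_handle,
                                   GFP_KERNEL);
        if (!vaddr)
                return -ENOMEM;

        /* pass dma_handle to the remote processor via the mailbox */
        return 0;
}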

Can you please have a look at the FlexRM driver which I had
submitted previously?
https://lkml.org/lkml/2017/1/5/291
https://lkml.org/lkml/2017/1/5/293

Regards,
Anup


Re: [PATCH 3/6] async_tx: Handle DMA devices having support for fewer PQ coefficients

2017-02-05 Thread Anup Patel
On Sat, Feb 4, 2017 at 12:12 AM, Dan Williams  wrote:
> On Fri, Feb 3, 2017 at 2:59 AM, Anup Patel  wrote:
>>
>>
>> On Thu, Feb 2, 2017 at 11:31 AM, Dan Williams wrote:
>>>
>>> On Wed, Feb 1, 2017 at 8:47 PM, Anup Patel wrote:
>>> > The DMAENGINE framework assumes that if PQ offload is supported by a
>>> > DMA device then all 256 PQ coefficients are supported. This assumption
>>> > does not hold anymore because we now have the BCM-SBA-RAID offload
>>> > engine, which supports PQ offload with a limited number of PQ
>>> > coefficients.
>>> >
>>> > This patch extends async_tx APIs to handle DMA devices with support
>>> > for fewer PQ coefficients.
>>> >
>>> > Signed-off-by: Anup Patel 
>>> > Reviewed-by: Scott Branden 
>>> > ---
>>> >  crypto/async_tx/async_pq.c  |  3 +++
>>> >  crypto/async_tx/async_raid6_recov.c | 12 ++--
>>> >  include/linux/dmaengine.h   | 19 +++
>>> >  include/linux/raid/pq.h |  3 +++
>>> >  4 files changed, 35 insertions(+), 2 deletions(-)
>>>
>>> So, I hate the way async_tx does these checks on each operation, and
>>> it's ok for me to say that because it's my fault. Really it's md that
>>> should be validating engine offload capabilities once at the beginning
>>> of time. I'd rather we move in that direction than continue to pile
>>> onto a bad design.
>>
>>
>> Yes, indeed. All async_tx APIs have a lot of checks, and for a
>> high-throughput RAID offload engine these checks add measurable overhead.
>>
>> Doing the checks once in Linux md would certainly be better, but that
>> would mean a lot of changes in Linux md as well as removing the checks
>> from async_tx.
>>
>> Also, the async_tx APIs should not find a DMA channel on their own;
>> instead they should rely on Linux md to provide the DMA channel pointer
>> as a parameter.
>>
>> It's better to do the checks cleanup in async_tx as a separate patchset
>> and keep this patchset simple.
>
> That's been the problem with async_tx being broken like this for
> years. Once you get this "small / simple" patch upstream, one that
> arguably makes async_tx a little bit worse, there is no longer any
> motivation to fix the underlying issues. If you care about the
> long-term health of raid offload and are enabling new hardware
> support, you should first tackle the known problems with it before
> adding new features.

Apart from the checks-related issue you pointed out, there are other
issues with the async_tx APIs, such as:

1. The mechanism for doing an update PQ (or RAID6 update) operation
with the current async_tx APIs is to call async_gen_syndrome() twice
with the ASYNC_TX_PQ_XOR_DST flag set. Also, async_gen_syndrome()
will always prefer the SW approach when ASYNC_TX_PQ_XOR_DST is set.
This means the async_tx API forces the SW approach for the update PQ
operation and, in addition, requires two async_gen_syndrome() calls
to achieve it. These limitations of async_gen_syndrome() reduce the
performance of the async_tx APIs. Instead, we should have a dedicated
async_update_pq() API (see the sketch after this list) which would
allow RAID offload engine drivers (such as BCM-FS4-RAID) to implement
update PQ using HW offload, falling back to the SW approach via
async_gen_syndrome() when no DMA channel provides update PQ HW
offload.

2. In our stress testing, we have observed that the dma_map_page()
and dma_unmap_page() calls used in the various async_tx APIs are the
major cause of overhead. If we call the DMA channel callbacks directly
with pre-DMA-mapped pages, we get very high throughput. The async_tx
APIs should provide a way to pass pre-DMA-mapped pages so that
Linux MD can exploit this for better performance.

3. We really don't have a test module to stress/benchmark all
async_tx APIs using multi-threading and batching a large number of
requests in each thread. Such a test module is very much required
for performance benchmarking and for stressing high-throughput
(hundreds of Gbps) RAID offload engines (such as BCM-FS4-RAID).
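
To make point 1 concrete, here is a rough sketch of what such an API
could look like (entirely hypothetical; find_pq_update_channel() and
issue_hw_pq_update() do not exist and only illustrate the intended
structure):

/*
 * Hypothetical async_update_pq(): let a channel that can do the P/Q
 * read-modify-write in HW handle it directly; otherwise fall back to
 * the two ASYNC_TX_PQ_XOR_DST async_gen_syndrome() calls that
 * callers must open-code today.
 */
struct dma_async_tx_descriptor *
async_update_pq(struct page **old_blocks, struct page **new_blocks,
                unsigned int offset, int disks, size_t len,
                struct async_submit_ctl *submit)
{
        /* hypothetical capability lookup */
        struct dma_chan *chan = find_pq_update_channel(submit);

        if (chan)
                /* hypothetical HW offload path */
                return issue_hw_pq_update(chan, old_blocks, new_blocks,
                                          offset, disks, len, submit);

        /* SW fallback: XOR the old data out of P/Q, then the new data in */
        submit->flags |= ASYNC_TX_PQ_XOR_DST;
        async_gen_syndrome(old_blocks, offset, disks, len, submit);
        return async_gen_syndrome(new_blocks, offset, disks, len, submit);
}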

From the above, we already have an async_tx_test module to address
point 3. We also plan to address point 1, but that would also require
changes in Linux MD to use the new async_update_pq() API.

As you can see, this patchset is not the end of the story for us if
we want the best possible utilization of BCM-FS4-RAID.

Regards,
Anup


[PATCH v7 5/5] lib/lz4: Remove back-compat wrappers

2017-02-05 Thread Sven Schmidt
This patch removes the functions introduced as wrappers to provide
backwards compatibility with the prior LZ4 version.
They're no longer needed since there are no callers left.

Signed-off-by: Sven Schmidt <4ssch...@informatik.uni-hamburg.de>
---
 include/linux/lz4.h  | 69 
 lib/lz4/lz4_compress.c   | 22 ---
 lib/lz4/lz4_decompress.c | 42 -
 lib/lz4/lz4hc_compress.c | 23 
 4 files changed, 156 deletions(-)

diff --git a/include/linux/lz4.h b/include/linux/lz4.h
index 1b7ab2a..a3912d7 100644
--- a/include/linux/lz4.h
+++ b/include/linux/lz4.h
@@ -173,18 +173,6 @@ static inline int LZ4_compressBound(size_t isize)
 }
 
 /**
- * lz4_compressbound() - For backwards compatibility; see LZ4_compressBound
- * @isize: Size of the input data
- *
- * Return: Max. size LZ4 may output in a "worst case" szenario
- * (data not compressible)
- */
-static inline int lz4_compressbound(size_t isize)
-{
-   return LZ4_COMPRESSBOUND(isize);
-}
-
-/**
  * LZ4_compress_default() - Compress data from source to dest
  * @source: source address of the original data
  * @dest: output buffer address of the compressed data
@@ -257,20 +245,6 @@ int LZ4_compress_fast(const char *source, char *dest, int inputSize,
 int LZ4_compress_destSize(const char *source, char *dest, int *sourceSizePtr,
int targetDestSize, void *wrkmem);
 
-/*
- * lz4_compress() - For backward compatibility, see LZ4_compress_default
- * @src: source address of the original data
- * @src_len: size of the original data
- * @dst: output buffer address of the compressed data. This requires 'dst'
- * of size LZ4_COMPRESSBOUND
- * @dst_len: is the output size, which is returned after compress done
- * @workmem: address of the working memory.
- *
- * Return: Success if return 0, Error if return < 0
- */
-int lz4_compress(const unsigned char *src, size_t src_len, unsigned char *dst,
-   size_t *dst_len, void *wrkmem);
-
 /*-
  * Decompression Functions
  **/
@@ -346,34 +320,6 @@ int LZ4_decompress_safe(const char *source, char *dest, int compressedSize,
 int LZ4_decompress_safe_partial(const char *source, char *dest,
int compressedSize, int targetOutputSize, int maxDecompressedSize);
 
-/*
- * lz4_decompress_unknownoutputsize() - For backwards compatibility,
- * see LZ4_decompress_safe
- * @src: source address of the compressed data
- * @src_len: is the input size, therefore the compressed size
- * @dest: output buffer address of the decompressed data
- * which must be already allocated
- * @dest_len: is the max size of the destination buffer, which is
- * returned with actual size of decompressed data after decompress done
- *
- * Return: Success if return 0, Error if return (< 0)
- */
-int lz4_decompress_unknownoutputsize(const unsigned char *src, size_t src_len,
-   unsigned char *dest, size_t *dest_len);
-
-/**
- * lz4_decompress() - For backwards cocmpatibility, see LZ4_decompress_fast
- * @src: source address of the compressed data
- * @src_len: is the input size, which is returned after decompress done
- * @dest: output buffer address of the decompressed data,
- * which must be already allocated
- * @actual_dest_len: is the size of uncompressed data, supposing it's known
- *
- * Return: Success if return 0, Error if return (< 0)
- */
-int lz4_decompress(const unsigned char *src, size_t *src_len,
-   unsigned char *dest, size_t actual_dest_len);
-
 /*-
  * LZ4 HC Compression
  **/
@@ -401,21 +347,6 @@ int LZ4_compress_HC(const char *src, char *dst, int srcSize, int dstCapacity,
int compressionLevel, void *wrkmem);
 
 /**
- * lz4hc_compress() - For backwards compatibility, see LZ4_compress_HC
- * @src: source address of the original data
- * @src_len: size of the original data
- * @dst: output buffer address of the compressed data. This requires 'dst'
- * of size LZ4_COMPRESSBOUND.
- * @dst_len: is the output size, which is returned after compress done
- * @wrkmem: address of the working memory.
- * This requires 'workmem' of size LZ4HC_MEM_COMPRESS.
- *
- * Return  : Success if return 0, Error if return (< 0)
- */
-int lz4hc_compress(const unsigned char *src, size_t src_len, unsigned char *dst,
-   size_t *dst_len, void *wrkmem);
-
-/**
  * LZ4_resetStreamHC() - Init an allocated 'LZ4_streamHC_t' structure
  * @streamHCPtr: pointer to the 'LZ4_streamHC_t' structure
  * @compressionLevel: Recommended values are between 4 and 9, although any
diff --git a/lib/lz4/lz4_compress.c b/lib/lz4/lz4_compress.c
index 6aa7ac3..697dbda 100644
--- a/lib/lz4/lz4_compress.c
+++ b/lib/lz4/lz4_compress.c
@@ 

[PATCH v7 3/5] crypto: Change LZ4 modules to work with new LZ4 module version

2017-02-05 Thread Sven Schmidt
This patch updates the crypto modules using LZ4 compression as well as the
test cases in testmgr.h to work with the new LZ4 module version.

Signed-off-by: Sven Schmidt <4ssch...@informatik.uni-hamburg.de>
---
 crypto/lz4.c |  23 -
 crypto/lz4hc.c   |  23 -
 crypto/testmgr.h | 142 +++
 3 files changed, 120 insertions(+), 68 deletions(-)

diff --git a/crypto/lz4.c b/crypto/lz4.c
index 99c1b2c..71eff9b 100644
--- a/crypto/lz4.c
+++ b/crypto/lz4.c
@@ -66,15 +66,13 @@ static void lz4_exit(struct crypto_tfm *tfm)
 static int __lz4_compress_crypto(const u8 *src, unsigned int slen,
 u8 *dst, unsigned int *dlen, void *ctx)
 {
-   size_t tmp_len = *dlen;
-   int err;
+   int out_len = LZ4_compress_default(src, dst,
+   slen, *dlen, ctx);
 
-   err = lz4_compress(src, slen, dst, &tmp_len, ctx);
-
-   if (err < 0)
+   if (!out_len)
return -EINVAL;
 
-   *dlen = tmp_len;
+   *dlen = out_len;
return 0;
 }
 
@@ -96,16 +94,13 @@ static int lz4_compress_crypto(struct crypto_tfm *tfm, const u8 *src,
 static int __lz4_decompress_crypto(const u8 *src, unsigned int slen,
   u8 *dst, unsigned int *dlen, void *ctx)
 {
-   int err;
-   size_t tmp_len = *dlen;
-   size_t __slen = slen;
+   int out_len = LZ4_decompress_safe(src, dst, slen, *dlen);
 
-   err = lz4_decompress_unknownoutputsize(src, __slen, dst, &tmp_len);
-   if (err < 0)
-   return -EINVAL;
+   if (out_len < 0)
+   return out_len;
 
-   *dlen = tmp_len;
-   return err;
+   *dlen = out_len;
+   return 0;
 }
 
 static int lz4_sdecompress(struct crypto_scomp *tfm, const u8 *src,
diff --git a/crypto/lz4hc.c b/crypto/lz4hc.c
index 75ffc4a..03a34a8 100644
--- a/crypto/lz4hc.c
+++ b/crypto/lz4hc.c
@@ -65,15 +65,13 @@ static void lz4hc_exit(struct crypto_tfm *tfm)
 static int __lz4hc_compress_crypto(const u8 *src, unsigned int slen,
   u8 *dst, unsigned int *dlen, void *ctx)
 {
-   size_t tmp_len = *dlen;
-   int err;
+   int out_len = LZ4_compress_HC(src, dst, slen,
+   *dlen, LZ4HC_DEFAULT_CLEVEL, ctx);
 
-   err = lz4hc_compress(src, slen, dst, &tmp_len, ctx);
-
-   if (err < 0)
+   if (!out_len)
return -EINVAL;
 
-   *dlen = tmp_len;
+   *dlen = out_len;
return 0;
 }
 
@@ -97,16 +95,13 @@ static int lz4hc_compress_crypto(struct crypto_tfm *tfm, const u8 *src,
 static int __lz4hc_decompress_crypto(const u8 *src, unsigned int slen,
 u8 *dst, unsigned int *dlen, void *ctx)
 {
-   int err;
-   size_t tmp_len = *dlen;
-   size_t __slen = slen;
+   int out_len = LZ4_decompress_safe(src, dst, slen, *dlen);
 
-   err = lz4_decompress_unknownoutputsize(src, __slen, dst, &tmp_len);
-   if (err < 0)
-   return -EINVAL;
+   if (out_len < 0)
+   return out_len;
 
-   *dlen = tmp_len;
-   return err;
+   *dlen = out_len;
+   return 0;
 }
 
 static int lz4hc_sdecompress(struct crypto_scomp *tfm, const u8 *src,
diff --git a/crypto/testmgr.h b/crypto/testmgr.h
index 9b656be..98d4be0 100644
--- a/crypto/testmgr.h
+++ b/crypto/testmgr.h
@@ -34498,31 +34498,62 @@ static struct hash_testvec bfin_crc_tv_template[] = {
 
 static struct comp_testvec lz4_comp_tv_template[] = {
{
-   .inlen  = 70,
-   .outlen = 45,
-   .input  = "Join us now and share the software "
- "Join us now and share the software ",
-   .output = "\xf0\x10\x4a\x6f\x69\x6e\x20\x75"
- "\x73\x20\x6e\x6f\x77\x20\x61\x6e"
- "\x64\x20\x73\x68\x61\x72\x65\x20"
- "\x74\x68\x65\x20\x73\x6f\x66\x74"
- "\x77\x0d\x00\x0f\x23\x00\x0b\x50"
- "\x77\x61\x72\x65\x20",
+   .inlen  = 255,
+   .outlen = 218,
+   .input  = "LZ4 is lossless compression algorithm, providing"
+" compression speed at 400 MB/s per core, scalable "
+"with multi-cores CPU. It features an extremely fast "
+"decoder, with speed in multiple GB/s per core, "
+"typically reaching RAM speed limits on multi-core "
+"systems.",
+   .output = "\xf9\x21\x4c\x5a\x34\x20\x69\x73\x20\x6c\x6f\x73\x73"
+ "\x6c\x65\x73\x73\x20\x63\x6f\x6d\x70\x72\x65\x73\x73"
+ "\x69\x6f\x6e\x20\x61\x6c\x67\x6f\x72\x69\x74\x68\x6d"
+ "\x2c\x20\x70\x72\x6f\x76\x69\x64\x69\x6e\x67\x21\x00"
+ "\xf0\x21\x73\x70\x65\x65\x64\x20\x61\x74\x20\x34\x30"
+ 

[PATCH v7 2/5] lib/decompress_unlz4: Change module to work with new LZ4 module version

2017-02-05 Thread Sven Schmidt
This patch updates the unlz4 wrapper to work with the
updated LZ4 kernel module version.

Signed-off-by: Sven Schmidt <4ssch...@informatik.uni-hamburg.de>
---
 lib/decompress_unlz4.c | 13 -
 1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/lib/decompress_unlz4.c b/lib/decompress_unlz4.c
index 036fc88..1b0baf3 100644
--- a/lib/decompress_unlz4.c
+++ b/lib/decompress_unlz4.c
@@ -72,7 +72,7 @@ STATIC inline int INIT unlz4(u8 *input, long in_len,
error("NULL input pointer and missing fill function");
goto exit_1;
} else {
-   inp = large_malloc(lz4_compressbound(uncomp_chunksize));
+   inp = large_malloc(LZ4_compressBound(uncomp_chunksize));
if (!inp) {
error("Could not allocate input buffer");
goto exit_1;
@@ -136,7 +136,7 @@ STATIC inline int INIT unlz4(u8 *input, long in_len,
inp += 4;
size -= 4;
} else {
-   if (chunksize > lz4_compressbound(uncomp_chunksize)) {
+   if (chunksize > LZ4_compressBound(uncomp_chunksize)) {
error("chunk length is longer than allocated");
goto exit_2;
}
@@ -152,11 +152,14 @@ STATIC inline int INIT unlz4(u8 *input, long in_len,
out_len -= dest_len;
} else
dest_len = out_len;
-   ret = lz4_decompress(inp, &chunksize, outp, dest_len);
+
+   ret = LZ4_decompress_fast(inp, outp, dest_len);
+   chunksize = ret;
 #else
dest_len = uncomp_chunksize;
-   ret = lz4_decompress_unknownoutputsize(inp, chunksize, outp,
-   &dest_len);
+
+   ret = LZ4_decompress_safe(inp, outp, chunksize, dest_len);
+   dest_len = ret;
 #endif
if (ret < 0) {
error("Decoding failed");
-- 
2.1.4
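
(A side note on the two #ifdef branches above, for anyone comparing
them: they trust different sizes. As I understand the updated LZ4 API:

        /* original size known up front; return value is the number
         * of bytes READ from inp; not safe on untrusted input */
        ret = LZ4_decompress_fast(inp, outp, dest_len);

        /* compressed size known, output bounded by dest_len; return
         * value is the number of bytes WRITTEN to outp; safe on
         * untrusted input */
        ret = LZ4_decompress_safe(inp, outp, chunksize, dest_len);

which is why the patch assigns the return value to chunksize in the
first branch and to dest_len in the second.)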



[PATCH v7 4/5] fs/pstore: fs/squashfs: Change usage of LZ4 to work with new LZ4 version

2017-02-05 Thread Sven Schmidt
This patch updates fs/pstore and fs/squashfs to use the updated
functions from the new LZ4 module.

Signed-off-by: Sven Schmidt <4ssch...@informatik.uni-hamburg.de>
---
 fs/pstore/platform.c  | 22 +-
 fs/squashfs/lz4_wrapper.c | 12 ++--
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/fs/pstore/platform.c b/fs/pstore/platform.c
index 729677e..efab7b6 100644
--- a/fs/pstore/platform.c
+++ b/fs/pstore/platform.c
@@ -342,31 +342,35 @@ static int compress_lz4(const void *in, void *out, size_t inlen, size_t outlen)
 {
int ret;
 
-   ret = lz4_compress(in, inlen, out, &outlen, workspace);
-   if (ret) {
-   pr_err("lz4_compress error, ret = %d!\n", ret);
+   ret = LZ4_compress_default(in, out, inlen, outlen, workspace);
+   if (!ret) {
+   pr_err("LZ4_compress_default error; compression failed!\n");
return -EIO;
}
 
-   return outlen;
+   return ret;
 }
 
 static int decompress_lz4(void *in, void *out, size_t inlen, size_t outlen)
 {
int ret;
 
-   ret = lz4_decompress_unknownoutputsize(in, inlen, out, &outlen);
-   if (ret) {
-   pr_err("lz4_decompress error, ret = %d!\n", ret);
+   ret = LZ4_decompress_safe(in, out, inlen, outlen);
+   if (ret < 0) {
+   /*
+* LZ4_decompress_safe will return an error code
+* (< 0) if decompression failed
+*/
+   pr_err("LZ4_decompress_safe error, ret = %d!\n", ret);
return -EIO;
}
 
-   return outlen;
+   return ret;
 }
 
 static void allocate_lz4(void)
 {
-   big_oops_buf_sz = lz4_compressbound(psinfo->bufsize);
+   big_oops_buf_sz = LZ4_compressBound(psinfo->bufsize);
big_oops_buf = kmalloc(big_oops_buf_sz, GFP_KERNEL);
if (big_oops_buf) {
workspace = kmalloc(LZ4_MEM_COMPRESS, GFP_KERNEL);
diff --git a/fs/squashfs/lz4_wrapper.c b/fs/squashfs/lz4_wrapper.c
index ff4468b..95da653 100644
--- a/fs/squashfs/lz4_wrapper.c
+++ b/fs/squashfs/lz4_wrapper.c
@@ -97,7 +97,6 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
struct squashfs_lz4 *stream = strm;
void *buff = stream->input, *data;
int avail, i, bytes = length, res;
-   size_t dest_len = output->length;
 
for (i = 0; i < b; i++) {
avail = min(bytes, msblk->devblksize - offset);
@@ -108,12 +107,13 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
put_bh(bh[i]);
}
 
-   res = lz4_decompress_unknownoutputsize(stream->input, length,
-   stream->output, &dest_len);
-   if (res)
+   res = LZ4_decompress_safe(stream->input, stream->output,
+   length, output->length);
+
+   if (res < 0)
return -EIO;
 
-   bytes = dest_len;
+   bytes = res;
data = squashfs_first_page(output);
buff = stream->output;
while (data) {
@@ -128,7 +128,7 @@ static int lz4_uncompress(struct squashfs_sb_info *msblk, void *strm,
}
squashfs_finish_page(output);
 
-   return dest_len;
+   return res;
 }
 
 const struct squashfs_decompressor squashfs_lz4_comp_ops = {
-- 
2.1.4



[PATCH v7 0/5] Update LZ4 compressor module

2017-02-05 Thread Sven Schmidt

This patchset updates the LZ4 compression module to a version based
on LZ4 v1.7.3, allowing use of the fast compression algorithm, aka
LZ4 fast, which provides an "acceleration" parameter as a tradeoff
between high compression ratio and high compression speed.
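
For illustration, this is how the acceleration parameter appears in
the updated API (a minimal sketch; the buffers, sizes and the chosen
acceleration value are placeholders, not part of the patchset):

        /* higher acceleration trades compression ratio for speed;
         * acceleration == 1 behaves like LZ4_compress_default() */
        int out_len = LZ4_compress_fast(src, dst, src_len, dst_capacity,
                                        17 /* acceleration */, wrkmem);

        if (!out_len)
                /* 0 means compression failed / output did not fit */
                return -EINVAL;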

We want to use LZ4 fast in order to support compression in lustre
and (mostly, based on that) to investigate data reduction techniques
on behalf of storage systems.

Also, it will be useful for other users of LZ4 compression, as LZ4 fast
makes it possible for applications to use fast and/or high compression
depending on the use case.
For instance, ZRAM offers an LZ4 backend and could benefit from an
updated LZ4 in the kernel.

LZ4 homepage: http://www.lz4.org/
LZ4 source repository: https://github.com/lz4/lz4
Source version: 1.7.3

Benchmark (taken from [1], Core i5-4300U @1.9GHz):
Compressor  | Compression | Decompression | Ratio
------------|-------------|---------------|------
memcpy      |  4200 MB/s  |   4200 MB/s   | 1.000
LZ4 fast 50 |  1080 MB/s  |   2650 MB/s   | 1.375
LZ4 fast 17 |   680 MB/s  |   2220 MB/s   | 1.607
LZ4 fast 5  |   475 MB/s  |   1920 MB/s   | 1.886
LZ4 default |   385 MB/s  |   1850 MB/s   | 2.101

[1] http://fastcompression.blogspot.de/2015/04/sampling-or-faster-lz4.html

[PATCH 1/5] lib: Update LZ4 compressor module
[PATCH 2/5] lib/decompress_unlz4: Change module to work with new LZ4 module version
[PATCH 3/5] crypto: Change LZ4 modules to work with new LZ4 module version
[PATCH 4/5] fs/pstore: fs/squashfs: Change usage of LZ4 to work with new LZ4 version
[PATCH 5/5] lib/lz4: Remove back-compat wrappers

Changes:
v7:
- Fixed errors reported by the Smatch tool
- Changed function documentation comments in lz4.h to match kernel-doc style
- Fixed a misbehaviour of LZ4HC caused by the wrong level of indentation
  of two for loops, introduced when I refactored the code style using
  checkpatch.pl (upstream LZ4 puts dozens of statements on a single line)
- Updated the crypto tests for LZ4 since they failed with the new code,
  which in turn made zram fail to allocate memory for LZ4

v6:
- Fixed LZ4_NBCOMMONBYTES() for 64-bit little endian
- Reset LZ4_MEMORY_USAGE to 14 (which is the value used in
  upstream LZ4 as well as the previous kernel module)
- Fixed that weird double-indentation in lz4defs.h and lz4.h
- Adjusted general styling issues in lz4defs.h
  (e.g. lines consisting of more than one instruction)
- Removed the architecture-dependent typedef to reg_t
  since upstream LZ4 is just using size_t and that works fine
- Changed error messages in pstore/platform.c:
  * LZ4_compress_default always returns 0 in case of an error
(no need to print the return value)
  * LZ4_decompress_safe returns a negative error message
(return value _does_ matter)

v5:
- Added a fifth patch to remove the back-compat wrappers introduced
  to ensure bisectability between the patches (the functions are no
  longer needed since there are no callers left)

v4:
- Fixed kbuild errors
- Re-added lz4_compressbound as alias for LZ4_compressBound
  to ensure backwards compatibility
- Wrapped LZ4_hash5 with check for LZ4_ARCH64 since it is only used there
  and triggers an unused function warning when false

v3:
- Adjusted the code to satisfy kernel coding style (checkpatch.pl)
- Made sure the changes to LZ4 in Kernel (overflow checks etc.)
  are included in the new module (they are)
- Removed the second LZ4_compressBound function with related name but
  different return type
- Corrected version number (was LZ4 1.7.3)
- Added missing LZ4 streaming functions

v2:
- Changed the order of the patches, since in the initial patchset lz4.h
  was in the last patch but was referenced by the other ones
- Split lib/decompress_unlz4.c in an own patch
- Fixed errors reported by the buildbot
- Further refactorings
- Added more appropriate copyright note to include/linux/lz4.h


[PATCH v3] crypto: algapi - make crypto_xor() and crypto_inc() alignment agnostic

2017-02-05 Thread Ard Biesheuvel
Instead of unconditionally forcing 4 byte alignment for all generic
chaining modes that rely on crypto_xor() or crypto_inc() (which may
result in unnecessary copying of data when the underlying hardware
can perform unaligned accesses efficiently), make those functions
deal with unaligned input explicitly, but only if the Kconfig symbol
HAVE_EFFICIENT_UNALIGNED_ACCESS is set. This will allow us to drop
the alignmasks from the CBC, CMAC, CTR, CTS, PCBC and SEQIV drivers.

For crypto_inc(), this simply involves making the 4-byte stride
conditional on HAVE_EFFICIENT_UNALIGNED_ACCESS being set, given that
it typically operates on 16 byte buffers.

For crypto_xor(), an algorithm is implemented that simply runs through
the input using the largest strides possible if unaligned accesses are
allowed. If they are not, an optimal sequence of memory accesses is
emitted that takes the relative alignment of the input buffers into
account, e.g., if the relative misalignment of dst and src is 4 bytes,
the entire xor operation will be completed using 4 byte loads and stores
(modulo unaligned bits at the start and end). Note that all expressions
involving relalign are simply eliminated by the compiler when
HAVE_EFFICIENT_UNALIGNED_ACCESS is defined.
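
A worked example of the 4-byte case (illustration only, not part of
the patch): take dst = 0x1000 and src = 0x1004 with an 8-byte word
size. Then d = (0x1000 ^ 0x1004) & 7 = 4, so relalign = 1 << __ffs(4)
= 4. dst is already 4-byte aligned, so the leading byte loop does
nothing; the 8-byte loop is skipped because relalign & 7 != 0, and
the entire xor runs in the 4-byte loop.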

Signed-off-by: Ard Biesheuvel 
---
v3: fix thinko in processing of unaligned leading chunk
inline common case where the input size is a constant multiple of the word
size on architectures with h/w handling of unaligned accesses

 crypto/algapi.c | 68 ++--
 crypto/cbc.c|  3 -
 crypto/cmac.c   |  3 +-
 crypto/ctr.c|  2 +-
 crypto/cts.c|  3 -
 crypto/pcbc.c   |  3 -
 crypto/seqiv.c  |  2 -
 include/crypto/algapi.h | 20 +-
 8 files changed, 70 insertions(+), 34 deletions(-)

diff --git a/crypto/algapi.c b/crypto/algapi.c
index 1fad2a6b3bbb..6b52e8f0b95f 100644
--- a/crypto/algapi.c
+++ b/crypto/algapi.c
@@ -962,34 +962,66 @@ void crypto_inc(u8 *a, unsigned int size)
__be32 *b = (__be32 *)(a + size);
u32 c;
 
-   for (; size >= 4; size -= 4) {
-   c = be32_to_cpu(*--b) + 1;
-   *b = cpu_to_be32(c);
-   if (c)
-   return;
-   }
+   if (IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) ||
+   !((unsigned long)b & (__alignof__(*b) - 1)))
+   for (; size >= 4; size -= 4) {
+   c = be32_to_cpu(*--b) + 1;
+   *b = cpu_to_be32(c);
+   if (c)
+   return;
+   }
 
crypto_inc_byte(a, size);
 }
 EXPORT_SYMBOL_GPL(crypto_inc);
 
-static inline void crypto_xor_byte(u8 *a, const u8 *b, unsigned int size)
+void __crypto_xor(u8 *dst, const u8 *src, unsigned int len)
 {
-   for (; size; size--)
-   *a++ ^= *b++;
-}
+   int relalign = 0;
+
+   if (!IS_ENABLED(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)) {
+   int size = sizeof(unsigned long);
+   int d = ((unsigned long)dst ^ (unsigned long)src) & (size - 1);
+
+   relalign = d ? 1 << __ffs(d) : size;
+
+   /*
+* If we care about alignment, process as many bytes as
+* needed to advance dst and src to values whose alignments
+* equal their relative alignment. This will allow us to
+* process the remainder of the input using optimal strides.
+*/
+   while (((unsigned long)dst & (relalign - 1)) && len > 0) {
+   *dst++ ^= *src++;
+   len--;
+   }
+   }
 
-void crypto_xor(u8 *dst, const u8 *src, unsigned int size)
-{
-   u32 *a = (u32 *)dst;
-   u32 *b = (u32 *)src;
+   while (IS_ENABLED(CONFIG_64BIT) && len >= 8 && !(relalign & 7)) {
+   *(u64 *)dst ^= *(u64 *)src;
+   dst += 8;
+   src += 8;
+   len -= 8;
+   }
 
-   for (; size >= 4; size -= 4)
-   *a++ ^= *b++;
+   while (len >= 4 && !(relalign & 3)) {
+   *(u32 *)dst ^= *(u32 *)src;
+   dst += 4;
+   src += 4;
+   len -= 4;
+   }
+
+   while (len >= 2 && !(relalign & 1)) {
+   *(u16 *)dst ^= *(u16 *)src;
+   dst += 2;
+   src += 2;
+   len -= 2;
+   }
 
-   crypto_xor_byte((u8 *)a, (u8 *)b, size);
+   while (len--)
+   *dst++ ^= *src++;
 }
-EXPORT_SYMBOL_GPL(crypto_xor);
+EXPORT_SYMBOL_GPL(__crypto_xor);
 
 unsigned int crypto_alg_extsize(struct crypto_alg *alg)
 {
diff --git a/crypto/cbc.c b/crypto/cbc.c
index 68f751a41a84..bc160a3186dc 100644
--- a/crypto/cbc.c
+++ b/crypto/cbc.c
@@ -145,9 +145,6 @@ static int crypto_cbc_create(struct crypto_template *tmpl, 
struct rtattr **tb)
inst->alg.base.cra_blocksize = alg->cra_blocksize;

Re: [PATCH v2 4/6] crypto: ecdsa: add ECDSA SW implementation

2017-02-05 Thread Stephan Müller
On Fri, Feb 3, 2017 at 16:42:53 CET, Nitin Kumbhar wrote:

Hi Nitin,

> +
> +int ecdsa_set_pub_key(struct crypto_akcipher *tfm, const void *key,
> +   unsigned int keylen)
> +{
> + struct ecdsa_ctx *ctx = ecdsa_get_ctx(tfm);
> + struct ecdsa params;
> + unsigned int ndigits;
> + unsigned int nbytes;
> + u8 *params_qx, *params_qy;
> + u64 *ctx_qx, *ctx_qy;
> + int err = 0;
> +
> + if (crypto_ecdsa_parse_pub_key(key, keylen, &params))
> + return -EINVAL;
> +
> + ndigits = ecdsa_supported_curve(params.curve_id);
> + if (!ndigits)
> + return -EINVAL;
> +
> + err = ecc_is_pub_key_valid(params.curve_id, ndigits,
> +params.key, params.key_size);
> + if (err)
> + return err;
> +
> + ctx->curve_id = params.curve_id;
> + ctx->ndigits = ndigits;
> + nbytes = ndigits << ECC_DIGITS_TO_BYTES_SHIFT;
> +
> + params_qx = params.key;
> + params_qy = params_qx + ECC_MAX_DIGIT_BYTES;
> +
> + ctx_qx = ctx->public_key;
> + ctx_qy = ctx_qx + ECC_MAX_DIGITS;
> +
> + vli_copy_from_buf(ctx_qx, ndigits, params_qx, nbytes);
> + vli_copy_from_buf(ctx_qy, ndigits, params_qy, nbytes);
> +
> + memset(&params, 0, sizeof(params));
> + return 0;
> +}
> +
> +int ecdsa_set_priv_key(struct crypto_akcipher *tfm, const void *key,
> +unsigned int keylen)
> +{
> + struct ecdsa_ctx *ctx = ecdsa_get_ctx(tfm);
> + struct ecdsa params;
> + unsigned int ndigits;
> + unsigned int nbytes;
> +
> + if (crypto_ecdsa_parse_priv_key(key, keylen, &params))
> + return -EINVAL;
> +
> + ndigits = ecdsa_supported_curve(params.curve_id);
> + if (!ndigits)
> + return -EINVAL;
> +
> + ctx->curve_id = params.curve_id;
> + ctx->ndigits = ndigits;
> + nbytes = ndigits << ECC_DIGITS_TO_BYTES_SHIFT;
> +
> + if (ecc_is_key_valid(ctx->curve_id, ctx->ndigits,
> +  (const u8 *)params.key, params.key_size) < 0)
> + return -EINVAL;
> +
> + vli_copy_from_buf(ctx->private_key, ndigits, params.key, nbytes);
> +
> + memset(&params, 0, sizeof(params));

Please use memzero_explicit() as otherwise this memset may be optimized
away. I think it could be used for set_pub_key too, but there we do not
have sensitive data and thus it would not be strictly needed.
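
I.e., a minimal sketch of what I mean:

        /* unlike a plain memset() of a local that is about to go out
         * of scope, memzero_explicit() cannot be elided by the
         * compiler's dead-store elimination */
        memzero_explicit(&params, sizeof(params));

This matters here because 'params' holds a copy of the private key.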

> + return 0;
> +}


Ciao
Stephan