Re: [PATCH 2/2] [v2] crypto: sha1: add ARM NEON implementation

2014-06-30 Thread Ard Biesheuvel
On 29 June 2014 16:33, Jussi Kivilinna jussi.kivili...@iki.fi wrote:
 This patch adds ARM NEON assembly implementation of SHA-1 algorithm.

 tcrypt benchmark results on Cortex-A8, sha1-arm-asm vs sha1-neon-asm:

 block-size  bytes/update  old-vs-new
 16          16            1.04x
 64          16            1.02x
 64          64            1.05x
 256         16            1.03x
 256         64            1.04x
 256         256           1.30x
 1024        16            1.03x
 1024        256           1.36x
 1024        1024          1.52x
 2048        16            1.03x
 2048        256           1.39x
 2048        1024          1.55x
 2048        2048          1.59x
 4096        16            1.03x
 4096        256           1.40x
 4096        1024          1.57x
 4096        4096          1.62x
 8192        16            1.03x
 8192        256           1.40x
 8192        1024          1.58x
 8192        4096          1.63x
 8192        8192          1.63x

 Changes in v2:
  - Use ENTRY/ENDPROC
  - Don't provide Thumb2 version
  - Move constants to .text section
  - Further tweaks to implementation for ~10% speed-up.


Please move the changelog to below the '---' so it doesn't end up in
the kernel commit log.

 Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi

Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org

Tested on Exynos-5250 (Cortex-A15)

ARM asm
=======
[ 1478.699012] testing speed of sha1
[ 1478.699040] test  0 (   16 byte blocks,   16 bytes per update,   1
updates): 873594 opers/sec,  13977514 bytes/sec
[ 1481.694959] test  1 (   64 byte blocks,   16 bytes per update,   4
updates): 386415 opers/sec,  24730581 bytes/sec
[ 1484.694958] test  2 (   64 byte blocks,   64 bytes per update,   1
updates): 543196 opers/sec,  34764586 bytes/sec
[ 1487.694959] test  3 (  256 byte blocks,   16 bytes per update,  16
updates): 141109 opers/sec,  36123989 bytes/sec
[ 1490.694959] test  4 (  256 byte blocks,   64 bytes per update,   4
updates): 218391 opers/sec,  55908266 bytes/sec
[ 1493.694958] test  5 (  256 byte blocks,  256 bytes per update,   1
updates): 256225 opers/sec,  65593685 bytes/sec
[ 1496.694959] test  6 ( 1024 byte blocks,   16 bytes per update,  64
updates):  39845 opers/sec,  40801280 bytes/sec
[ 1499.694973] test  7 ( 1024 byte blocks,  256 bytes per update,   4
updates):  78594 opers/sec,  80480597 bytes/sec
[ 1502.694966] test  8 ( 1024 byte blocks, 1024 bytes per update,   1
updates):  83790 opers/sec,  85801642 bytes/sec
[ 1505.694966] test  9 ( 2048 byte blocks,   16 bytes per update, 128
updates):  20204 opers/sec,  41379157 bytes/sec
[ 1508.694989] test 10 ( 2048 byte blocks,  256 bytes per update,   8
updates):  41075 opers/sec,  84121600 bytes/sec
[ 1511.694979] test 11 ( 2048 byte blocks, 1024 bytes per update,   2
updates):  43358 opers/sec,  88797184 bytes/sec
[ 1514.694960] test 12 ( 2048 byte blocks, 2048 bytes per update,   1
updates):  44168 opers/sec,  90457429 bytes/sec
[ 1517.694968] test 13 ( 4096 byte blocks,   16 bytes per update, 256
updates):  10331 opers/sec,  42315776 bytes/sec
[ 1520.694967] test 14 ( 4096 byte blocks,  256 bytes per update,  16
updates):  21004 opers/sec,  86032384 bytes/sec
[ 1523.694955] test 15 ( 4096 byte blocks, 1024 bytes per update,   4
updates):  22193 opers/sec,  90903893 bytes/sec
[ 1526.694989] test 16 ( 4096 byte blocks, 4096 bytes per update,   1
updates):  22671 opers/sec,  92860416 bytes/sec
[ 1529.695000] test 17 ( 8192 byte blocks,   16 bytes per update, 512
updates):   5192 opers/sec,  42538325 bytes/sec
[ 1532.695110] test 18 ( 8192 byte blocks,  256 bytes per update,  32
updates):  10628 opers/sec,  87067306 bytes/sec
[ 1535.695015] test 19 ( 8192 byte blocks, 1024 bytes per update,   8
updates):  11233 opers/sec,  92026197 bytes/sec
[ 1538.694997] test 20 ( 8192 byte blocks, 4096 bytes per update,   2
updates):  11393 opers/sec,  93334186 bytes/sec
[ 1541.694980] test 21 ( 8192 byte blocks, 8192 bytes per update,   1
updates):  11427 opers/sec,  93615445 bytes/sec

ARM neon
========
[ 1582.519068] testing speed of sha1
[ 1582.519097] test  0 (   16 byte blocks,   16 bytes per update,   1
updates): 900970 opers/sec,  14415520 bytes/sec
[ 1585.514959] test  1 (   64 byte blocks,   16 bytes per update,   4
updates): 406465 opers/sec,  26013802 bytes/sec
[ 1588.514961] test  2 (   64 byte blocks,   64 bytes per update,   1
updates): 579712 opers/sec,  37101610 bytes/sec
[ 1591.514958] test  3 (  256 byte blocks,   16 bytes per update,  16
updates): 139189 opers/sec,  35632554 bytes/sec
[ 1594.514964] test  4 (  256 byte blocks,   64 bytes per update,   4
updates): 234671 opers/sec,  60075861 bytes/sec
[ 1597.514960] test  5 (  256 byte blocks,  256 bytes per update,   1
updates): 347872 opers/sec,  89055402 bytes/sec
[ 1600.514959] 
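
The relative gain on the Cortex-A15 can be read off the opers/sec columns of the two logs above. A minimal sketch of that arithmetic, using only figures quoted in the logs; the helper name `speedup` is hypothetical and not part of tcrypt:

```c
#include <assert.h>

/*
 * Ratio of new to old throughput. Both arguments are the "opers/sec"
 * values printed by the benchmark for the same test number.
 */
static double speedup(unsigned long old_ops, unsigned long new_ops)
{
	assert(old_ops > 0);
	return (double)new_ops / (double)old_ops;
}
```

For test 5 (256-byte blocks, 256 bytes per update), `speedup(256225, 347872)` is roughly 1.36, while test 3 (256-byte blocks, 16 bytes per update) gives `speedup(141109, 139189)`, slightly below 1.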

Re: [PATCH] [v2] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Ard Biesheuvel
On 29 June 2014 16:34, Jussi Kivilinna jussi.kivili...@iki.fi wrote:
 This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384
 algorithms.

 tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

 block-size  bytes/update  old-vs-new
 16          16            2.99x
 64          16            2.67x
 64          64            3.00x
 256         16            2.64x
 256         64            3.06x
 256         256           3.33x
 1024        16            2.53x
 1024        256           3.39x
 1024        1024          3.52x
 2048        16            2.50x
 2048        256           3.41x
 2048        1024          3.54x
 2048        2048          3.57x
 4096        16            2.49x
 4096        256           3.42x
 4096        1024          3.56x
 4096        4096          3.59x
 8192        16            2.48x
 8192        256           3.42x
 8192        1024          3.56x
 8192        4096          3.60x
 8192        8192          3.60x


Nice speedup!

 Changes in v2:
  - Use ENTRY/ENDPROC
  - Don't provide Thumb2 version


Please move Changelog below '---'

 Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi

Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org

Tested on Exynos-5250 (Cortex-A15)

ARM-asm
=======
[ 1715.164122] testing speed of sha512
[ 1715.164150] test  0 (   16 byte blocks,   16 bytes per update,   1
updates): 136277 opers/sec,   2180437 bytes/sec
[ 1718.159959] test  1 (   64 byte blocks,   16 bytes per update,   4
updates): 126636 opers/sec,   8104746 bytes/sec
[ 1721.159962] test  2 (   64 byte blocks,   64 bytes per update,   1
updates): 136605 opers/sec,   8742720 bytes/sec
[ 1724.159958] test  3 (  256 byte blocks,   16 bytes per update,  16
updates):  41576 opers/sec,  10643541 bytes/sec
[ 1727.159957] test  4 (  256 byte blocks,   64 bytes per update,   4
updates):  45984 opers/sec,  11771989 bytes/sec
[ 1730.159959] test  5 (  256 byte blocks,  256 bytes per update,   1
updates):  47479 opers/sec,  12154794 bytes/sec
[ 1733.159977] test  6 ( 1024 byte blocks,   16 bytes per update,  64
updates):  13410 opers/sec,  13731840 bytes/sec
[ 1736.160027] test  7 ( 1024 byte blocks,  256 bytes per update,   4
updates):  15916 opers/sec,  16298325 bytes/sec
[ 1739.159975] test  8 ( 1024 byte blocks, 1024 bytes per update,   1
updates):  16095 opers/sec,  16481280 bytes/sec
[ 1742.159993] test  9 ( 2048 byte blocks,   16 bytes per update, 128
updates):   7042 opers/sec,  14423381 bytes/sec
[ 1745.159994] test 10 ( 2048 byte blocks,  256 bytes per update,   8
updates):   8438 opers/sec,  17281024 bytes/sec
[ 1748.159995] test 11 ( 2048 byte blocks, 1024 bytes per update,   2
updates):   8541 opers/sec,  17492650 bytes/sec
[ 1751.160001] test 12 ( 2048 byte blocks, 2048 bytes per update,   1
updates):   8560 opers/sec,  17531562 bytes/sec
[ 1754.159975] test 13 ( 4096 byte blocks,   16 bytes per update, 256
updates):   3612 opers/sec,  14794752 bytes/sec
[ 1757.160103] test 14 ( 4096 byte blocks,  256 bytes per update,  16
updates):   4350 opers/sec,  17820330 bytes/sec
[ 1760.160122] test 15 ( 4096 byte blocks, 1024 bytes per update,   4
updates):   4405 opers/sec,  18042880 bytes/sec
[ 1763.159957] test 16 ( 4096 byte blocks, 4096 bytes per update,   1
updates):   4463 opers/sec,  18280448 bytes/sec
[ 1766.160049] test 17 ( 8192 byte blocks,   16 bytes per update, 512
updates):   1829 opers/sec,  14988629 bytes/sec
[ 1769.160328] test 18 ( 8192 byte blocks,  256 bytes per update,  32
updates):   2209 opers/sec,  18101589 bytes/sec
[ 1772.160318] test 19 ( 8192 byte blocks, 1024 bytes per update,   8
updates):   2238 opers/sec,  18333696 bytes/sec
[ 1775.160278] test 20 ( 8192 byte blocks, 4096 bytes per update,   2
updates):   2245 opers/sec,  18393770 bytes/sec
[ 1778.160025] test 21 ( 8192 byte blocks, 8192 bytes per update,   1
updates):   2267 opers/sec,  18576725 bytes/sec

ARM-neon
========
[ 1810.729100] testing speed of sha512
[ 1810.729130] test  0 (   16 byte blocks,   16 bytes per update,   1
updates): 330941 opers/sec,   5295066 bytes/sec
[ 1813.724958] test  1 (   64 byte blocks,   16 bytes per update,   4
updates): 277607 opers/sec,  17766890 bytes/sec
[ 1816.724958] test  2 (   64 byte blocks,   64 bytes per update,   1
updates): 330251 opers/sec,  21136085 bytes/sec
[ 1819.724956] test  3 (  256 byte blocks,   16 bytes per update,  16
updates):  89849 opers/sec,  23001429 bytes/sec
[ 1822.724961] test  4 (  256 byte blocks,   64 bytes per update,   4
updates): 113344 opers/sec,  29016149 bytes/sec
[ 1825.724963] test  5 (  256 byte blocks,  256 bytes per update,   1
updates): 127466 opers/sec,  32631381 bytes/sec
[ 1828.724960] test  6 ( 1024 byte blocks,   16 bytes per update,  64
updates):  27818 opers/sec,  28485632 bytes/sec
[ 

Re: [crypto] BUG: unable to handle kernel paging request at ffff88000bb88000

2014-06-30 Thread Stephan Mueller
On Monday, 30 June 2014, 13:31:26, Fengguang Wu wrote:

Hi Fengguang,

Hi Stephan,

On Sun, Jun 29, 2014 at 09:45:48PM +0200, Stephan Mueller wrote:
 On Sunday, 29 June 2014, 22:52:46, Fengguang Wu wrote:
 
 Hi Fengguang,
 
  Greetings,
  
  0day kernel testing robot got the below dmesg and the first bad
  commit is 
 May I ask whether there is anything special in your kernel config?

It's an x86_64 randconfig. You may find it in the attachment of the
original report email.

Thanks, I used that config. I was just wondering whether there were some
special config options that changed the memory allocation mechanism. The
kernel configs I used never triggered the issue, although they should
have.

I ran stress tests months ago (with the bug present) where I invoked the 
DRBG for one day, causing billions of rounds of RNG operation where each 
round should have triggered the bug.

 This very bug should have been triggered already in all previous code
 levels! I am seriously wondering why this bug was not triggered
 before -- does kmalloc somehow allocate more memory than you
 requested? And did only your specific kernel config make kmalloc
 allocate the exact amount of memory that was requested?

Yeah the bug may have been triggered in other places. If you see
anything valuable from this bisect result, it would be great. Judging
from the comparison of 64d1cdfbe2 and its parent commit 3332ee2a17,
it's pretty reproducible, so easy to verify the possible fixes.

Well, it is not as reproducible as you may think. And as far as I can
see, the other oops that you sent was caused by the same issue.

When I was debugging the issue and just adding some printk statements,
the crash went away (reliably) or occurred at some other random
place. It was very bizarre. But after adding my fix, I did not see any
crash any more.

Ciao
Stephan
--
To unsubscribe from this list: send the line unsubscribe linux-crypto in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
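
One possible reason an out-of-bounds write can stay silent, as speculated in the message above, is that slab-style allocators hand out objects from fixed size classes, so a request is often backed by more memory than was asked for. The sketch below is a userspace model under a stated simplification: it assumes power-of-two classes starting at 8 bytes, which is not the exact class list of any particular kernel configuration. The names `size_class` and `slack` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Simplified model: round a request up to the next power-of-two size
 * class, with a minimum class of 8 bytes. Real allocators use a richer
 * set of classes and may add debugging red zones.
 */
static size_t size_class(size_t req)
{
	size_t cls = 8;

	assert(req > 0 && req <= ((size_t)1 << 30));
	while (cls < req)
		cls <<= 1;
	return cls;
}

/* Bytes beyond the requested size that still belong to the same object. */
static size_t slack(size_t req)
{
	return size_class(req) - req;
}
```

In this model a 24-byte request has 8 bytes of slack, so a small overrun lands in unused space, whereas a 64-byte request has none and the same overrun corrupts a neighbouring object.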


Re: [PATCH v1 1/3] ima: use ahash API for file hash calculation

2014-06-30 Thread Dmitry Kasatkin
On 26/06/14 14:54, Mimi Zohar wrote:
 On Thu, 2014-06-19 at 18:20 +0300, Dmitry Kasatkin wrote:
 The async hash API allows HW acceleration to be used for hash calculation.
 It may give a significant performance gain and/or reduce power consumption,
 which might be very beneficial for battery-powered devices.

 This patch introduces hash calculation using the ahash API.

 ahash performance depends on the data size and the particular HW. Under a
 certain limit, depending on the system, shash performance may be better.

 This patch also introduces the 'ima_ahash_size' kernel parameter, which can
 be used to define the minimal data size to use with ahash. When this
 parameter is not set or the file size is smaller than defined by this
 parameter, shash will be used. Thus, by default, the original shash
 implementation is used.

 Signed-off-by: Dmitry Kasatkin d.kasat...@samsung.com
 ---
  Documentation/kernel-parameters.txt |   3 +
  security/integrity/ima/ima_crypto.c | 182 
 +++-
  2 files changed, 181 insertions(+), 4 deletions(-)

 diff --git a/Documentation/kernel-parameters.txt 
 b/Documentation/kernel-parameters.txt
 index a0c155c..f8efb01 100644
 --- a/Documentation/kernel-parameters.txt
 +++ b/Documentation/kernel-parameters.txt
 @@ -1286,6 +1286,9 @@ bytes respectively. Such letter suffixes can also be 
 entirely omitted.
  ihash_entries=  [KNL]
  Set number of hash buckets for inode cache.

 +ima_ahash_size=size [IMA]
 +Set the minimal file size when use ahash API.
 +
  ima_appraise=   [IMA] appraise integrity measurements
  Format: { off | enforce | fix }
  default: enforce
 diff --git a/security/integrity/ima/ima_crypto.c 
 b/security/integrity/ima/ima_crypto.c
 index ccd0ac8..b7a8650 100644
 --- a/security/integrity/ima/ima_crypto.c
 +++ b/security/integrity/ima/ima_crypto.c
 @@ -25,7 +25,25 @@
  #include <crypto/hash_info.h>
  #include "ima.h"

 +
 +struct ahash_completion {
 +struct completion completion;
 +int err;
 +};
 +
  static struct crypto_shash *ima_shash_tfm;
 +static struct crypto_ahash *ima_ahash_tfm;
 +
 +/* data size for ahash use */
 +static loff_t ima_ahash_size;
 +
 +static int __init ima_ahash_setup(char *str)
 +{
 +int rc = kstrtoll(str, 10, &ima_ahash_size);
 In general, variable definitions should be separated from code.  A
 simple initialization is fine.  Please separate variable definitions
 from code with a blank line. 

 +pr_info("ima_ahash_size = %lld\n", ima_ahash_size);
 +return !rc;
 +}
 +__setup("ima_ahash_size=", ima_ahash_setup);
 This boot parameter name doesn't reflect its purpose, defining the
 minimum file size for using ahash. The next patch defines an additional
 boot parameter ima_ahash_bufsize. Perhaps defining a single boot
 parameter (eg. ima_ahash=) with multiple fields would be better. 

  /**
   * ima_kernel_read - read file content
 @@ -68,6 +86,14 @@ int ima_init_crypto(void)
 hash_algo_name[ima_hash_algo], rc);
  return rc;
  }
 +ima_ahash_tfm = crypto_alloc_ahash(hash_algo_name[ima_hash_algo], 0, 0);
 +if (IS_ERR(ima_ahash_tfm)) {
 +rc = PTR_ERR(ima_ahash_tfm);
 +crypto_free_shash(ima_shash_tfm);
 Only crypto_alloc_ahash() failed, not crypto_alloc_shash(). shash has
 worked fine up to now. Why require both shash and ahash to succeed? 


 +pr_err("Can not allocate %s (reason: %ld)\n",
 +   hash_algo_name[ima_hash_algo], rc);
 +return rc;
 +}
  return 0;
  }

 @@ -93,9 +119,143 @@ static void ima_free_tfm(struct crypto_shash *tfm)
  crypto_free_shash(tfm);
  }

 -/*
 - * Calculate the MD5/SHA1 file digest
 - */
 +static struct crypto_ahash *ima_alloc_atfm(enum hash_algo algo)
 +{
 +struct crypto_ahash *tfm = ima_ahash_tfm;
 +int rc;
 +
 +if (algo != ima_hash_algo && algo < HASH_ALGO__LAST) {
 +tfm = crypto_alloc_ahash(hash_algo_name[algo], 0, 0);
 +if (IS_ERR(tfm)) {
 +rc = PTR_ERR(tfm);
 +pr_err("Can not allocate %s (reason: %d)\n",
 +   hash_algo_name[algo], rc);
 +}
 +}
 +return tfm;
 +}
 +
 +static void ima_free_atfm(struct crypto_ahash *tfm)
 +{
 +if (tfm != ima_ahash_tfm)
 +crypto_free_ahash(tfm);
 +}
 +
 +static void ahash_complete(struct crypto_async_request *req, int err)
 +{
 +struct ahash_completion *res = req->data;
 +
 +if (err == -EINPROGRESS)
 +return;
 +res->err = err;
 +complete(&res->completion);
 +}
 +
 +static int ahash_wait(int err, struct ahash_completion *res)
 +{
 +switch (err) {
 +case 0:
 +break;
 +case -EINPROGRESS:
 +case -EBUSY:
 +wait_for_completion(&res->completion);
 +reinit_completion(&res->completion);
 +err = res->err;
 +/* fall through */
 +default:
 +

Re: [PATCH v1 1/3] ima: use ahash API for file hash calculation

2014-06-30 Thread Mimi Zohar
On Mon, 2014-06-30 at 17:58 +0300, Dmitry Kasatkin wrote: 
 On 26/06/14 14:54, Mimi Zohar wrote:
  On Thu, 2014-06-19 at 18:20 +0300, Dmitry Kasatkin wrote:

  @@ -156,7 +316,7 @@ out:
 return rc;
   }
 
  -int ima_calc_file_hash(struct file *file, struct ima_digest_data *hash)
  +static int ima_calc_file_shash(struct file *file, struct ima_digest_data 
  *hash)
   {
 struct crypto_shash *tfm;
 int rc;
  @@ -172,6 +332,20 @@ int ima_calc_file_hash(struct file *file, struct 
  ima_digest_data *hash)
 return rc;
   }
 
  +int ima_calc_file_hash(struct file *file, struct ima_digest_data *hash)
  +{
  +  loff_t i_size = i_size_read(file_inode(file));
  +
  +  /* shash is more efficient small data
  +   * ahash performance depends on data size and particular HW
  +   * ima_ahash_size allows to specify the best value for the system
  +   */
  +  if (ima_ahash_size && i_size >= ima_ahash_size)
  +  return ima_calc_file_ahash(file, hash);
  +  else
  +  return ima_calc_file_shash(file, hash);
  +}
  If calculating the file hash using ahash fails, should it fall back to
  using shash?
 
 If ahash fails, then it could be a HW error, which should not happen.
 If the HW fails, the device is broken.

I would assume it depends on the HW whether the entire device/system is
broken.

 Do you really want to fall back to shash?

Yes, in this case there is no downside to letting it continue
working, just slower, using the software crypto implementation.  In any
case, it shouldn't be hard coded.

Mimi
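
The fallback discussed above can be sketched in plain userspace C. This is an illustration of the control flow only, not the patch's code: `calc_hash`, `hash_fn`, and the two stand-in hash functions are hypothetical names, and the stand-ins merely record that they were called instead of hashing anything.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*hash_fn)(const void *data, long long size,
		       unsigned char *digest);

static int last_path;	/* 'a' or 's': which stand-in ran last */
static int hw_calls;	/* how often the hardware stand-in was tried */

/* Stand-in for a failing hardware (ahash) path; -5 mimics -EIO. */
static int broken_hw_hash(const void *data, long long size,
			  unsigned char *digest)
{
	(void)data; (void)size; (void)digest;
	last_path = 'a';
	hw_calls++;
	return -5;
}

/* Stand-in for the software (shash) path, which always succeeds here. */
static int sw_hash(const void *data, long long size, unsigned char *digest)
{
	(void)data; (void)size; (void)digest;
	last_path = 's';
	return 0;
}

/*
 * Use the asynchronous path only when a threshold is configured and the
 * input is at least that large. If that path reports an error, carry on
 * with the synchronous path instead of failing the measurement.
 */
static int calc_hash(const void *data, long long size, unsigned char *digest,
		     long long ahash_min_size, hash_fn ahash, hash_fn shash)
{
	assert(shash != NULL);
	if (ahash && ahash_min_size > 0 && size >= ahash_min_size) {
		if (ahash(data, size, digest) == 0)
			return 0;
		/* hardware path failed: fall through to software */
	}
	return shash(data, size, digest);
}
```

With a 1024-byte threshold, a 4096-byte input tries the hardware stand-in once and then succeeds through `sw_hash`; a 100-byte input, or a threshold of 0, never touches the hardware path.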



[PATCH 2/2] [v3] crypto: sha1: add ARM NEON implementation

2014-06-30 Thread Jussi Kivilinna
This patch adds ARM NEON assembly implementation of SHA-1 algorithm.

tcrypt benchmark results on Cortex-A8, sha1-arm-asm vs sha1-neon-asm:

block-size  bytes/update  old-vs-new
16          16            1.04x
64          16            1.02x
64          64            1.05x
256         16            1.03x
256         64            1.04x
256         256           1.30x
1024        16            1.03x
1024        256           1.36x
1024        1024          1.52x
2048        16            1.03x
2048        256           1.39x
2048        1024          1.55x
2048        2048          1.59x
4096        16            1.03x
4096        256           1.40x
4096        1024          1.57x
4096        4096          1.62x
8192        16            1.03x
8192        256           1.40x
8192        1024          1.58x
8192        4096          1.63x
8192        8192          1.63x

Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org
Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi

---

Changes in v2:
 - Use ENTRY/ENDPROC
 - Don't provide Thumb2 version
 - Move constants to .text section
 - Further tweaks to implementation for ~10% speed-up.

v3:
 - Changelog moved below '---'
---
 arch/arm/crypto/Makefile   |2 
 arch/arm/crypto/sha1-armv7-neon.S  |  634 
 arch/arm/crypto/sha1_glue.c|8 
 arch/arm/crypto/sha1_neon_glue.c   |  197 +++
 arch/arm/include/asm/crypto/sha1.h |   10 +
 crypto/Kconfig |   11 +
 6 files changed, 859 insertions(+), 3 deletions(-)
 create mode 100644 arch/arm/crypto/sha1-armv7-neon.S
 create mode 100644 arch/arm/crypto/sha1_neon_glue.c
 create mode 100644 arch/arm/include/asm/crypto/sha1.h

diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 81cda39..374956d 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -5,10 +5,12 @@
 obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
 obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
+obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 
 aes-arm-y  := aes-armv4.o aes_glue.o
 aes-arm-bs-y   := aesbs-core.o aesbs-glue.o
 sha1-arm-y := sha1-armv4-large.o sha1_glue.o
+sha1-arm-neon-y:= sha1-armv7-neon.o sha1_neon_glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/sha1-armv7-neon.S 
b/arch/arm/crypto/sha1-armv7-neon.S
new file mode 100644
index 0000000..50013c0
--- /dev/null
+++ b/arch/arm/crypto/sha1-armv7-neon.S
@@ -0,0 +1,634 @@
+/* sha1-armv7-neon.S - ARM/NEON accelerated SHA-1 transform function
+ *
+ * Copyright © 2013-2014 Jussi Kivilinna jussi.kivili...@iki.fi
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/linkage.h>
+
+
+.syntax unified
+.code   32
+.fpu neon
+
+.text
+
+
+/* Context structure */
+
+#define state_h0 0
+#define state_h1 4
+#define state_h2 8
+#define state_h3 12
+#define state_h4 16
+
+
+/* Constants */
+
+#define K1  0x5A827999
+#define K2  0x6ED9EBA1
+#define K3  0x8F1BBCDC
+#define K4  0xCA62C1D6
+.align 4
+.LK_VEC:
+.LK1:  .long K1, K1, K1, K1
+.LK2:  .long K2, K2, K2, K2
+.LK3:  .long K3, K3, K3, K3
+.LK4:  .long K4, K4, K4, K4
+
+
+/* Register macros */
+
+#define RSTATE r0
+#define RDATA r1
+#define RNBLKS r2
+#define ROLDSTACK r3
+#define RWK lr
+
+#define _a r4
+#define _b r5
+#define _c r6
+#define _d r7
+#define _e r8
+
+#define RT0 r9
+#define RT1 r10
+#define RT2 r11
+#define RT3 r12
+
+#define W0 q0
+#define W1 q1
+#define W2 q2
+#define W3 q3
+#define W4 q4
+#define W5 q5
+#define W6 q6
+#define W7 q7
+
+#define tmp0 q8
+#define tmp1 q9
+#define tmp2 q10
+#define tmp3 q11
+
+#define qK1 q12
+#define qK2 q13
+#define qK3 q14
+#define qK4 q15
+
+
+/* Round function macros. */
+
+#define WK_offs(i) (((i) & 15) * 4)
+
+#define _R_F1(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
+ W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
+   ldr RT3, [sp, WK_offs(i)]; \
+   pre1(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
+   bic RT0, d, b; \
+   add e, e, a, ror #(32 - 5); \
+   and RT1, c, b; \
+   pre2(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
+   add RT0, RT0, RT3; \
+   add e, e, RT1; \
+   ror b, #(32 - 30); \
+   pre3(i16,W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28); \
+   add e, e, RT0;
+
+#define _R_F2(a,b,c,d,e,i,pre1,pre2,pre3,i16,\
+ W,W_m04,W_m08,W_m12,W_m16,W_m20,W_m24,W_m28) \
+   ldr RT3, [sp, WK_offs(i)]; \
+   
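
For readers following the `_R_F1` macro, the scalar part of one round can be modelled in C. This is a sketch under assumptions, not a translation of the full assembly: it assumes the stack slot read through `WK_offs(i)` already holds W[i] plus the round constant (as the name WK suggests), and it leaves the a..e register renaming between rounds to the caller. All function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Byte offset of slot i in a 16-entry ring of 32-bit words. */
static unsigned int wk_offs(unsigned int i)
{
	return (i & 15) * 4;
}

static uint32_t rol32(uint32_t x, unsigned int n)
{
	assert(n > 0 && n < 32);
	return (x << n) | (x >> (32 - n));
}

/*
 * Ch(b, c, d) built the way the macro does it: "bic" gives d & ~b and
 * "and" gives c & b. The two terms never share a set bit, so adding
 * them yields the same value as OR-ing them.
 */
static uint32_t f1_add(uint32_t b, uint32_t c, uint32_t d)
{
	return (d & ~b) + (c & b);
}

/* Textbook form of the same function, for comparison. */
static uint32_t f1_ref(uint32_t b, uint32_t c, uint32_t d)
{
	return (b & c) | (~b & d);
}

/*
 * New value of e after one round of 0..19. "ror #(32 - 5)" in the macro
 * is a rotate left by 5. The caller also rotates b left by 30 afterwards.
 */
static uint32_t f1_new_e(uint32_t a, uint32_t b, uint32_t c, uint32_t d,
			 uint32_t e, uint32_t wk)
{
	return e + rol32(a, 5) + f1_add(b, c, d) + wk;
}
```

Because addition can replace the OR, the macro is free to fold the terms into `e` in whatever order suits instruction scheduling.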

[PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Jussi Kivilinna
This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384
algorithms.

tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

block-size  bytes/update  old-vs-new
16          16            2.99x
64          16            2.67x
64          64            3.00x
256         16            2.64x
256         64            3.06x
256         256           3.33x
1024        16            2.53x
1024        256           3.39x
1024        1024          3.52x
2048        16            2.50x
2048        256           3.41x
2048        1024          3.54x
2048        2048          3.57x
4096        16            2.49x
4096        256           3.42x
4096        1024          3.56x
4096        4096          3.59x
8192        16            2.48x
8192        256           3.42x
8192        1024          3.56x
8192        4096          3.60x
8192        8192          3.60x

Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org
Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi

---

Changes in v2:
 - Use ENTRY/ENDPROC
 - Don't provide Thumb2 version

v3:
 - Changelog moved below '---'
---
 arch/arm/crypto/Makefile|2 
 arch/arm/crypto/sha512-armv7-neon.S |  455 +++
 arch/arm/crypto/sha512_neon_glue.c  |  305 +++
 crypto/Kconfig  |   15 +
 4 files changed, 777 insertions(+)
 create mode 100644 arch/arm/crypto/sha512-armv7-neon.S
 create mode 100644 arch/arm/crypto/sha512_neon_glue.c

diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
index 374956d..b48fa34 100644
--- a/arch/arm/crypto/Makefile
+++ b/arch/arm/crypto/Makefile
@@ -6,11 +6,13 @@ obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
 obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
 obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
+obj-$(CONFIG_CRYPTO_SHA512_ARM_NEON) += sha512-arm-neon.o
 
 aes-arm-y  := aes-armv4.o aes_glue.o
 aes-arm-bs-y   := aesbs-core.o aesbs-glue.o
 sha1-arm-y := sha1-armv4-large.o sha1_glue.o
 sha1-arm-neon-y:= sha1-armv7-neon.o sha1_neon_glue.o
+sha512-arm-neon-y := sha512-armv7-neon.o sha512_neon_glue.o
 
 quiet_cmd_perl = PERL    $@
       cmd_perl = $(PERL) $(<) > $(@)
diff --git a/arch/arm/crypto/sha512-armv7-neon.S 
b/arch/arm/crypto/sha512-armv7-neon.S
new file mode 100644
index 0000000..fe99472
--- /dev/null
+++ b/arch/arm/crypto/sha512-armv7-neon.S
@@ -0,0 +1,455 @@
+/* sha512-armv7-neon.S  -  ARM/NEON assembly implementation of SHA-512 
transform
+ *
+ * Copyright © 2013-2014 Jussi Kivilinna jussi.kivili...@iki.fi
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ */
+
+#include <linux/linkage.h>
+
+
+.syntax unified
+.code   32
+.fpu neon
+
+.text
+
+/* structure of SHA512_CONTEXT */
+#define hd_a 0
+#define hd_b ((hd_a) + 8)
+#define hd_c ((hd_b) + 8)
+#define hd_d ((hd_c) + 8)
+#define hd_e ((hd_d) + 8)
+#define hd_f ((hd_e) + 8)
+#define hd_g ((hd_f) + 8)
+
+/* register macros */
+#define RK %r2
+
+#define RA d0
+#define RB d1
+#define RC d2
+#define RD d3
+#define RE d4
+#define RF d5
+#define RG d6
+#define RH d7
+
+#define RT0 d8
+#define RT1 d9
+#define RT2 d10
+#define RT3 d11
+#define RT4 d12
+#define RT5 d13
+#define RT6 d14
+#define RT7 d15
+
+#define RT01q q4
+#define RT23q q5
+#define RT45q q6
+#define RT67q q7
+
+#define RW0 d16
+#define RW1 d17
+#define RW2 d18
+#define RW3 d19
+#define RW4 d20
+#define RW5 d21
+#define RW6 d22
+#define RW7 d23
+#define RW8 d24
+#define RW9 d25
+#define RW10 d26
+#define RW11 d27
+#define RW12 d28
+#define RW13 d29
+#define RW14 d30
+#define RW15 d31
+
+#define RW01q q8
+#define RW23q q9
+#define RW45q q10
+#define RW67q q11
+#define RW89q q12
+#define RW1011q q13
+#define RW1213q q14
+#define RW1415q q15
+
+/***
+ * ARM assembly implementation of sha512 transform
+ ***/
+#define rounds2_0_63(ra, rb, rc, rd, re, rf, rg, rh, rw0, rw1, rw01q, rw2, \
+ rw23q, rw1415q, rw9, rw10, interleave_op, arg1) \
+   /* t1 = h + Sum1 (e) + Ch (e, f, g) + k[t] + w[t]; */ \
+   vshr.u64 RT2, re, #14; \
+   vshl.u64 RT3, re, #64 - 14; \
+   interleave_op(arg1); \
+   vshr.u64 RT4, re, #18; \
+   vshl.u64 RT5, re, #64 - 18; \
+   vld1.64 {RT0}, [RK]!; \
+   veor.64 RT23q, RT23q, RT45q; \
+   vshr.u64 RT4, re, #41; \
+   vshl.u64 RT5, re, #64 - 41; \
+   vadd.u64 RT0, RT0, rw0; \
+   veor.64 RT23q, RT23q, RT45q; \
+   
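
The vshr/vshl pairs in `rounds2_0_63` build 64-bit rotations out of two shifts, and the veor instructions fold the pieces together. A scalar C sketch of the Sum1(e) term from the comment in the macro, using the shift amounts 14, 18 and 41 that appear there; the function names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Rotate right by n, built from a right shift and a left shift by
 * 64 - n. The two halves never overlap, so XOR combines them just as
 * OR would, which lets every piece be merged with the same operation.
 */
static uint64_t ror64_pair(uint64_t x, unsigned int n)
{
	assert(n > 0 && n < 64);
	return (x >> n) ^ (x << (64 - n));
}

/* Sum1(e) = ror(e, 14) ^ ror(e, 18) ^ ror(e, 41) */
static uint64_t sum1(uint64_t e)
{
	return ror64_pair(e, 14) ^ ror64_pair(e, 18) ^ ror64_pair(e, 41);
}
```

Since all six shifted values are merged with XOR only, the assembly can process the right-shifted and left-shifted halves side by side in one q register and combine them at the end.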

[PATCH 1/2] [v3] crypto: sha1/ARM: make use of common SHA-1 structures

2014-06-30 Thread Jussi Kivilinna
Common SHA-1 structures are defined in crypto/sha.h for code sharing.

This patch changes SHA-1/ARM glue code to use these structures.

Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi
---
 arch/arm/crypto/sha1_glue.c |   50 +++
 1 file changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/arm/crypto/sha1_glue.c b/arch/arm/crypto/sha1_glue.c
index 76cd976..c494e57 100644
--- a/arch/arm/crypto/sha1_glue.c
+++ b/arch/arm/crypto/sha1_glue.c
@@ -24,31 +24,25 @@
 #include <crypto/sha.h>
 #include <asm/byteorder.h>
 
-struct SHA1_CTX {
-   uint32_t h0,h1,h2,h3,h4;
-   u64 count;
-   u8 data[SHA1_BLOCK_SIZE];
-};
 
-asmlinkage void sha1_block_data_order(struct SHA1_CTX *digest,
+asmlinkage void sha1_block_data_order(u32 *digest,
const unsigned char *data, unsigned int rounds);
 
 
 static int sha1_init(struct shash_desc *desc)
 {
-   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
-   memset(sctx, 0, sizeof(*sctx));
-   sctx->h0 = SHA1_H0;
-   sctx->h1 = SHA1_H1;
-   sctx->h2 = SHA1_H2;
-   sctx->h3 = SHA1_H3;
-   sctx->h4 = SHA1_H4;
+   struct sha1_state *sctx = shash_desc_ctx(desc);
+
+   *sctx = (struct sha1_state){
+   .state = { SHA1_H0, SHA1_H1, SHA1_H2, SHA1_H3, SHA1_H4 },
+   };
+
return 0;
 }
 
 
-static int __sha1_update(struct SHA1_CTX *sctx, const u8 *data,
-  unsigned int len, unsigned int partial)
+static int __sha1_update(struct sha1_state *sctx, const u8 *data,
+unsigned int len, unsigned int partial)
 {
unsigned int done = 0;
 
@@ -56,17 +50,17 @@ static int __sha1_update(struct SHA1_CTX *sctx, const u8 
*data,
 
if (partial) {
done = SHA1_BLOCK_SIZE - partial;
-   memcpy(sctx->data + partial, data, done);
-   sha1_block_data_order(sctx, sctx->data, 1);
+   memcpy(sctx->buffer + partial, data, done);
+   sha1_block_data_order(sctx->state, sctx->buffer, 1);
}
 
if (len - done >= SHA1_BLOCK_SIZE) {
const unsigned int rounds = (len - done) / SHA1_BLOCK_SIZE;
-   sha1_block_data_order(sctx, data + done, rounds);
+   sha1_block_data_order(sctx->state, data + done, rounds);
done += rounds * SHA1_BLOCK_SIZE;
}
 
-   memcpy(sctx->data, data + done, len - done);
+   memcpy(sctx->buffer, data + done, len - done);
return 0;
 }
 
@@ -74,14 +68,14 @@ static int __sha1_update(struct SHA1_CTX *sctx, const u8 
*data,
 static int sha1_update(struct shash_desc *desc, const u8 *data,
 unsigned int len)
 {
-   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
+   struct sha1_state *sctx = shash_desc_ctx(desc);
unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
int res;
 
/* Handle the fast case right here */
if (partial + len < SHA1_BLOCK_SIZE) {
sctx->count += len;
-   memcpy(sctx->data + partial, data, len);
+   memcpy(sctx->buffer + partial, data, len);
return 0;
}
res = __sha1_update(sctx, data, len, partial);
@@ -92,7 +86,7 @@ static int sha1_update(struct shash_desc *desc, const u8 
*data,
 /* Add padding and return the message digest. */
 static int sha1_final(struct shash_desc *desc, u8 *out)
 {
-   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
+   struct sha1_state *sctx = shash_desc_ctx(desc);
unsigned int i, index, padlen;
__be32 *dst = (__be32 *)out;
__be64 bits;
@@ -106,7 +100,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)
/* We need to fill a whole block for __sha1_update() */
if (padlen <= 56) {
sctx->count += padlen;
-   memcpy(sctx->data + index, padding, padlen);
+   memcpy(sctx->buffer + index, padding, padlen);
} else {
__sha1_update(sctx, padding, padlen, index);
}
@@ -114,7 +108,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)
 
/* Store state in digest */
for (i = 0; i < 5; i++)
-   dst[i] = cpu_to_be32(((u32 *)sctx)[i]);
+   dst[i] = cpu_to_be32(sctx->state[i]);
 
/* Wipe context */
memset(sctx, 0, sizeof(*sctx));
@@ -124,7 +118,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)
 
 static int sha1_export(struct shash_desc *desc, void *out)
 {
-   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
+   struct sha1_state *sctx = shash_desc_ctx(desc);
memcpy(out, sctx, sizeof(*sctx));
return 0;
 }
@@ -132,7 +126,7 @@ static int sha1_export(struct shash_desc *desc, void *out)
 
 static int sha1_import(struct shash_desc *desc, const void *in)
 {
-   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
+   struct sha1_state *sctx = shash_desc_ctx(desc);
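
The buffering logic in `sha1_update()` and `__sha1_update()` splits each call into three parts: bytes that top up a partially filled block, whole blocks passed straight to the transform, and a tail kept for the next call. The sketch below models only that arithmetic in userspace; `struct split` and `split_update` are hypothetical names, and no hashing is performed.

```c
#include <assert.h>

#define BLOCK_SIZE 64u	/* SHA-1 block size in bytes */

struct split {
	unsigned int head;	/* bytes used to complete the buffered block */
	unsigned int blocks;	/* whole blocks hashed directly from the input */
	unsigned int tail;	/* bytes copied into the buffer for later */
};

/*
 * partial: bytes already buffered (count % BLOCK_SIZE)
 * len:     bytes supplied by this update call
 */
static struct split split_update(unsigned int partial, unsigned int len)
{
	struct split s = { 0, 0, 0 };
	unsigned int done = 0;

	assert(partial < BLOCK_SIZE);

	/* Fast case: the new data does not complete a block. */
	if (partial + len < BLOCK_SIZE) {
		s.tail = len;
		return s;
	}

	if (partial) {
		done = BLOCK_SIZE - partial;
		s.head = done;
	}

	if (len - done >= BLOCK_SIZE) {
		s.blocks = (len - done) / BLOCK_SIZE;
		done += s.blocks * BLOCK_SIZE;
	}

	s.tail = len - done;
	return s;
}
```

With 10 bytes already buffered and 200 new bytes, 54 bytes complete the pending block, two full blocks are hashed from the input, and 18 bytes remain buffered.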

Re: [PATCH 1/2] [v2] crypto: sha1/ARM: make use of common SHA-1 structures

2014-06-30 Thread Ard Biesheuvel
On 29 June 2014 16:33, Jussi Kivilinna jussi.kivili...@iki.fi wrote:
 Common SHA-1 structures are defined in crypto/sha.h for code sharing.

 This patch changes SHA-1/ARM glue code to use these structures.

 Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
 Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi
 ---

These two should go into Russell's patch system if nobody else has any
more comments.

http://www.arm.linux.org.uk/developer/patches/

-- 
Ard.


  arch/arm/crypto/sha1_glue.c |   50 
 +++
  1 file changed, 22 insertions(+), 28 deletions(-)

 diff --git a/arch/arm/crypto/sha1_glue.c b/arch/arm/crypto/sha1_glue.c
 index 76cd976..c494e57 100644
 --- a/arch/arm/crypto/sha1_glue.c
 +++ b/arch/arm/crypto/sha1_glue.c
 @@ -24,31 +24,25 @@
  #include <crypto/sha.h>
  #include <asm/byteorder.h>

 -struct SHA1_CTX {
 -   uint32_t h0,h1,h2,h3,h4;
 -   u64 count;
 -   u8 data[SHA1_BLOCK_SIZE];
 -};

 -asmlinkage void sha1_block_data_order(struct SHA1_CTX *digest,
 +asmlinkage void sha1_block_data_order(u32 *digest,
 const unsigned char *data, unsigned int rounds);


  static int sha1_init(struct shash_desc *desc)
  {
 -   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
 -   memset(sctx, 0, sizeof(*sctx));
 -   sctx->h0 = SHA1_H0;
 -   sctx->h1 = SHA1_H1;
 -   sctx->h2 = SHA1_H2;
 -   sctx->h3 = SHA1_H3;
 -   sctx->h4 = SHA1_H4;
 +   struct sha1_state *sctx = shash_desc_ctx(desc);
 +
 +   *sctx = (struct sha1_state){
 +   .state = { SHA1_H0, SHA1_H1, SHA1_H2, SHA1_H3, SHA1_H4 },
 +   };
 +
 return 0;
  }


 -static int __sha1_update(struct SHA1_CTX *sctx, const u8 *data,
 -  unsigned int len, unsigned int partial)
 +static int __sha1_update(struct sha1_state *sctx, const u8 *data,
 +unsigned int len, unsigned int partial)
  {
 unsigned int done = 0;

 @@ -56,17 +50,17 @@ static int __sha1_update(struct SHA1_CTX *sctx, const u8 *data,

 if (partial) {
 done = SHA1_BLOCK_SIZE - partial;
 -   memcpy(sctx->data + partial, data, done);
 -   sha1_block_data_order(sctx, sctx->data, 1);
 +   memcpy(sctx->buffer + partial, data, done);
 +   sha1_block_data_order(sctx->state, sctx->buffer, 1);
 }

 if (len - done >= SHA1_BLOCK_SIZE) {
 const unsigned int rounds = (len - done) / SHA1_BLOCK_SIZE;
 -   sha1_block_data_order(sctx, data + done, rounds);
 +   sha1_block_data_order(sctx->state, data + done, rounds);
 done += rounds * SHA1_BLOCK_SIZE;
 }

 -   memcpy(sctx->data, data + done, len - done);
 +   memcpy(sctx->buffer, data + done, len - done);
 return 0;
  }

 @@ -74,14 +68,14 @@ static int __sha1_update(struct SHA1_CTX *sctx, const u8 *data,
  static int sha1_update(struct shash_desc *desc, const u8 *data,
  unsigned int len)
  {
 -   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
 +   struct sha1_state *sctx = shash_desc_ctx(desc);
 unsigned int partial = sctx->count % SHA1_BLOCK_SIZE;
 int res;

 /* Handle the fast case right here */
 if (partial + len < SHA1_BLOCK_SIZE) {
 sctx->count += len;
 -   memcpy(sctx->data + partial, data, len);
 +   memcpy(sctx->buffer + partial, data, len);
 return 0;
 }
 res = __sha1_update(sctx, data, len, partial);
 @@ -92,7 +86,7 @@ static int sha1_update(struct shash_desc *desc, const u8 *data,
  /* Add padding and return the message digest. */
  static int sha1_final(struct shash_desc *desc, u8 *out)
  {
 -   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
 +   struct sha1_state *sctx = shash_desc_ctx(desc);
 unsigned int i, index, padlen;
 __be32 *dst = (__be32 *)out;
 __be64 bits;
 @@ -106,7 +100,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)
 /* We need to fill a whole block for __sha1_update() */
 if (padlen <= 56) {
 sctx->count += padlen;
 -   memcpy(sctx->data + index, padding, padlen);
 +   memcpy(sctx->buffer + index, padding, padlen);
 } else {
 __sha1_update(sctx, padding, padlen, index);
 }
 @@ -114,7 +108,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)

 /* Store state in digest */
 for (i = 0; i < 5; i++)
 -   dst[i] = cpu_to_be32(((u32 *)sctx)[i]);
 +   dst[i] = cpu_to_be32(sctx->state[i]);

 /* Wipe context */
 memset(sctx, 0, sizeof(*sctx));
 @@ -124,7 +118,7 @@ static int sha1_final(struct shash_desc *desc, u8 *out)

  static int sha1_export(struct shash_desc *desc, void *out)
  {
 -   struct SHA1_CTX *sctx = shash_desc_ctx(desc);
 +   struct sha1_state *sctx = shash_desc_ctx(desc);
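
[Annotation, not part of the original message.] The glue-code change quoted above can be summarised with a small user-space sketch. It is only an illustration under stated assumptions: the struct layout (count first, then the state words, then the buffer) is assumed to mirror the shared definition in crypto/sha.h, every `demo_*` name is hypothetical, and the block function is a stand-in that merely counts blocks instead of hashing. It shows the compound-literal initialisation that replaces the memset plus five assignments, and the partial-block buffering that __sha1_update() performs before handing whole blocks to the transform.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLOCK_SIZE 64

/* Hypothetical mirror of the shared state layout (an assumption). */
struct demo_sha1_state {
	uint64_t count;                  /* total bytes fed in so far */
	uint32_t state[5];               /* running digest words */
	uint8_t buffer[DEMO_BLOCK_SIZE]; /* pending partial block */
};

static unsigned int demo_blocks_seen;

/* Stand-in for the assembly transform: it only counts blocks. */
static void demo_block_fn(uint32_t *digest, const uint8_t *data,
			  unsigned int rounds)
{
	(void)digest;
	(void)data;
	demo_blocks_seen += rounds;
}

static void demo_init(struct demo_sha1_state *s)
{
	/* Compound literal: members that are not named are zeroed. */
	*s = (struct demo_sha1_state){
		.state = { 0x67452301, 0xefcdab89, 0x98badcfe,
			   0x10325476, 0xc3d2e1f0 },
	};
}

static void demo_update(struct demo_sha1_state *s, const uint8_t *data,
			unsigned int len)
{
	unsigned int partial = s->count % DEMO_BLOCK_SIZE;
	unsigned int done = 0;

	s->count += len;

	/* Fast case: everything fits in the pending buffer. */
	if (partial + len < DEMO_BLOCK_SIZE) {
		memcpy(s->buffer + partial, data, len);
		return;
	}

	/* Complete and flush the pending block first. */
	if (partial) {
		done = DEMO_BLOCK_SIZE - partial;
		memcpy(s->buffer + partial, data, done);
		demo_block_fn(s->state, s->buffer, 1);
	}

	/* Hash whole blocks straight from the caller's data. */
	if (len - done >= DEMO_BLOCK_SIZE) {
		unsigned int rounds = (len - done) / DEMO_BLOCK_SIZE;

		demo_block_fn(s->state, data + done, rounds);
		done += rounds * DEMO_BLOCK_SIZE;
	}

	/* Stash the tail (always shorter than a block) for the next call. */
	assert(len - done < DEMO_BLOCK_SIZE);
	memcpy(s->buffer, data + done, len - done);
}
```

Under the assumed layout the digest words no longer sit at offset 0 of the context, which would explain why the patch passes sctx->state to sha1_block_data_order() rather than the context pointer itself.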
   

Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Ard Biesheuvel
On 30 June 2014 18:39, Jussi Kivilinna jussi.kivili...@iki.fi wrote:
 This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384
 algorithms.

 tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

 block-size  bytes/update  old-vs-new
 16          16            2.99x
 64          16            2.67x
 64          64            3.00x
 256         16            2.64x
 256         64            3.06x
 256         256           3.33x
 1024        16            2.53x
 1024        256           3.39x
 1024        1024          3.52x
 2048        16            2.50x
 2048        256           3.41x
 2048        1024          3.54x
 2048        2048          3.57x
 4096        16            2.49x
 4096        256           3.42x
 4096        1024          3.56x
 4096        4096          3.59x
 8192        16            2.48x
 8192        256           3.42x
 8192        1024          3.56x
 8192        4096          3.60x
 8192        8192          3.60x

 Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
 Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org
 Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi


Likewise for this one: if nobody has any more comments, this should go
into the patch system.

One remaining question though: is this code (and the SHA1 code) known
to be broken for big endian or just untested?

Thanks,
Ard.

 ---

 Changes in v2:
  - Use ENTRY/ENDPROC
  - Don't provide Thumb2 version

 v3:
  - Changelog moved below '---'
 ---
  arch/arm/crypto/Makefile            |    2
  arch/arm/crypto/sha512-armv7-neon.S |  455 +++
  arch/arm/crypto/sha512_neon_glue.c  |  305 +++
  crypto/Kconfig                      |   15 +
  4 files changed, 777 insertions(+)
  create mode 100644 arch/arm/crypto/sha512-armv7-neon.S
  create mode 100644 arch/arm/crypto/sha512_neon_glue.c

 diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
 index 374956d..b48fa34 100644
 --- a/arch/arm/crypto/Makefile
 +++ b/arch/arm/crypto/Makefile
 @@ -6,11 +6,13 @@ obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
  obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
  obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
  obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 +obj-$(CONFIG_CRYPTO_SHA512_ARM_NEON) += sha512-arm-neon.o

  aes-arm-y  := aes-armv4.o aes_glue.o
  aes-arm-bs-y   := aesbs-core.o aesbs-glue.o
  sha1-arm-y := sha1-armv4-large.o sha1_glue.o
  sha1-arm-neon-y:= sha1-armv7-neon.o sha1_neon_glue.o
 +sha512-arm-neon-y := sha512-armv7-neon.o sha512_neon_glue.o

  quiet_cmd_perl = PERL    $@
        cmd_perl = $(PERL) $(<) > $(@)
 diff --git a/arch/arm/crypto/sha512-armv7-neon.S b/arch/arm/crypto/sha512-armv7-neon.S
 new file mode 100644
 index 000..fe99472
 --- /dev/null
 +++ b/arch/arm/crypto/sha512-armv7-neon.S
 @@ -0,0 +1,455 @@
 +/* sha512-armv7-neon.S  -  ARM/NEON assembly implementation of SHA-512 transform
 + *
 + * Copyright © 2013-2014 Jussi Kivilinna jussi.kivili...@iki.fi
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License as published by the Free
 + * Software Foundation; either version 2 of the License, or (at your option)
 + * any later version.
 + */
 +
 +#include <linux/linkage.h>
 +
 +
 +.syntax unified
 +.code   32
 +.fpu neon
 +
 +.text
 +
 +/* structure of SHA512_CONTEXT */
 +#define hd_a 0
 +#define hd_b ((hd_a) + 8)
 +#define hd_c ((hd_b) + 8)
 +#define hd_d ((hd_c) + 8)
 +#define hd_e ((hd_d) + 8)
 +#define hd_f ((hd_e) + 8)
 +#define hd_g ((hd_f) + 8)
 +
 +/* register macros */
 +#define RK %r2
 +
 +#define RA d0
 +#define RB d1
 +#define RC d2
 +#define RD d3
 +#define RE d4
 +#define RF d5
 +#define RG d6
 +#define RH d7
 +
 +#define RT0 d8
 +#define RT1 d9
 +#define RT2 d10
 +#define RT3 d11
 +#define RT4 d12
 +#define RT5 d13
 +#define RT6 d14
 +#define RT7 d15
 +
 +#define RT01q q4
 +#define RT23q q5
 +#define RT45q q6
 +#define RT67q q7
 +
 +#define RW0 d16
 +#define RW1 d17
 +#define RW2 d18
 +#define RW3 d19
 +#define RW4 d20
 +#define RW5 d21
 +#define RW6 d22
 +#define RW7 d23
 +#define RW8 d24
 +#define RW9 d25
 +#define RW10 d26
 +#define RW11 d27
 +#define RW12 d28
 +#define RW13 d29
 +#define RW14 d30
 +#define RW15 d31
 +
 +#define RW01q q8
 +#define RW23q q9
 +#define RW45q q10
 +#define RW67q q11
 +#define RW89q q12
 +#define RW1011q q13
 +#define RW1213q q14
 +#define RW1415q q15
 +
 +/***
 + * ARM assembly implementation of sha512 transform
 + ***/
 +#define rounds2_0_63(ra, rb, rc, rd, re, rf, rg, rh, rw0, rw1, rw01q, rw2, \
 + rw23q, rw1415q, rw9, rw10, interleave_op, arg1) \
 +   /* t1 = 

Re: [PATCH] [v3] crypto: sha512: add ARM NEON implementation

2014-06-30 Thread Jussi Kivilinna
On 30.06.2014 21:13, Ard Biesheuvel wrote:
 On 30 June 2014 18:39, Jussi Kivilinna jussi.kivili...@iki.fi wrote:
 This patch adds ARM NEON assembly implementation of SHA-512 and SHA-384
 algorithms.

 tcrypt benchmark results on Cortex-A8, sha512-generic vs sha512-neon-asm:

 block-size  bytes/update  old-vs-new
 16          16            2.99x
 64          16            2.67x
 64          64            3.00x
 256         16            2.64x
 256         64            3.06x
 256         256           3.33x
 1024        16            2.53x
 1024        256           3.39x
 1024        1024          3.52x
 2048        16            2.50x
 2048        256           3.41x
 2048        1024          3.54x
 2048        2048          3.57x
 4096        16            2.49x
 4096        256           3.42x
 4096        1024          3.56x
 4096        4096          3.59x
 8192        16            2.48x
 8192        256           3.42x
 8192        1024          3.56x
 8192        4096          3.60x
 8192        8192          3.60x

 Acked-by: Ard Biesheuvel ard.biesheu...@linaro.org
 Tested-by: Ard Biesheuvel ard.biesheu...@linaro.org
 Signed-off-by: Jussi Kivilinna jussi.kivili...@iki.fi

 
 Likewise for this one: if nobody has any more comments, this should go
 into the patch system.
 
 One remaining question though: is this code (and the SHA1 code) known
 to be broken for big endian or just untested?
 

Untested and probably broken, so I've disabled it when CPU_BIG_ENDIAN=y.

-Jussi
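
[Annotation, not part of the original message.] For background on why byte order matters here: SHA-512 is specified over big-endian 64-bit message words, so an implementation running on a little-endian CPU has to byte-swap each word after loading it, and code that hard-codes that swap would misread the message on a big-endian kernel. Whether that is the actual failure mode of this NEON code is not stated in the thread. The helper below is a hypothetical, byte-order-independent load in plain C, shown only to illustrate the convention; it is not taken from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: assemble a 64-bit word from eight message bytes,
 * most significant byte first. Gives the same value on any host byte order.
 */
static uint64_t demo_load_be64(const uint8_t *p)
{
	uint64_t v = 0;
	int i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];

	return v;
}
```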

 Thanks,
 Ard.
 
 ---

 Changes in v2:
  - Use ENTRY/ENDPROC
  - Don't provide Thumb2 version

 v3:
  - Changelog moved below '---'
 ---
  arch/arm/crypto/Makefile            |    2
  arch/arm/crypto/sha512-armv7-neon.S |  455 +++
  arch/arm/crypto/sha512_neon_glue.c  |  305 +++
  crypto/Kconfig                      |   15 +
  4 files changed, 777 insertions(+)
  create mode 100644 arch/arm/crypto/sha512-armv7-neon.S
  create mode 100644 arch/arm/crypto/sha512_neon_glue.c

 diff --git a/arch/arm/crypto/Makefile b/arch/arm/crypto/Makefile
 index 374956d..b48fa34 100644
 --- a/arch/arm/crypto/Makefile
 +++ b/arch/arm/crypto/Makefile
 @@ -6,11 +6,13 @@ obj-$(CONFIG_CRYPTO_AES_ARM) += aes-arm.o
  obj-$(CONFIG_CRYPTO_AES_ARM_BS) += aes-arm-bs.o
  obj-$(CONFIG_CRYPTO_SHA1_ARM) += sha1-arm.o
  obj-$(CONFIG_CRYPTO_SHA1_ARM_NEON) += sha1-arm-neon.o
 +obj-$(CONFIG_CRYPTO_SHA512_ARM_NEON) += sha512-arm-neon.o

  aes-arm-y  := aes-armv4.o aes_glue.o
  aes-arm-bs-y   := aesbs-core.o aesbs-glue.o
  sha1-arm-y := sha1-armv4-large.o sha1_glue.o
  sha1-arm-neon-y:= sha1-armv7-neon.o sha1_neon_glue.o
 +sha512-arm-neon-y := sha512-armv7-neon.o sha512_neon_glue.o

  quiet_cmd_perl = PERL    $@
        cmd_perl = $(PERL) $(<) > $(@)
 diff --git a/arch/arm/crypto/sha512-armv7-neon.S b/arch/arm/crypto/sha512-armv7-neon.S
 new file mode 100644
 index 000..fe99472
 --- /dev/null
 +++ b/arch/arm/crypto/sha512-armv7-neon.S
 @@ -0,0 +1,455 @@
 +/* sha512-armv7-neon.S  -  ARM/NEON assembly implementation of SHA-512 transform
 + *
 + * Copyright © 2013-2014 Jussi Kivilinna jussi.kivili...@iki.fi
 + *
 + * This program is free software; you can redistribute it and/or modify it
 + * under the terms of the GNU General Public License as published by the Free
 + * Software Foundation; either version 2 of the License, or (at your option)
 + * any later version.
 + */
 +
 +#include <linux/linkage.h>
 +
 +
 +.syntax unified
 +.code   32
 +.fpu neon
 +
 +.text
 +
 +/* structure of SHA512_CONTEXT */
 +#define hd_a 0
 +#define hd_b ((hd_a) + 8)
 +#define hd_c ((hd_b) + 8)
 +#define hd_d ((hd_c) + 8)
 +#define hd_e ((hd_d) + 8)
 +#define hd_f ((hd_e) + 8)
 +#define hd_g ((hd_f) + 8)
 +
 +/* register macros */
 +#define RK %r2
 +
 +#define RA d0
 +#define RB d1
 +#define RC d2
 +#define RD d3
 +#define RE d4
 +#define RF d5
 +#define RG d6
 +#define RH d7
 +
 +#define RT0 d8
 +#define RT1 d9
 +#define RT2 d10
 +#define RT3 d11
 +#define RT4 d12
 +#define RT5 d13
 +#define RT6 d14
 +#define RT7 d15
 +
 +#define RT01q q4
 +#define RT23q q5
 +#define RT45q q6
 +#define RT67q q7
 +
 +#define RW0 d16
 +#define RW1 d17
 +#define RW2 d18
 +#define RW3 d19
 +#define RW4 d20
 +#define RW5 d21
 +#define RW6 d22
 +#define RW7 d23
 +#define RW8 d24
 +#define RW9 d25
 +#define RW10 d26
 +#define RW11 d27
 +#define RW12 d28
 +#define RW13 d29
 +#define RW14 d30
 +#define RW15 d31
 +
 +#define RW01q q8
 +#define RW23q q9
 +#define RW45q q10
 +#define RW67q q11
 +#define RW89q q12
 +#define RW1011q q13
 +#define RW1213q q14
 +#define RW1415q q15
 +
 +/***
 + * ARM assembly implementation of sha512 transform
 + ***/
 +#define