Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On 6 March 2017 at 20:08, Gilad Ben-Yossefwrote: > > I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and > on Arm using CryptoCell hardware accelerator. > > There was no difference in performance between 512 and 4096 bytes > cluster size on the x86_64 (800 MB loop file system) > > There was an improvement in latency of 3.2% between 512 and 4096 bytes > cluster size on the Arm. I expect the performance benefits for this > test for Binoy's patch to be the same. > > In both cases the very naive test was a simple dd with block size of > 4096 bytes or the raw block device. > > I do not know what effect having a bigger cluster size would have on > have on other more complex file system operations. > Is there any specific benchmark worth testing with? The multiple instances issue in /proc/crypto is fixed. It was because of the IV code itself modifying the algorithm name inadvertently in the global crypto algorithm lookup table when it was splitting up "plain(cbc(aes))" into "plain" and "cbc(aes)" so as to invoke the child algorithm. I ran a few tests with dd, bonnie and FIO under Qemu - x86 using the automated script [1] that I wrote to make the testing easy. The tests were done on software implementations of the algorithms as the real hardware was not available with me. According to the test, I found that the sequential reads and writes have a good improvement (5.7 %) in the data rate with the proposed solution while the random reads shows a very little improvement. When tested with FIO, the random writes also shows a small improvement (2.2%) but the random reads show a little deterioration in performance (4 %). When tested in arm hardware, only the sequential writes with bonnie shows improvement (5.6%). All other tests shows degraded performance in the absence of crypto hardware. [1] https://github.com/binoyjayan/utilities/blob/master/utils/dmtest Dependencies: dd [Full version], bonnie, fio Thanks, Binoy
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On Wed, Mar 1, 2017 at 5:38 PM, Milan Brozwrote: > > On 03/01/2017 02:04 PM, Milan Broz wrote: >> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote: >> ... >> >>> I can certainly understand if you don't wont to take the patch until >>> we have results with >>> dm-crypt itself but the difference between 8 separate invocation of >>> the engine for 512 >>> bytes of XTS and a single invocation for 4KB are pretty big. >> >> Yes, I know it. But the same can be achieved if we just implement >> 4k sector encryption in dmcrypt. It is incompatible with LUKS1 >> (but next LUKS version will support it) but I think this is not >> a problem for now. >> >> If the underlying device supports atomic write of 4k sectors, then >> there should not be a problem. >> >> This is one of the speed-up I would like to compare with the IV approach, >> because everyone should benefit from 4k sectors in the end. >> And no crypto API changes are needed here. >> >> (I have an old patch for this, so I will try to revive it.) > > If anyone interested, simple experimental patch for larger sector size > (up to the page size) for dmcrypt is in this branch: > > http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector > > It would be nice to check what performance gain could be provided > by this simple approach. I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and on Arm using CryptoCell hardware accelerator. There was no difference in performance between 512 and 4096 bytes cluster size on the x86_64 (800 MB loop file system) There was an improvement in latency of 3.2% between 512 and 4096 bytes cluster size on the Arm. I expect the performance benefits for this test for Binoy's patch to be the same. In both cases the very naive test was a simple dd with block size of 4096 bytes or the raw block device. I do not know what effect having a bigger cluster size would have on have on other more complex file system operations. Is there any specific benchmark worth testing with? Gilad -- Gilad Ben-Yossef Chief Coffee Drinker "If you take a class in large-scale robotics, can you end up in a situation where the homework eats your dog?" -- Jean-Baptiste Queru
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On 03/01/2017 02:04 PM, Milan Broz wrote: > On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote: > ... > >> I can certainly understand if you don't wont to take the patch until >> we have results with >> dm-crypt itself but the difference between 8 separate invocation of >> the engine for 512 >> bytes of XTS and a single invocation for 4KB are pretty big. > > Yes, I know it. But the same can be achieved if we just implement > 4k sector encryption in dmcrypt. It is incompatible with LUKS1 > (but next LUKS version will support it) but I think this is not > a problem for now. > > If the underlying device supports atomic write of 4k sectors, then > there should not be a problem. > > This is one of the speed-up I would like to compare with the IV approach, > because everyone should benefit from 4k sectors in the end. > And no crypto API changes are needed here. > > (I have an old patch for this, so I will try to revive it.) If anyone interested, simple experimental patch for larger sector size (up to the page size) for dmcrypt is in this branch: http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector It would be nice to check what performance gain could be provided by this simple approach. Milan
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef: > It really is an observation about overhead of context switches between > dm-crypt and > whatever/wherever you handle crypto - be it an off CPU hardware engine > or a bunch > of parallel kernel threads running on other cores. You really want to > burst as much as > possible. [...] >> For XTS you need just simple linear IV. No problem with that, implementation >> in crypto API and hw is trivial. >> >> But for compatible IV (that provides compatibility with loopAES and very old >> TrueCrypt), >> these should be never ever implemented again anywhere. > >> >> Specifically "tcw" is broken, insecure and provided here just to help people >> to migrate >> from old Truecrypt containers. Even Truecrypt followers removed it from the >> codebase. >> (It is basically combination of IV and slight modification of CBC mode. All >> recent version switched to XTS and plain IV.) >> >> So building abstraction over something known to be broken and that is now >> intentionally >> isolated inside dmcrypt is, in my opinion, really not a good idea. >> > > I don't think anyone is interested in these modes. How do you support > XTS and essiv in > a generic way without supporting this broken modes is not something > I'm clear on though. Wouldn't adopting a bulk request API (something like what I tried to do here [1]) that allows users to supply multiple messages, each with their own IV, fulfill this purpose? That way, we wouldn't need to introduce any new modes into Crypto API and the drivers/accelerators would only need to provide bulk implementations of common modes (xts(aes), cbc(aes), ...) to provide better performance for dm-crypt (and possibly other users, too). I'm not sure how exactly these crypto accelerators work, but wouldn't it help if the drivers simply get more messages (in our case sectors) in a single call? I wonder, would (efficiently) supporting such a scheme require changes in the HW itself or could it be achieved just by modifying driver code (let's say specifically for your CryptoCell accelerator)? Thanks, Ondrej [1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html > > Thanks, > Gilad > > > > -- > Gilad Ben-Yossef > Chief Coffee Drinker > > "If you take a class in large-scale robotics, can you end up in a > situation where the homework eats your dog?" > -- Jean-Baptiste Queru
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote: ... > I can certainly understand if you don't wont to take the patch until > we have results with > dm-crypt itself but the difference between 8 separate invocation of > the engine for 512 > bytes of XTS and a single invocation for 4KB are pretty big. Yes, I know it. But the same can be achieved if we just implement 4k sector encryption in dmcrypt. It is incompatible with LUKS1 (but next LUKS version will support it) but I think this is not a problem for now. If the underlying device supports atomic write of 4k sectors, then there should not be a problem. This is one of the speed-up I would like to compare with the IV approach, because everyone should benefit from 4k sectors in the end. And no crypto API changes are needed here. (I have an old patch for this, so I will try to revive it.) Milan
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On Wed, Mar 1, 2017 at 11:29 AM, Milan Brozwrote: > > On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote: > > On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz wrote: > >> > >> On 02/22/2017 07:12 AM, Binoy Jayan wrote: > >>> > >>> I was wondering if this is near to be ready for submission (apart from > >>> the testmgr.c > >>> changes) or I need to make some changes to make it similar to the IPSec > >>> offload? > >> > >> I just tried this and except it registers the IV for every new device > >> again, it works... > >> (After a while you have many duplicate entries in /proc/crypto.) > >> > >> But I would like to see some summary why such a big patch is needed in the > >> first place. > >> (During an internal discussions seems that people are already lost in > >> mails and > >> patches here, so Ondra promised me to send some summary mail soon here.) > >> > >> IIRC the first initial problem was dmcrypt performance on some embedded > >> crypto processors that are not able to cope with small crypto requests > >> effectively. > >> > >> > >> Do you have some real performance numbers that proves that such a patch is > >> adequate? > >> > >> I would really like to see the performance issue fixed but I am really not > >> sure > >> this approach works for everyone. It would be better to avoid repeating > >> this exercise later. > >> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential > >> to speedup things even for crypt drivers that do not support own IV > >> generators. > >> > > > > AFAIK the problem that we are trying to solve is that if the IV is > > generated outside the crypto API > > domain than you are forced to have an invocation of the crypto API per > > each block because you > > need to provide the IV for each block. > > > > By putting the IV generation responsibility in the Crypto API we open > > the way to do a single invocation > > of the crypto API for a whole sequence of blocks. > > Sure, but this is theory. Does it really work on some hw already? > Do you have performance measurements or comparison? I'm working on up streaming a driver for Arm CryptoCell that supports this and working offline to get Binoy a board to test this with. Alas, shipping crypto HW has its fair share of regulatory challenges... :-) I can certainly understand if you don't wont to take the patch until we have results with dm-crypt itself but the difference between 8 separate invocation of the engine for 512 bytes of XTS and a single invocation for 4KB are pretty big. >From what I know of HW engines I'd be surprised if this is in any way unique to CryptoCell. > > For software implementation of XTS this doesn't matter much - but for > > hardware based XTS providers > > It is not only embedded crypto, we have some more reports in the past > that 512B sectors are not ideal even for other systems. > (IIRC it was also with AES-NI that represents really big group of users). I never said anything about embedded :-) It really is an observation about overhead of context switches between dm-crypt and whatever/wherever you handle crypto - be it an off CPU hardware engine or a bunch of parallel kernel threads running on other cores. You really want to burst as much as possible. > > > This lead some vendors to ship hacked up versions of dm-crypt to match > > the specific crypto hardware > > they were using, or so I've heard at least - didn't see the code myself. > > I saw few version of that. There was a very hacky way to provide > request-based dmcrypt > (see old "Introduce the request handling for dm-crypt" thread on dm-devel). > This is not the acceptable way but definitely it points to the same problem. > > > I believe Binoy is trying to address this in a generic upstream worthy > > way instead. > > IIRC the problem is performance, if we can solve it by some generic way, > good, but for now it seems to be a big change and just hope it helps later... > I see what you're saying. We need number to back this up. > > Anyway, you are only supposed to see s difference when using a > > hardware based XTS provider algo > > that supports IV generation. > > > >> I like the patch is now contained inside dmcrypt, but it still exposes IVs > >> that > >> are designed just for old, insecure, compatibility-only containers. > >> > >> I really do not think every compatible crap must be accessible through > >> crypto API. > >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do > >> that this way > >> if I know it is accessible outside of dmcrypt internals...) > >> Even the ESSIV is something that was born to fix predictive IVs (CBC > >> watermarking > >> attacks) for disk encryption only, no reason to expose it outside of disk > >> encryption. > >> > > > > The point is that you have more than one implementation of these > > "compatible crap" - the > > software implementation that you wrote and potentially multiple > > hardware implementations > >
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote: > On Tue, Feb 28, 2017 at 11:05 PM, Milan Brozwrote: >> >> On 02/22/2017 07:12 AM, Binoy Jayan wrote: >>> >>> I was wondering if this is near to be ready for submission (apart from >>> the testmgr.c >>> changes) or I need to make some changes to make it similar to the IPSec >>> offload? >> >> I just tried this and except it registers the IV for every new device again, >> it works... >> (After a while you have many duplicate entries in /proc/crypto.) >> >> But I would like to see some summary why such a big patch is needed in the >> first place. >> (During an internal discussions seems that people are already lost in mails >> and >> patches here, so Ondra promised me to send some summary mail soon here.) >> >> IIRC the first initial problem was dmcrypt performance on some embedded >> crypto processors that are not able to cope with small crypto requests >> effectively. >> >> >> Do you have some real performance numbers that proves that such a patch is >> adequate? >> >> I would really like to see the performance issue fixed but I am really not >> sure >> this approach works for everyone. It would be better to avoid repeating this >> exercise later. >> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential >> to speedup things even for crypt drivers that do not support own IV >> generators. >> > > AFAIK the problem that we are trying to solve is that if the IV is > generated outside the crypto API > domain than you are forced to have an invocation of the crypto API per > each block because you > need to provide the IV for each block. > > By putting the IV generation responsibility in the Crypto API we open > the way to do a single invocation > of the crypto API for a whole sequence of blocks. Sure, but this is theory. Does it really work on some hw already? Do you have performance measurements or comparison? > For software implementation of XTS this doesn't matter much - but for > hardware based XTS providers It is not only embedded crypto, we have some more reports in the past that 512B sectors are not ideal even for other systems. (IIRC it was also with AES-NI that represents really big group of users). > This lead some vendors to ship hacked up versions of dm-crypt to match > the specific crypto hardware > they were using, or so I've heard at least - didn't see the code myself. I saw few version of that. There was a very hacky way to provide request-based dmcrypt (see old "Introduce the request handling for dm-crypt" thread on dm-devel). This is not the acceptable way but definitely it points to the same problem. > I believe Binoy is trying to address this in a generic upstream worthy > way instead. IIRC the problem is performance, if we can solve it by some generic way, good, but for now it seems to be a big change and just hope it helps later... > Anyway, you are only supposed to see s difference when using a > hardware based XTS provider algo > that supports IV generation. > >> I like the patch is now contained inside dmcrypt, but it still exposes IVs >> that >> are designed just for old, insecure, compatibility-only containers. >> >> I really do not think every compatible crap must be accessible through >> crypto API. >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that >> this way >> if I know it is accessible outside of dmcrypt internals...) >> Even the ESSIV is something that was born to fix predictive IVs (CBC >> watermarking >> attacks) for disk encryption only, no reason to expose it outside of disk >> encryption. >> > > The point is that you have more than one implementation of these > "compatible crap" - the > software implementation that you wrote and potentially multiple > hardware implementations > and putting this in the crypto API domain is the way to abstract this > so you use the one > that works best of your platform. For XTS you need just simple linear IV. No problem with that, implementation in crypto API and hw is trivial. But for compatible IV (that provides compatibility with loopAES and very old TrueCrypt), these should be never ever implemented again anywhere. Specifically "tcw" is broken, insecure and provided here just to help people to migrate from old Truecrypt containers. Even Truecrypt followers removed it from the codebase. (It is basically combination of IV and slight modification of CBC mode. All recent version switched to XTS and plain IV.) So building abstraction over something known to be broken and that is now intentionally isolated inside dmcrypt is, in my opinion, really not a good idea. But please do get me wrong, I do not want to block any improvement. But it seems to me that this thread focused on creating nice crypto API interface for FDE IVs instead of demonstration that the proposed solution really solves the performance issue. And not only for your hw driver, maybe other systems could benefit from
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On Tue, Feb 28, 2017 at 11:05 PM, Milan Brozwrote: > > On 02/22/2017 07:12 AM, Binoy Jayan wrote: > > > > I was wondering if this is near to be ready for submission (apart from > > the testmgr.c > > changes) or I need to make some changes to make it similar to the IPSec > > offload? > > I just tried this and except it registers the IV for every new device again, > it works... > (After a while you have many duplicate entries in /proc/crypto.) > > But I would like to see some summary why such a big patch is needed in the > first place. > (During an internal discussions seems that people are already lost in mails > and > patches here, so Ondra promised me to send some summary mail soon here.) > > IIRC the first initial problem was dmcrypt performance on some embedded > crypto processors that are not able to cope with small crypto requests > effectively. > > > Do you have some real performance numbers that proves that such a patch is > adequate? > > I would really like to see the performance issue fixed but I am really not > sure > this approach works for everyone. It would be better to avoid repeating this > exercise later. > IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential > to speedup things even for crypt drivers that do not support own IV > generators. > AFAIK the problem that we are trying to solve is that if the IV is generated outside the crypto API domain than you are forced to have an invocation of the crypto API per each block because you need to provide the IV for each block. By putting the IV generation responsibility in the Crypto API we open the way to do a single invocation of the crypto API for a whole sequence of blocks. For software implementation of XTS this doesn't matter much - but for hardware based XTS providers that can do the IV generation themselves it's the difference between a sequence of small invocation, with all the overhead that that entails and a single big invocatio - and this can be big. This lead some vendors to ship hacked up versions of dm-crypt to match the specific crypto hardware they were using, or so I've heard at least - didn't see the code myself. I believe Binoy is trying to address this in a generic upstream worthy way instead. Anyway, you are only supposed to see s difference when using a hardware based XTS provider algo that supports IV generation. > I like the patch is now contained inside dmcrypt, but it still exposes IVs > that > are designed just for old, insecure, compatibility-only containers. > > I really do not think every compatible crap must be accessible through crypto > API. > (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that > this way > if I know it is accessible outside of dmcrypt internals...) > Even the ESSIV is something that was born to fix predictive IVs (CBC > watermarking > attacks) for disk encryption only, no reason to expose it outside of disk > encryption. > The point is that you have more than one implementation of these "compatible crap" - the software implementation that you wrote and potentially multiple hardware implementations and putting this in the crypto API domain is the way to abstract this so you use the one that works best of your platform. Thanks, Gilad -- Gilad Ben-Yossef Chief Coffee Drinker "If you take a class in large-scale robotics, can you end up in a situation where the homework eats your dog?" -- Jean-Baptiste Queru
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
Hi Herbert, On 8 February 2017 at 13:02, Gilad Ben-Yossefwrote: > On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan wrote: >> === >> dm-crypt optimization for larger block sizes >> === >> >> Currently, the iv generation algorithms are implemented in dm-crypt.c. The >> goal >> is to move these algorithms from the dm layer to the kernel crypto layer by >> implementing them as template ciphers so they can be used in relation with >> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of >> this >> patchset, the iv-generation code is moved from the dm layer to the crypto >> layer >> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer) >> at a time. Each bio contains the in memory representation of physically >> contiguous disk blocks. Since the bio itself may not be contiguous in main >> memory, the dm layer sets up a chained scatterlist of these blocks split into >> physically contiguous segments in memory so that DMA can be performed. > Ran Bonnie++ on it last night (Luks mode, plain64, Qemu Virt platform > Arm64) and it works just fine. > > Tested-by: Gilad Ben-Yossef I was wondering if this is near to be ready for submission (apart from the testmgr.c changes) or I need to make some changes to make it similar to the IPSec offload? Thanks, Binoy
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On 8 February 2017 at 13:02, Gilad Ben-Yossefwrote: > > Ran Bonnie++ on it last night (Luks mode, plain64, Qemu Virt platform > Arm64) and it works just fine. > > Tested-by: Gilad Ben-Yossef > Hi Gilad, Thank you for testing it. Do you have access to a device having crypto hardware with IV generation capability and associated drivers for let say, aes with cbc or any another mode? I was wondering if I can customize it to work with dm-crypt by generating IVs automatically. Please let me know your thoughts. Thanks, Binoy
Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt
On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayanwrote: > === > dm-crypt optimization for larger block sizes > === > > Currently, the iv generation algorithms are implemented in dm-crypt.c. The > goal > is to move these algorithms from the dm layer to the kernel crypto layer by > implementing them as template ciphers so they can be used in relation with > algorithms like aes, and with multiple modes like cbc, ecb etc. As part of > this > patchset, the iv-generation code is moved from the dm layer to the crypto > layer > and adapt the dm-layer to send a whole 'bio' (as defined in the block layer) > at a time. Each bio contains the in memory representation of physically > contiguous disk blocks. Since the bio itself may not be contiguous in main > memory, the dm layer sets up a chained scatterlist of these blocks split into > physically contiguous segments in memory so that DMA can be performed. ... > Binoy Jayan (1): > crypto: Add IV generation algorithms > > drivers/md/dm-crypt.c | 1894 > ++-- > include/crypto/geniv.h | 47 ++ > 2 files changed, 1402 insertions(+), 539 deletions(-) > create mode 100644 include/crypto/geniv.h Ran Bonnie++ on it last night (Luks mode, plain64, Qemu Virt platform Arm64) and it works just fine. Tested-by: Gilad Ben-Yossef Gilad -- Gilad Ben-Yossef Chief Coffee Drinker "If you take a class in large-scale robotics, can you end up in a situation where the homework eats your dog?" -- Jean-Baptiste Queru