Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-20 Thread Binoy Jayan
On 6 March 2017 at 20:08, Gilad Ben-Yossef  wrote:
>
> I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
> on Arm  using CryptoCell hardware accelerator.
>
> There was no difference in performance between 512 and 4096 bytes
> cluster size on the x86_64 (800 MB loop file system)
>
> There was an improvement in latency of 3.2% between 512 and 4096 bytes
> cluster size on the Arm. I expect the performance benefits for this
> test for Binoy's patch to be the same.
>
> In both cases the very naive test was a simple dd with block size of
> 4096 bytes or the raw block device.
>
> I do not know what effect having a bigger cluster size would have on
> have on other more complex file system operations.
> Is there any specific benchmark worth testing with?


The multiple instances issue in /proc/crypto is fixed. It was because of
the IV code itself modifying the algorithm name inadvertently in the
global crypto algorithm lookup table when it was splitting up
"plain(cbc(aes))" into "plain" and "cbc(aes)" so as to invoke the child
algorithm.

I ran a few tests with dd, bonnie and FIO under Qemu - x86 using the
automated script [1] that I wrote to make the testing easy.
The tests were done on software implementations of the algorithms
as the real hardware was not available with me. According to the test,
I found that the sequential reads and writes have a good improvement
(5.7 %) in the data rate with the proposed solution while the random
reads shows a very little improvement. When tested with FIO, the
random writes also shows a small improvement (2.2%) but the random
reads show a little deterioration in performance (4 %).

When tested in arm hardware, only the sequential writes with bonnie
shows improvement (5.6%). All other tests shows degraded performance
in the absence of crypto hardware.

[1] https://github.com/binoyjayan/utilities/blob/master/utils/dmtest
Dependencies: dd [Full version], bonnie, fio

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-20 Thread Binoy Jayan
On 6 March 2017 at 20:08, Gilad Ben-Yossef  wrote:
>
> I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
> on Arm  using CryptoCell hardware accelerator.
>
> There was no difference in performance between 512 and 4096 bytes
> cluster size on the x86_64 (800 MB loop file system)
>
> There was an improvement in latency of 3.2% between 512 and 4096 bytes
> cluster size on the Arm. I expect the performance benefits for this
> test for Binoy's patch to be the same.
>
> In both cases the very naive test was a simple dd with block size of
> 4096 bytes or the raw block device.
>
> I do not know what effect having a bigger cluster size would have on
> have on other more complex file system operations.
> Is there any specific benchmark worth testing with?


The multiple instances issue in /proc/crypto is fixed. It was because of
the IV code itself modifying the algorithm name inadvertently in the
global crypto algorithm lookup table when it was splitting up
"plain(cbc(aes))" into "plain" and "cbc(aes)" so as to invoke the child
algorithm.

I ran a few tests with dd, bonnie and FIO under Qemu - x86 using the
automated script [1] that I wrote to make the testing easy.
The tests were done on software implementations of the algorithms
as the real hardware was not available with me. According to the test,
I found that the sequential reads and writes have a good improvement
(5.7 %) in the data rate with the proposed solution while the random
reads shows a very little improvement. When tested with FIO, the
random writes also shows a small improvement (2.2%) but the random
reads show a little deterioration in performance (4 %).

When tested in arm hardware, only the sequential writes with bonnie
shows improvement (5.6%). All other tests shows degraded performance
in the absence of crypto hardware.

[1] https://github.com/binoyjayan/utilities/blob/master/utils/dmtest
Dependencies: dd [Full version], bonnie, fio

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-06 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 5:38 PM, Milan Broz  wrote:
>
> On 03/01/2017 02:04 PM, Milan Broz wrote:
>> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
>> ...
>>
>>> I can certainly understand if you don't wont to take the patch until
>>> we have results with
>>> dm-crypt itself but the difference between 8 separate invocation of
>>> the engine for 512
>>> bytes of XTS and a single invocation for 4KB are pretty big.
>>
>> Yes, I know it. But the same can be achieved if we just implement
>> 4k sector encryption in dmcrypt. It is incompatible with LUKS1
>> (but next LUKS version will support it) but I think this is not
>> a problem for now.
>>
>> If the underlying device supports atomic write of 4k sectors, then
>> there should not be a problem.
>>
>> This is one of the speed-up I would like to compare with the IV approach,
>> because everyone should benefit from 4k sectors in the end.
>> And no crypto API changes are needed here.
>>
>> (I have an old patch for this, so I will try to revive it.)
>
> If anyone interested, simple experimental patch for larger sector size
> (up to the page size) for dmcrypt is in this branch:
>
> http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector
>
> It would be nice to check what performance gain could be provided
> by this simple approach.


I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
on Arm  using CryptoCell hardware accelerator.

There was no difference in performance between 512 and 4096 bytes
cluster size on the x86_64 (800 MB loop file system)

There was an improvement in latency of 3.2% between 512 and 4096 bytes
cluster size on the Arm. I expect the performance benefits for this
test for Binoy's patch to be the same.

In both cases the very naive test was a simple dd with block size of
4096 bytes or the raw block device.

I do not know what effect having a bigger cluster size would have on
have on other more complex file system operations.
Is there any specific benchmark worth testing with?


Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-06 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 5:38 PM, Milan Broz  wrote:
>
> On 03/01/2017 02:04 PM, Milan Broz wrote:
>> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
>> ...
>>
>>> I can certainly understand if you don't wont to take the patch until
>>> we have results with
>>> dm-crypt itself but the difference between 8 separate invocation of
>>> the engine for 512
>>> bytes of XTS and a single invocation for 4KB are pretty big.
>>
>> Yes, I know it. But the same can be achieved if we just implement
>> 4k sector encryption in dmcrypt. It is incompatible with LUKS1
>> (but next LUKS version will support it) but I think this is not
>> a problem for now.
>>
>> If the underlying device supports atomic write of 4k sectors, then
>> there should not be a problem.
>>
>> This is one of the speed-up I would like to compare with the IV approach,
>> because everyone should benefit from 4k sectors in the end.
>> And no crypto API changes are needed here.
>>
>> (I have an old patch for this, so I will try to revive it.)
>
> If anyone interested, simple experimental patch for larger sector size
> (up to the page size) for dmcrypt is in this branch:
>
> http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector
>
> It would be nice to check what performance gain could be provided
> by this simple approach.


I gave it a spin on a x86_64 with 8 CPUs with AES-NI using cryptd and
on Arm  using CryptoCell hardware accelerator.

There was no difference in performance between 512 and 4096 bytes
cluster size on the x86_64 (800 MB loop file system)

There was an improvement in latency of 3.2% between 512 and 4096 bytes
cluster size on the Arm. I expect the performance benefits for this
test for Binoy's patch to be the same.

In both cases the very naive test was a simple dd with block size of
4096 bytes or the raw block device.

I do not know what effect having a bigger cluster size would have on
have on other more complex file system operations.
Is there any specific benchmark worth testing with?


Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-02 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 3:21 PM, Ondrej Mosnacek  wrote:
> 2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef :
>
> Wouldn't adopting a bulk request API (something like what I tried to
> do here [1]) that allows users to supply multiple messages, each with
> their own IV, fulfill this purpose? That way, we wouldn't need to
> introduce any new modes into Crypto API and the drivers/accelerators
> would only need to provide bulk implementations of common modes
> (xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
> (and possibly other users, too).
>
> I'm not sure how exactly these crypto accelerators work, but wouldn't
> it help if the drivers simply get more messages (in our case sectors)
> in a single call? I wonder, would (efficiently) supporting such a
> scheme require changes in the HW itself or could it be achieved just
> by modifying driver code (let's say specifically for your CryptoCell
> accelerator)?
>
> [1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html
>


>From a general perspective - that is things are expect to be true not
just for CryptoCell but for most HW crypto engines,
you want two things - for the HW engine to be able to burst work for a
long time and than rest for a long time vs. a stop and go scheme
(engine utilization)
and for the average IO transaction to be relatively long (bus utilization)

So, a big cluster size i.e. Milan's proposal) works great - you get both.

Submitting a series of sequential small clusters where the HW can
calculate the IV (e.g. Binoy's proposal) works great if the HW
supports it - you get both.

A batched series of small clusters + IV is less favorable - if your HW
engines has lots of parallel context processing (this is expensive for
HW) you might enjoy engine utilization but the bus utilization will be
low - lots of small transactions.

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-02 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 3:21 PM, Ondrej Mosnacek  wrote:
> 2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef :
>
> Wouldn't adopting a bulk request API (something like what I tried to
> do here [1]) that allows users to supply multiple messages, each with
> their own IV, fulfill this purpose? That way, we wouldn't need to
> introduce any new modes into Crypto API and the drivers/accelerators
> would only need to provide bulk implementations of common modes
> (xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
> (and possibly other users, too).
>
> I'm not sure how exactly these crypto accelerators work, but wouldn't
> it help if the drivers simply get more messages (in our case sectors)
> in a single call? I wonder, would (efficiently) supporting such a
> scheme require changes in the HW itself or could it be achieved just
> by modifying driver code (let's say specifically for your CryptoCell
> accelerator)?
>
> [1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html
>


>From a general perspective - that is things are expect to be true not
just for CryptoCell but for most HW crypto engines,
you want two things - for the HW engine to be able to burst work for a
long time and than rest for a long time vs. a stop and go scheme
(engine utilization)
and for the average IO transaction to be relatively long (bus utilization)

So, a big cluster size i.e. Milan's proposal) works great - you get both.

Submitting a series of sequential small clusters where the HW can
calculate the IV (e.g. Binoy's proposal) works great if the HW
supports it - you get both.

A batched series of small clusters + IV is less favorable - if your HW
engines has lots of parallel context processing (this is expensive for
HW) you might enjoy engine utilization but the bus utilization will be
low - lots of small transactions.

Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz

On 03/01/2017 02:04 PM, Milan Broz wrote:
> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
> ...
> 
>> I can certainly understand if you don't wont to take the patch until
>> we have results with
>> dm-crypt itself but the difference between 8 separate invocation of
>> the engine for 512
>> bytes of XTS and a single invocation for 4KB are pretty big.
> 
> Yes, I know it. But the same can be achieved if we just implement
> 4k sector encryption in dmcrypt. It is incompatible with LUKS1
> (but next LUKS version will support it) but I think this is not
> a problem for now.
> 
> If the underlying device supports atomic write of 4k sectors, then
> there should not be a problem.
> 
> This is one of the speed-up I would like to compare with the IV approach,
> because everyone should benefit from 4k sectors in the end.
> And no crypto API changes are needed here.
> 
> (I have an old patch for this, so I will try to revive it.)

If anyone interested, simple experimental patch for larger sector size
(up to the page size) for dmcrypt is in this branch:

http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector

It would be nice to check what performance gain could be provided
by this simple approach.

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz

On 03/01/2017 02:04 PM, Milan Broz wrote:
> On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
> ...
> 
>> I can certainly understand if you don't wont to take the patch until
>> we have results with
>> dm-crypt itself but the difference between 8 separate invocation of
>> the engine for 512
>> bytes of XTS and a single invocation for 4KB are pretty big.
> 
> Yes, I know it. But the same can be achieved if we just implement
> 4k sector encryption in dmcrypt. It is incompatible with LUKS1
> (but next LUKS version will support it) but I think this is not
> a problem for now.
> 
> If the underlying device supports atomic write of 4k sectors, then
> there should not be a problem.
> 
> This is one of the speed-up I would like to compare with the IV approach,
> because everyone should benefit from 4k sectors in the end.
> And no crypto API changes are needed here.
> 
> (I have an old patch for this, so I will try to revive it.)

If anyone interested, simple experimental patch for larger sector size
(up to the page size) for dmcrypt is in this branch:

http://git.kernel.org/cgit/linux/kernel/git/mbroz/linux.git/log/?h=dm-crypt-4k-sector

It would be nice to check what performance gain could be provided
by this simple approach.

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Ondrej Mosnacek
2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef :
> It really is an observation about overhead of context switches between
> dm-crypt and
> whatever/wherever you handle crypto - be it an off CPU hardware engine
> or a bunch
> of parallel kernel threads running on other cores. You really want to
> burst as much as
> possible.

[...]

>> For XTS you need just simple linear IV. No problem with that, implementation
>> in crypto API and hw is trivial.
>>
>> But for compatible IV (that provides compatibility with loopAES and very old 
>> TrueCrypt),
>> these should be never ever implemented again anywhere.
>
>>
>> Specifically "tcw" is broken, insecure and provided here just to help people 
>> to migrate
>> from old Truecrypt containers. Even Truecrypt followers removed it from the 
>> codebase.
>> (It is basically combination of IV and slight modification of CBC mode. All
>> recent version switched to XTS and plain IV.)
>>
>> So building abstraction over something known to be broken and that is now 
>> intentionally
>> isolated inside dmcrypt is, in my opinion, really not a good idea.
>>
>
> I don't think anyone is interested in these modes. How do you support
> XTS and essiv in
> a generic way without supporting this broken modes is not something
> I'm clear on though.

Wouldn't adopting a bulk request API (something like what I tried to
do here [1]) that allows users to supply multiple messages, each with
their own IV, fulfill this purpose? That way, we wouldn't need to
introduce any new modes into Crypto API and the drivers/accelerators
would only need to provide bulk implementations of common modes
(xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
(and possibly other users, too).

I'm not sure how exactly these crypto accelerators work, but wouldn't
it help if the drivers simply get more messages (in our case sectors)
in a single call? I wonder, would (efficiently) supporting such a
scheme require changes in the HW itself or could it be achieved just
by modifying driver code (let's say specifically for your CryptoCell
accelerator)?

Thanks,
Ondrej

[1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html

>
> Thanks,
> Gilad
>
>
>
> --
> Gilad Ben-Yossef
> Chief Coffee Drinker
>
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Ondrej Mosnacek
2017-03-01 13:42 GMT+01:00 Gilad Ben-Yossef :
> It really is an observation about overhead of context switches between
> dm-crypt and
> whatever/wherever you handle crypto - be it an off CPU hardware engine
> or a bunch
> of parallel kernel threads running on other cores. You really want to
> burst as much as
> possible.

[...]

>> For XTS you need just simple linear IV. No problem with that, implementation
>> in crypto API and hw is trivial.
>>
>> But for compatible IV (that provides compatibility with loopAES and very old 
>> TrueCrypt),
>> these should be never ever implemented again anywhere.
>
>>
>> Specifically "tcw" is broken, insecure and provided here just to help people 
>> to migrate
>> from old Truecrypt containers. Even Truecrypt followers removed it from the 
>> codebase.
>> (It is basically combination of IV and slight modification of CBC mode. All
>> recent version switched to XTS and plain IV.)
>>
>> So building abstraction over something known to be broken and that is now 
>> intentionally
>> isolated inside dmcrypt is, in my opinion, really not a good idea.
>>
>
> I don't think anyone is interested in these modes. How do you support
> XTS and essiv in
> a generic way without supporting this broken modes is not something
> I'm clear on though.

Wouldn't adopting a bulk request API (something like what I tried to
do here [1]) that allows users to supply multiple messages, each with
their own IV, fulfill this purpose? That way, we wouldn't need to
introduce any new modes into Crypto API and the drivers/accelerators
would only need to provide bulk implementations of common modes
(xts(aes), cbc(aes), ...) to provide better performance for dm-crypt
(and possibly other users, too).

I'm not sure how exactly these crypto accelerators work, but wouldn't
it help if the drivers simply get more messages (in our case sectors)
in a single call? I wonder, would (efficiently) supporting such a
scheme require changes in the HW itself or could it be achieved just
by modifying driver code (let's say specifically for your CryptoCell
accelerator)?

Thanks,
Ondrej

[1] https://www.mail-archive.com/linux-crypto@vger.kernel.org/msg23007.html

>
> Thanks,
> Gilad
>
>
>
> --
> Gilad Ben-Yossef
> Chief Coffee Drinker
>
> "If you take a class in large-scale robotics, can you end up in a
> situation where the homework eats your dog?"
>  -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz
On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
...

> I can certainly understand if you don't wont to take the patch until
> we have results with
> dm-crypt itself but the difference between 8 separate invocation of
> the engine for 512
> bytes of XTS and a single invocation for 4KB are pretty big.

Yes, I know it. But the same can be achieved if we just implement
4k sector encryption in dmcrypt. It is incompatible with LUKS1
(but next LUKS version will support it) but I think this is not
a problem for now.

If the underlying device supports atomic write of 4k sectors, then
there should not be a problem.

This is one of the speed-up I would like to compare with the IV approach,
because everyone should benefit from 4k sectors in the end.
And no crypto API changes are needed here.

(I have an old patch for this, so I will try to revive it.)

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz
On 03/01/2017 01:42 PM, Gilad Ben-Yossef wrote:
...

> I can certainly understand if you don't wont to take the patch until
> we have results with
> dm-crypt itself but the difference between 8 separate invocation of
> the engine for 512
> bytes of XTS and a single invocation for 4KB are pretty big.

Yes, I know it. But the same can be achieved if we just implement
4k sector encryption in dmcrypt. It is incompatible with LUKS1
(but next LUKS version will support it) but I think this is not
a problem for now.

If the underlying device supports atomic write of 4k sectors, then
there should not be a problem.

This is one of the speed-up I would like to compare with the IV approach,
because everyone should benefit from 4k sectors in the end.
And no crypto API changes are needed here.

(I have an old patch for this, so I will try to revive it.)

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 11:29 AM, Milan Broz  wrote:
>
> On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> > On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
> >>
> >> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >>>
> >>> I was wondering if this is near to be ready for submission (apart from
> >>> the testmgr.c
> >>> changes) or I need to make some changes to make it similar to the IPSec 
> >>> offload?
> >>
> >> I just tried this and except it registers the IV for every new device 
> >> again, it works...
> >> (After a while you have many duplicate entries in /proc/crypto.)
> >>
> >> But I would like to see some summary why such a big patch is needed in the 
> >> first place.
> >> (During an internal discussions seems that people are already lost in 
> >> mails and
> >> patches here, so Ondra promised me to send some summary mail soon here.)
> >>
> >> IIRC the first initial problem was dmcrypt performance on some embedded
> >> crypto processors that are not able to cope with small crypto requests 
> >> effectively.
> >>
> >>
> >> Do you have some real performance numbers that proves that such a patch is 
> >> adequate?
> >>
> >> I would really like to see the performance issue fixed but I am really not 
> >> sure
> >> this approach works for everyone. It would be better to avoid repeating 
> >> this exercise later.
> >> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> >> to speedup things even for crypt drivers that do not support own IV 
> >> generators.
> >>
> >
> > AFAIK the problem that we are trying to solve is that if the IV is
> > generated outside the crypto API
> > domain than you are forced to have an invocation of the crypto API per
> > each block because you
> > need to provide the IV for each block.
> >
> > By putting the IV generation responsibility in the Crypto API we open
> > the way to do a single invocation
> > of the crypto API for a whole sequence of blocks.
>
> Sure, but this is theory. Does it really work on some hw already?
> Do you have performance measurements or comparison?

I'm working on up streaming a driver for Arm  CryptoCell that supports
this and working
offline to get Binoy a board to test this with. Alas, shipping crypto
HW has its fair share
of regulatory challenges... :-)

I can certainly understand if you don't wont to take the patch until
we have results with
dm-crypt itself but the difference between 8 separate invocation of
the engine for 512
bytes of XTS and a single invocation for 4KB are pretty big.

>From what I know of HW engines I'd be surprised if this is in any way
unique to CryptoCell.

> > For software implementation of XTS this doesn't matter much - but for
> > hardware based XTS providers
>
> It is not only embedded crypto, we have some more reports in the past
> that 512B sectors are not ideal even for other systems.
> (IIRC it was also with AES-NI that represents really big group of users).

I never said anything about embedded :-)

It really is an observation about overhead of context switches between
dm-crypt and
whatever/wherever you handle crypto - be it an off CPU hardware engine
or a bunch
of parallel kernel threads running on other cores. You really want to
burst as much as
possible.


>
> > This lead some vendors to ship hacked up versions of dm-crypt to match
> > the specific crypto hardware
> > they were using, or so I've heard at least - didn't see the code myself.
>
> I saw few version of that. There was a very hacky way to provide 
> request-based dmcrypt
> (see old "Introduce the request handling for dm-crypt" thread on dm-devel).
> This is not the acceptable way but definitely it points to the same problem.
>
> > I believe Binoy is trying to address this in a generic upstream worthy
> > way instead.
>
> IIRC the problem is performance, if we can solve it by some generic way,
> good, but for now it seems to be a big change and just hope it helps later...
>

I see what you're saying. We need number to back this up.

> > Anyway, you are only supposed to see s difference when using a
> > hardware based XTS provider algo
> > that supports IV generation.
> >
> >> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
> >> that
> >> are designed just for old, insecure, compatibility-only containers.
> >>
> >> I really do not think every compatible crap must be accessible through 
> >> crypto API.
> >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do 
> >> that this way
> >> if I know it is accessible outside of dmcrypt internals...)
> >> Even the ESSIV is something that was born to fix predictive IVs (CBC 
> >> watermarking
> >> attacks) for disk encryption only, no reason to expose it outside of disk 
> >> encryption.
> >>
> >
> > The point is that you have more than one implementation of these
> > "compatible crap" - the
> > software implementation that you wrote and potentially multiple
> > hardware implementations
> > 

Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Gilad Ben-Yossef
On Wed, Mar 1, 2017 at 11:29 AM, Milan Broz  wrote:
>
> On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> > On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
> >>
> >> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >>>
> >>> I was wondering if this is near to be ready for submission (apart from
> >>> the testmgr.c
> >>> changes) or I need to make some changes to make it similar to the IPSec 
> >>> offload?
> >>
> >> I just tried this and except it registers the IV for every new device 
> >> again, it works...
> >> (After a while you have many duplicate entries in /proc/crypto.)
> >>
> >> But I would like to see some summary why such a big patch is needed in the 
> >> first place.
> >> (During an internal discussions seems that people are already lost in 
> >> mails and
> >> patches here, so Ondra promised me to send some summary mail soon here.)
> >>
> >> IIRC the first initial problem was dmcrypt performance on some embedded
> >> crypto processors that are not able to cope with small crypto requests 
> >> effectively.
> >>
> >>
> >> Do you have some real performance numbers that proves that such a patch is 
> >> adequate?
> >>
> >> I would really like to see the performance issue fixed but I am really not 
> >> sure
> >> this approach works for everyone. It would be better to avoid repeating 
> >> this exercise later.
> >> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> >> to speedup things even for crypt drivers that do not support own IV 
> >> generators.
> >>
> >
> > AFAIK the problem that we are trying to solve is that if the IV is
> > generated outside the crypto API
> > domain than you are forced to have an invocation of the crypto API per
> > each block because you
> > need to provide the IV for each block.
> >
> > By putting the IV generation responsibility in the Crypto API we open
> > the way to do a single invocation
> > of the crypto API for a whole sequence of blocks.
>
> Sure, but this is theory. Does it really work on some hw already?
> Do you have performance measurements or comparison?

I'm working on up streaming a driver for Arm  CryptoCell that supports
this and working
offline to get Binoy a board to test this with. Alas, shipping crypto
HW has its fair share
of regulatory challenges... :-)

I can certainly understand if you don't wont to take the patch until
we have results with
dm-crypt itself but the difference between 8 separate invocation of
the engine for 512
bytes of XTS and a single invocation for 4KB are pretty big.

>From what I know of HW engines I'd be surprised if this is in any way
unique to CryptoCell.

> > For software implementation of XTS this doesn't matter much - but for
> > hardware based XTS providers
>
> It is not only embedded crypto, we have some more reports in the past
> that 512B sectors are not ideal even for other systems.
> (IIRC it was also with AES-NI that represents really big group of users).

I never said anything about embedded :-)

It really is an observation about overhead of context switches between
dm-crypt and
whatever/wherever you handle crypto - be it an off CPU hardware engine
or a bunch
of parallel kernel threads running on other cores. You really want to
burst as much as
possible.


>
> > This lead some vendors to ship hacked up versions of dm-crypt to match
> > the specific crypto hardware
> > they were using, or so I've heard at least - didn't see the code myself.
>
> I saw few version of that. There was a very hacky way to provide 
> request-based dmcrypt
> (see old "Introduce the request handling for dm-crypt" thread on dm-devel).
> This is not the acceptable way but definitely it points to the same problem.
>
> > I believe Binoy is trying to address this in a generic upstream worthy
> > way instead.
>
> IIRC the problem is performance, if we can solve it by some generic way,
> good, but for now it seems to be a big change and just hope it helps later...
>

I see what you're saying. We need number to back this up.

> > Anyway, you are only supposed to see s difference when using a
> > hardware based XTS provider algo
> > that supports IV generation.
> >
> >> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
> >> that
> >> are designed just for old, insecure, compatibility-only containers.
> >>
> >> I really do not think every compatible crap must be accessible through 
> >> crypto API.
> >> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do 
> >> that this way
> >> if I know it is accessible outside of dmcrypt internals...)
> >> Even the ESSIV is something that was born to fix predictive IVs (CBC 
> >> watermarking
> >> attacks) for disk encryption only, no reason to expose it outside of disk 
> >> encryption.
> >>
> >
> > The point is that you have more than one implementation of these
> > "compatible crap" - the
> > software implementation that you wrote and potentially multiple
> > hardware implementations
> > and putting this in the crypto API domain is 

Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz
On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
>>
>> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
>>>
>>> I was wondering if this is near to be ready for submission (apart from
>>> the testmgr.c
>>> changes) or I need to make some changes to make it similar to the IPSec 
>>> offload?
>>
>> I just tried this and except it registers the IV for every new device again, 
>> it works...
>> (After a while you have many duplicate entries in /proc/crypto.)
>>
>> But I would like to see some summary why such a big patch is needed in the 
>> first place.
>> (During an internal discussions seems that people are already lost in mails 
>> and
>> patches here, so Ondra promised me to send some summary mail soon here.)
>>
>> IIRC the first initial problem was dmcrypt performance on some embedded
>> crypto processors that are not able to cope with small crypto requests 
>> effectively.
>>
>>
>> Do you have some real performance numbers that proves that such a patch is 
>> adequate?
>>
>> I would really like to see the performance issue fixed but I am really not 
>> sure
>> this approach works for everyone. It would be better to avoid repeating this 
>> exercise later.
>> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
>> to speedup things even for crypt drivers that do not support own IV 
>> generators.
>>
> 
> AFAIK the problem that we are trying to solve is that if the IV is
> generated outside the crypto API
> domain than you are forced to have an invocation of the crypto API per
> each block because you
> need to provide the IV for each block.
> 
> By putting the IV generation responsibility in the Crypto API we open
> the way to do a single invocation
> of the crypto API for a whole sequence of blocks.

Sure, but this is theory. Does it really work on some hw already?
Do you have performance measurements or comparison?

> For software implementation of XTS this doesn't matter much - but for
> hardware based XTS providers

It is not only embedded crypto, we have some more reports in the past
that 512B sectors are not ideal even for other systems.
(IIRC it was also with AES-NI that represents really big group of users).

> This lead some vendors to ship hacked up versions of dm-crypt to match
> the specific crypto hardware
> they were using, or so I've heard at least - didn't see the code myself.

I saw few version of that. There was a very hacky way to provide request-based 
dmcrypt
(see old "Introduce the request handling for dm-crypt" thread on dm-devel).
This is not the acceptable way but definitely it points to the same problem.

> I believe Binoy is trying to address this in a generic upstream worthy
> way instead.

IIRC the problem is performance, if we can solve it by some generic way,
good, but for now it seems to be a big change and just hope it helps later...

> Anyway, you are only supposed to see s difference when using a
> hardware based XTS provider algo
> that supports IV generation.
> 
>> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
>> that
>> are designed just for old, insecure, compatibility-only containers.
>>
>> I really do not think every compatible crap must be accessible through 
>> crypto API.
>> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
>> this way
>> if I know it is accessible outside of dmcrypt internals...)
>> Even the ESSIV is something that was born to fix predictive IVs (CBC 
>> watermarking
>> attacks) for disk encryption only, no reason to expose it outside of disk 
>> encryption.
>>
> 
> The point is that you have more than one implementation of these
> "compatible crap" - the
> software implementation that you wrote and potentially multiple
> hardware implementations
> and putting this in the crypto API domain is the way to abstract this
> so you use the one
> that works best of your platform.

For XTS you need just simple linear IV. No problem with that, implementation
in crypto API and hw is trivial.

But for compatible IV (that provides compatibility with loopAES and very old 
TrueCrypt),
these should be never ever implemented again anywhere.

Specifically "tcw" is broken, insecure and provided here just to help people to 
migrate
from old Truecrypt containers. Even Truecrypt followers removed it from the 
codebase.
(It is basically combination of IV and slight modification of CBC mode. All
recent version switched to XTS and plain IV.)

So building abstraction over something known to be broken and that is now 
intentionally
isolated inside dmcrypt is, in my opinion, really not a good idea.


But please do get me wrong,  I do not want to block any improvement.

But it seems to me that this thread focused on creating nice crypto API 
interface
for FDE IVs instead of demonstration that the proposed solution really solves
the performance issue.
And not only for your hw driver, maybe other systems could benefit from 

Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Milan Broz
On 03/01/2017 09:30 AM, Gilad Ben-Yossef wrote:
> On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
>>
>> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
>>>
>>> I was wondering if this is near to be ready for submission (apart from
>>> the testmgr.c
>>> changes) or I need to make some changes to make it similar to the IPSec 
>>> offload?
>>
>> I just tried this and except it registers the IV for every new device again, 
>> it works...
>> (After a while you have many duplicate entries in /proc/crypto.)
>>
>> But I would like to see some summary why such a big patch is needed in the 
>> first place.
>> (During an internal discussions seems that people are already lost in mails 
>> and
>> patches here, so Ondra promised me to send some summary mail soon here.)
>>
>> IIRC the first initial problem was dmcrypt performance on some embedded
>> crypto processors that are not able to cope with small crypto requests 
>> effectively.
>>
>>
>> Do you have some real performance numbers that proves that such a patch is 
>> adequate?
>>
>> I would really like to see the performance issue fixed but I am really not 
>> sure
>> this approach works for everyone. It would be better to avoid repeating this 
>> exercise later.
>> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
>> to speedup things even for crypt drivers that do not support own IV 
>> generators.
>>
> 
> AFAIK the problem that we are trying to solve is that if the IV is
> generated outside the crypto API
> domain than you are forced to have an invocation of the crypto API per
> each block because you
> need to provide the IV for each block.
> 
> By putting the IV generation responsibility in the Crypto API we open
> the way to do a single invocation
> of the crypto API for a whole sequence of blocks.

Sure, but this is theory. Does it really work on some hw already?
Do you have performance measurements or comparison?

> For software implementation of XTS this doesn't matter much - but for
> hardware based XTS providers

It is not only embedded crypto, we have some more reports in the past
that 512B sectors are not ideal even for other systems.
(IIRC it was also with AES-NI that represents really big group of users).

> This lead some vendors to ship hacked up versions of dm-crypt to match
> the specific crypto hardware
> they were using, or so I've heard at least - didn't see the code myself.

I saw few version of that. There was a very hacky way to provide request-based 
dmcrypt
(see old "Introduce the request handling for dm-crypt" thread on dm-devel).
This is not the acceptable way but definitely it points to the same problem.

> I believe Binoy is trying to address this in a generic upstream worthy
> way instead.

IIRC the problem is performance, if we can solve it by some generic way,
good, but for now it seems to be a big change and just hope it helps later...

> Anyway, you are only supposed to see s difference when using a
> hardware based XTS provider algo
> that supports IV generation.
> 
>> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
>> that
>> are designed just for old, insecure, compatibility-only containers.
>>
>> I really do not think every compatible crap must be accessible through 
>> crypto API.
>> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
>> this way
>> if I know it is accessible outside of dmcrypt internals...)
>> Even the ESSIV is something that was born to fix predictive IVs (CBC 
>> watermarking
>> attacks) for disk encryption only, no reason to expose it outside of disk 
>> encryption.
>>
> 
> The point is that you have more than one implementation of these
> "compatible crap" - the
> software implementation that you wrote and potentially multiple
> hardware implementations
> and putting this in the crypto API domain is the way to abstract this
> so you use the one
> that works best of your platform.

For XTS you need just simple linear IV. No problem with that, implementation
in crypto API and hw is trivial.

But for compatible IV (that provides compatibility with loopAES and very old 
TrueCrypt),
these should be never ever implemented again anywhere.

Specifically "tcw" is broken, insecure and provided here just to help people to 
migrate
from old Truecrypt containers. Even Truecrypt followers removed it from the 
codebase.
(It is basically combination of IV and slight modification of CBC mode. All
recent version switched to XTS and plain IV.)

So building abstraction over something known to be broken and that is now 
intentionally
isolated inside dmcrypt is, in my opinion, really not a good idea.


But please do get me wrong,  I do not want to block any improvement.

But it seems to me that this thread focused on creating nice crypto API 
interface
for FDE IVs instead of demonstration that the proposed solution really solves
the performance issue.
And not only for your hw driver, maybe other systems could benefit from the 
better
processing 

Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Gilad Ben-Yossef
On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
>
> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >
> > I was wondering if this is near to be ready for submission (apart from
> > the testmgr.c
> > changes) or I need to make some changes to make it similar to the IPSec 
> > offload?
>
> I just tried this and except it registers the IV for every new device again, 
> it works...
> (After a while you have many duplicate entries in /proc/crypto.)
>
> But I would like to see some summary why such a big patch is needed in the 
> first place.
> (During an internal discussions seems that people are already lost in mails 
> and
> patches here, so Ondra promised me to send some summary mail soon here.)
>
> IIRC the first initial problem was dmcrypt performance on some embedded
> crypto processors that are not able to cope with small crypto requests 
> effectively.
>
>
> Do you have some real performance numbers that proves that such a patch is 
> adequate?
>
> I would really like to see the performance issue fixed but I am really not 
> sure
> this approach works for everyone. It would be better to avoid repeating this 
> exercise later.
> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> to speedup things even for crypt drivers that do not support own IV 
> generators.
>

AFAIK the problem that we are trying to solve is that if the IV is
generated outside the crypto API
domain than you are forced to have an invocation of the crypto API per
each block because you
need to provide the IV for each block.

By putting the IV generation responsibility in the Crypto API we open
the way to do a single invocation
of the crypto API for a whole sequence of blocks.

For software implementation of XTS this doesn't matter much - but for
hardware based XTS providers
that can do the IV generation themselves it's the difference between a
sequence of small invocation,
with all the overhead that that entails  and a single big invocatio -
and this can be big.

This lead some vendors to ship hacked up versions of dm-crypt to match
the specific crypto hardware
they were using, or so I've heard at least - didn't see the code myself.

I believe Binoy is trying to address this in a generic upstream worthy
way instead.

Anyway, you are only supposed to see s difference when using a
hardware based XTS provider algo
that supports IV generation.

> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
> that
> are designed just for old, insecure, compatibility-only containers.
>
> I really do not think every compatible crap must be accessible through crypto 
> API.
> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
> this way
> if I know it is accessible outside of dmcrypt internals...)
> Even the ESSIV is something that was born to fix predictive IVs (CBC 
> watermarking
> attacks) for disk encryption only, no reason to expose it outside of disk 
> encryption.
>

The point is that you have more than one implementation of these
"compatible crap" - the
software implementation that you wrote and potentially multiple
hardware implementations
and putting this in the crypto API domain is the way to abstract this
so you use the one
that works best of your platform.

Thanks,
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-03-01 Thread Gilad Ben-Yossef
On Tue, Feb 28, 2017 at 11:05 PM, Milan Broz  wrote:
>
> On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> >
> > I was wondering if this is near to be ready for submission (apart from
> > the testmgr.c
> > changes) or I need to make some changes to make it similar to the IPSec 
> > offload?
>
> I just tried this and except it registers the IV for every new device again, 
> it works...
> (After a while you have many duplicate entries in /proc/crypto.)
>
> But I would like to see some summary why such a big patch is needed in the 
> first place.
> (During an internal discussions seems that people are already lost in mails 
> and
> patches here, so Ondra promised me to send some summary mail soon here.)
>
> IIRC the first initial problem was dmcrypt performance on some embedded
> crypto processors that are not able to cope with small crypto requests 
> effectively.
>
>
> Do you have some real performance numbers that proves that such a patch is 
> adequate?
>
> I would really like to see the performance issue fixed but I am really not 
> sure
> this approach works for everyone. It would be better to avoid repeating this 
> exercise later.
> IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
> to speedup things even for crypt drivers that do not support own IV 
> generators.
>

AFAIK the problem that we are trying to solve is that if the IV is
generated outside the crypto API
domain than you are forced to have an invocation of the crypto API per
each block because you
need to provide the IV for each block.

By putting the IV generation responsibility in the Crypto API we open
the way to do a single invocation
of the crypto API for a whole sequence of blocks.

For software implementation of XTS this doesn't matter much - but for
hardware based XTS providers
that can do the IV generation themselves it's the difference between a
sequence of small invocation,
with all the overhead that that entails  and a single big invocatio -
and this can be big.

This lead some vendors to ship hacked up versions of dm-crypt to match
the specific crypto hardware
they were using, or so I've heard at least - didn't see the code myself.

I believe Binoy is trying to address this in a generic upstream worthy
way instead.

Anyway, you are only supposed to see s difference when using a
hardware based XTS provider algo
that supports IV generation.

> I like the patch is now contained inside dmcrypt, but it still exposes IVs 
> that
> are designed just for old, insecure, compatibility-only containers.
>
> I really do not think every compatible crap must be accessible through crypto 
> API.
> (I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
> this way
> if I know it is accessible outside of dmcrypt internals...)
> Even the ESSIV is something that was born to fix predictive IVs (CBC 
> watermarking
> attacks) for disk encryption only, no reason to expose it outside of disk 
> encryption.
>

The point is that you have more than one implementation of these
"compatible crap" - the
software implementation that you wrote and potentially multiple
hardware implementations
and putting this in the crypto API domain is the way to abstract this
so you use the one
that works best of your platform.

Thanks,
Gilad


-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-28 Thread Milan Broz
On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> 
> I was wondering if this is near to be ready for submission (apart from
> the testmgr.c
> changes) or I need to make some changes to make it similar to the IPSec 
> offload?

I just tried this and except it registers the IV for every new device again, it 
works...
(After a while you have many duplicate entries in /proc/crypto.)

But I would like to see some summary why such a big patch is needed in the 
first place.
(During an internal discussions seems that people are already lost in mails and
patches here, so Ondra promised me to send some summary mail soon here.)

IIRC the first initial problem was dmcrypt performance on some embedded
crypto processors that are not able to cope with small crypto requests 
effectively.

Do you have some real performance numbers that proves that such a patch is 
adequate?

I would really like to see the performance issue fixed but I am really not sure
this approach works for everyone. It would be better to avoid repeating this 
exercise later.
IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
to speedup things even for crypt drivers that do not support own IV generators.

I like the patch is now contained inside dmcrypt, but it still exposes IVs that
are designed just for old, insecure, compatibility-only containers.

I really do not think every compatible crap must be accessible through crypto 
API.
(I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
this way
if I know it is accessible outside of dmcrypt internals...)
Even the ESSIV is something that was born to fix predictive IVs (CBC 
watermarking
attacks) for disk encryption only, no reason to expose it outside of disk 
encryption.

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-28 Thread Milan Broz
On 02/22/2017 07:12 AM, Binoy Jayan wrote:
> 
> I was wondering if this is near to be ready for submission (apart from
> the testmgr.c
> changes) or I need to make some changes to make it similar to the IPSec 
> offload?

I just tried this and except it registers the IV for every new device again, it 
works...
(After a while you have many duplicate entries in /proc/crypto.)

But I would like to see some summary why such a big patch is needed in the 
first place.
(During an internal discussions seems that people are already lost in mails and
patches here, so Ondra promised me to send some summary mail soon here.)

IIRC the first initial problem was dmcrypt performance on some embedded
crypto processors that are not able to cope with small crypto requests 
effectively.

Do you have some real performance numbers that proves that such a patch is 
adequate?

I would really like to see the performance issue fixed but I am really not sure
this approach works for everyone. It would be better to avoid repeating this 
exercise later.
IIRC Ondra's "bulk" mode, despite rejected, shows that there is a potential
to speedup things even for crypt drivers that do not support own IV generators.

I like the patch is now contained inside dmcrypt, but it still exposes IVs that
are designed just for old, insecure, compatibility-only containers.

I really do not think every compatible crap must be accessible through crypto 
API.
(I wrote the dmcrypt lrw and tcw compatibility IVs and I would never do that 
this way
if I know it is accessible outside of dmcrypt internals...)
Even the ESSIV is something that was born to fix predictive IVs (CBC 
watermarking
attacks) for disk encryption only, no reason to expose it outside of disk 
encryption.

Milan


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-21 Thread Binoy Jayan
Hi Herbert,

On 8 February 2017 at 13:02, Gilad Ben-Yossef  wrote:
> On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan  wrote:
>> ===
>> dm-crypt optimization for larger block sizes
>> ===
>>
>> Currently, the iv generation algorithms are implemented in dm-crypt.c. The 
>> goal
>> is to move these algorithms from the dm layer to the kernel crypto layer by
>> implementing them as template ciphers so they can be used in relation with
>> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of 
>> this
>> patchset, the iv-generation code is moved from the dm layer to the crypto 
>> layer
>> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
>> at a time. Each bio contains the in memory representation of physically
>> contiguous disk blocks. Since the bio itself may not be contiguous in main
>> memory, the dm layer sets up a chained scatterlist of these blocks split into
>> physically contiguous segments in memory so that DMA can be performed.

> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef 

I was wondering if this is near to be ready for submission (apart from
the testmgr.c
changes) or I need to make some changes to make it similar to the IPSec offload?

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-21 Thread Binoy Jayan
Hi Herbert,

On 8 February 2017 at 13:02, Gilad Ben-Yossef  wrote:
> On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan  wrote:
>> ===
>> dm-crypt optimization for larger block sizes
>> ===
>>
>> Currently, the iv generation algorithms are implemented in dm-crypt.c. The 
>> goal
>> is to move these algorithms from the dm layer to the kernel crypto layer by
>> implementing them as template ciphers so they can be used in relation with
>> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of 
>> this
>> patchset, the iv-generation code is moved from the dm layer to the crypto 
>> layer
>> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
>> at a time. Each bio contains the in memory representation of physically
>> contiguous disk blocks. Since the bio itself may not be contiguous in main
>> memory, the dm layer sets up a chained scatterlist of these blocks split into
>> physically contiguous segments in memory so that DMA can be performed.

> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef 

I was wondering if this is near to be ready for submission (apart from
the testmgr.c
changes) or I need to make some changes to make it similar to the IPSec offload?

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-09 Thread Binoy Jayan
On 8 February 2017 at 13:02, Gilad Ben-Yossef  wrote:
>
> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef 
>

Hi Gilad,

Thank you for testing it. Do you have access to a device having crypto
hardware with IV generation capability and associated drivers for let
say, aes with cbc or any another mode? I was wondering if I can customize
it to work with dm-crypt by generating IVs automatically. Please let
me know your thoughts.

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-09 Thread Binoy Jayan
On 8 February 2017 at 13:02, Gilad Ben-Yossef  wrote:
>
> Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
> Arm64) and it works just fine.
>
> Tested-by: Gilad Ben-Yossef 
>

Hi Gilad,

Thank you for testing it. Do you have access to a device having crypto
hardware with IV generation capability and associated drivers for let
say, aes with cbc or any another mode? I was wondering if I can customize
it to work with dm-crypt by generating IVs automatically. Please let
me know your thoughts.

Thanks,
Binoy


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-07 Thread Gilad Ben-Yossef
On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan  wrote:
> ===
> dm-crypt optimization for larger block sizes
> ===
>
> Currently, the iv generation algorithms are implemented in dm-crypt.c. The 
> goal
> is to move these algorithms from the dm layer to the kernel crypto layer by
> implementing them as template ciphers so they can be used in relation with
> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of 
> this
> patchset, the iv-generation code is moved from the dm layer to the crypto 
> layer
> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains the in memory representation of physically
> contiguous disk blocks. Since the bio itself may not be contiguous in main
> memory, the dm layer sets up a chained scatterlist of these blocks split into
> physically contiguous segments in memory so that DMA can be performed.
...
> Binoy Jayan (1):
>   crypto: Add IV generation algorithms
>
>  drivers/md/dm-crypt.c  | 1894 
> ++--
>  include/crypto/geniv.h |   47 ++
>  2 files changed, 1402 insertions(+), 539 deletions(-)
>  create mode 100644 include/crypto/geniv.h

Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
Arm64) and it works just fine.

Tested-by: Gilad Ben-Yossef 

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


Re: [RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-07 Thread Gilad Ben-Yossef
On Tue, Feb 7, 2017 at 12:35 PM, Binoy Jayan  wrote:
> ===
> dm-crypt optimization for larger block sizes
> ===
>
> Currently, the iv generation algorithms are implemented in dm-crypt.c. The 
> goal
> is to move these algorithms from the dm layer to the kernel crypto layer by
> implementing them as template ciphers so they can be used in relation with
> algorithms like aes, and with multiple modes like cbc, ecb etc. As part of 
> this
> patchset, the iv-generation code is moved from the dm layer to the crypto 
> layer
> and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
> at a time. Each bio contains the in memory representation of physically
> contiguous disk blocks. Since the bio itself may not be contiguous in main
> memory, the dm layer sets up a chained scatterlist of these blocks split into
> physically contiguous segments in memory so that DMA can be performed.
...
> Binoy Jayan (1):
>   crypto: Add IV generation algorithms
>
>  drivers/md/dm-crypt.c  | 1894 
> ++--
>  include/crypto/geniv.h |   47 ++
>  2 files changed, 1402 insertions(+), 539 deletions(-)
>  create mode 100644 include/crypto/geniv.h

Ran Bonnie++ on it last night  (Luks mode, plain64, Qemu Virt platform
Arm64) and it works just fine.

Tested-by: Gilad Ben-Yossef 

Gilad

-- 
Gilad Ben-Yossef
Chief Coffee Drinker

"If you take a class in large-scale robotics, can you end up in a
situation where the homework eats your dog?"
 -- Jean-Baptiste Queru


[RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-07 Thread Binoy Jayan
===
dm-crypt optimization for larger block sizes
===

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes]. So, the key
selection code is moved to crypto layer so the neighboring sectors are
encrypted with a different key.

The dm layer allocates space for iv. The hardware drivers can choose to make
use of this space to generate their IVs sequentially or allocate it on their
own. This can be moved to crypto layer too. Postponing this decision until
the requirement to integrate milan's changes are clear.

Interface to the crypto layer - include/crypto/geniv.h

Revisions:
--

v1: https://patchwork.kernel.org/patch/9439175
v2: https://patchwork.kernel.org/patch/9471923
v3: https://lkml.org/lkml/2017/1/18/170

v3 --> v4
--
Fix for the bug reported by Gilad Ben-Yossef.
The element '__ctx' in 'struct skcipher_request req' overflowed into the
element 'struct scatterlist src' which immediately follows 'req' in
'struct geniv_subreq' and corrupted src.

v2 --> v3
--

1. Moved iv algorithms in dm-crypt.c for control
2. Key management code moved from dm layer to cryto layer
   so that cipher instance selection can be made depending on key_index
3. The revision v2 had scatterlist nodes created for every sector in the bio.
   It is modified to create only once scatterlist node to reduce memory
   foot print. Synchronous requests are processed sequentially. Asynchronous
   requests are processed in parallel and is freed in the async callback.
4. Changed allocation for sub-requests using mempool

v1 --> v2
--

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 drivers/md/dm-crypt.c  | 1894 ++--
 include/crypto/geniv.h |   47 ++
 2 files changed, 1402 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan



[RFC PATCH v4] IV Generation algorithms for dm-crypt

2017-02-07 Thread Binoy Jayan
===
dm-crypt optimization for larger block sizes
===

Currently, the iv generation algorithms are implemented in dm-crypt.c. The goal
is to move these algorithms from the dm layer to the kernel crypto layer by
implementing them as template ciphers so they can be used in relation with
algorithms like aes, and with multiple modes like cbc, ecb etc. As part of this
patchset, the iv-generation code is moved from the dm layer to the crypto layer
and adapt the dm-layer to send a whole 'bio' (as defined in the block layer)
at a time. Each bio contains the in memory representation of physically
contiguous disk blocks. Since the bio itself may not be contiguous in main
memory, the dm layer sets up a chained scatterlist of these blocks split into
physically contiguous segments in memory so that DMA can be performed.

One challenge in doing so is that the IVs are generated based on a 512-byte
sector number. This infact limits the block sizes to 512 bytes. But this should
not be a problem if a hardware with iv generation support is used. The geniv
itself splits the segments into sectors so it could choose the IV based on
sector number. But it could be modelled in hardware effectively by not
splitting up the segments in the bio.

Another challenge faced is that dm-crypt has an option to use multiple keys.
The key selection is done based on the sector number. If the whole bio is
encrypted / decrypted with the same key, the encrypted volumes will not be
compatible with the original dm-crypt [without the changes]. So, the key
selection code is moved to crypto layer so the neighboring sectors are
encrypted with a different key.

The dm layer allocates space for iv. The hardware drivers can choose to make
use of this space to generate their IVs sequentially or allocate it on their
own. This can be moved to crypto layer too. Postponing this decision until
the requirement to integrate milan's changes are clear.

Interface to the crypto layer - include/crypto/geniv.h

Revisions:
--

v1: https://patchwork.kernel.org/patch/9439175
v2: https://patchwork.kernel.org/patch/9471923
v3: https://lkml.org/lkml/2017/1/18/170

v3 --> v4
--
Fix for the bug reported by Gilad Ben-Yossef.
The element '__ctx' in 'struct skcipher_request req' overflowed into the
element 'struct scatterlist src' which immediately follows 'req' in
'struct geniv_subreq' and corrupted src.

v2 --> v3
--

1. Moved iv algorithms in dm-crypt.c for control
2. Key management code moved from dm layer to cryto layer
   so that cipher instance selection can be made depending on key_index
3. The revision v2 had scatterlist nodes created for every sector in the bio.
   It is modified to create only once scatterlist node to reduce memory
   foot print. Synchronous requests are processed sequentially. Asynchronous
   requests are processed in parallel and is freed in the async callback.
4. Changed allocation for sub-requests using mempool

v1 --> v2
--

1. dm-crypt changes to process larger block sizes (one segment in a bio)
2. Incorporated changes w.r.t. comments from Herbert.

Binoy Jayan (1):
  crypto: Add IV generation algorithms

 drivers/md/dm-crypt.c  | 1894 ++--
 include/crypto/geniv.h |   47 ++
 2 files changed, 1402 insertions(+), 539 deletions(-)
 create mode 100644 include/crypto/geniv.h

-- 
Binoy Jayan