The last patch follows the C implementation but I just figured out a decent
way to do it.
---
powerpc64/p7/chacha-core-internal.asm | 22 +-
1 file changed, 21 insertions(+), 1 deletion(-)
diff --git a/powerpc64/p7/chacha-core-internal.asm
---
powerpc64/p7/chacha-core-internal.asm | 55 ++-
1 file changed, 54 insertions(+), 1 deletion(-)
diff --git a/powerpc64/p7/chacha-core-internal.asm b/powerpc64/p7/chacha-core-internal.asm
index 33c721c1..922050ff 100644
---
Writing .align explicitly instead of defining FUNC_ALIGN has no negative
effects, except that the function won't be aligned in big-endian mode.
It looks like some additional operations are needed for big-endian mode
before storing the results to the 'dst' buffer, in
chacha-core-internal.c:
Maamoun TK writes:
> Great work. The implementation looks fine, I like the idea of using -16
> instead of 16 for rotating because vspltisw is limited to (-16 to 15)
> and vrlw picks the low-order 5 bits which is the same for both -16 and
> 16.
I picked up that trick from Torbjörn Granlund's
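
To illustrate the rotate-count trick quoted above, here is a minimal scalar
C model. It is a sketch only, not code from the patch: the function names
are hypothetical, and it models a single 32-bit lane rather than a full
vector register. It assumes only what the quoted text states, namely that
the vspltisw immediate is limited to -16..15 and that vrlw uses the
low-order 5 bits of the count.

```c
#include <assert.h>
#include <stdint.h>

/* Model of one 32-bit lane of vrlw: only the low-order 5 bits of the
   count word are used as the rotate amount. (Hypothetical helper.) */
static uint32_t
rotl32_low5 (uint32_t x, int32_t count)
{
  unsigned n = (uint32_t) count & 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}

/* The vspltisw immediate is a 5-bit signed value, so -16..15. */
static int
fits_vspltisw (int imm)
{
  return imm >= -16 && imm <= 15;
}

/* 16 cannot be splatted directly, but -16 can, and both select the
   same rotate amount because (-16 & 31) == 16. */
static void
check_rotate_by_16 (void)
{
  assert (!fits_vspltisw (16));
  assert (fits_vspltisw (-16));
  assert (rotl32_low5 (0x12345678u, -16) == rotl32_low5 (0x12345678u, 16));
  assert (rotl32_low5 (0x12345678u, -16) == 0x56781234u);
}
```

Calling check_rotate_by_16 () from a test program should pass silently.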
Jeffrey Walton writes:
> I hope I'm not crossing my wires, but doesn't ChaCha core require a
> counter addition?
Sure, but nettle's _chacha_core function (what I've implemented so far
for ppc) does a single block, and doesn't modify the counter. Variants
like _chacha_3core (currently
On Fri, Sep 25, 2020 at 11:04 AM Jeffrey Walton wrote:
>
> On Fri, Sep 25, 2020 at 10:25 AM Niels Möller wrote:
> >
> > Jeffrey Walton writes:
> > ...
> It should be easy enough to test. Start with a counter of 0xfff8
> and encrypt a couple of [64-byte] blocks. You can use Bernstein's
>
On Fri, Sep 25, 2020 at 10:25 AM Niels Möller wrote:
>
> Jeffrey Walton writes:
>
> > I believe the 64-bit adds (addudm) and subtracts (subudm) require
> > POWER8.
>
> I don't think there are any 64-bit adds in my chacha code, only 32-bit,
> vadduwm. The chacha state is fundamentally 16 32-bit
Yes, it would make sense.
On Fri, Sep 25, 2020 at 5:25 PM Niels Möller wrote:
> Jeffrey Walton writes:
>
> > I believe the 64-bit adds (addudm) and subtracts (subudm) require
> > POWER8.
>
> I don't think there are any 64-bit adds in my chacha code, only 32-bit,
> vadduwm. The chacha state is
Jeffrey Walton writes:
> I believe the 64-bit adds (addudm) and subtracts (subudm) require
> POWER8.
I don't think there are any 64-bit adds in my chacha code, only 32-bit,
vadduwm. The chacha state is fundamentally 16 32-bit words, with
operations very friendly to 4-way simd.
Using 64-bit
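
As a rough illustration of why the state maps well onto 4-way SIMD, the
sketch below applies one ChaCha quarter round to four independent lanes
using only 32-bit modular adds, xors and rotates. It is a plain C model
under stated assumptions, not the assembly from the patch: the name
quarter_round_x4 and the array-per-row layout are hypothetical, and each
loop iteration stands in for one lane of a vector operation such as
vadduwm.

```c
#include <stdint.h>

#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* One ChaCha quarter round on four lanes at once. Each array holds one
   row of the 4x4 state; lane i of every array forms one column.
   All arithmetic is 32-bit and wraps modulo 2^32. */
static void
quarter_round_x4 (uint32_t a[4], uint32_t b[4], uint32_t c[4], uint32_t d[4])
{
  for (int i = 0; i < 4; i++)
    {
      a[i] += b[i]; d[i] = ROTL32 (16, d[i] ^ a[i]);
      c[i] += d[i]; b[i] = ROTL32 (12, b[i] ^ c[i]);
      a[i] += b[i]; d[i] = ROTL32 (8, d[i] ^ a[i]);
      c[i] += d[i]; b[i] = ROTL32 (7, b[i] ^ c[i]);
    }
}
```

With lane 0 set to the quarter-round test vector from RFC 8439, section
2.1.1 (a = 0x11111111, b = 0x01020304, c = 0x9b8d6f43, d = 0x01234567),
lane 0 should come out as a = 0xea2a92f4, b = 0xcb1cf8ce, c = 0x4581472e,
d = 0x5881c4bb.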
On Fri, Sep 25, 2020 at 7:43 AM Maamoun TK wrote:
> ...
> > I'm not sure where it fits under powerpc64. The code doesn't need any
> > cryptographic extensions, but it depends on vector instructions as well
> > as VSX registers (for the unaligned load and store instructions). So I'd
> > need
>
>
> I'm trying to learn a bit of ppc assembly. Below is an implementation of
> _chacha_core. Seems to work, when tested on gcc112.fsffrance.org (just
> put the file in the powerpc64 directory and reconfigure). This machine
> is little-endian, I haven't yet tested on big-endian.
>
Great work.
It's gotten better with this patch, now it takes 0.49 seconds to
execute under the same circumstances.
On Fri, Sep 25, 2020 at 9:59 AM Niels Möller wrote:
> Maamoun TK writes:
>
> >> What's the speedup you get from assembly gcm_fill? I see the C
> >> implementation uses memcpy and
Maamoun TK writes:
>> What's the speedup you get from assembly gcm_fill? I see the C
>> implementation uses memcpy and WRITE_UINT32, and is likely significantly
>> slower than the ctr_fill16 in ctr.c. But it could be improved using
>> portable means. If done well, it should be a very small