Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
On Thu, Sep 27, 2018 at 6:27 PM Andy Lutomirski wrote:
> I would add another consideration: if you can get better latency with 
> negligible overhead (0.1%? 0.05%), then that might make sense too. For 
> example, it seems plausible that checking need_resched() every few blocks 
> adds basically no overhead, and the SIMD helpers could do this themselves or 
> perhaps only ever do a block at a time.
>
> need_resched() costs a cacheline access, but it’s usually a hot cacheline, 
> and the actual check is just whether a certain bit in memory is set.

Yes, you're right; I do plan to check quite often rather than seldom,
for this reason. I've been toying with the idea of instead processing
65k (the maximum size of a UDP packet) at a time before checking
need_resched(), but armed with the 20µs figure, this isn't remotely
possible on most hardware. So I'll stick with the original
conservative plan of checking very often, and not diverge from what
the present crypto API already does in this regard.
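
As a rough illustration of the arithmetic behind that conclusion, here is a
small sketch in plain userspace C. The clock rate (1000 MHz) and cipher cost
(5.0 cycles per byte) used in the usage note are hypothetical placeholders,
not measurements from this thread, and the helper names are made up for the
example. Only the 20µs budget and the 65k packet size come from the
discussion.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes that fit in a time budget.
 * 1 MHz is one cycle per microsecond, so cycles = budget_us * mhz.
 * The cipher cost is given in tenths of a cycle per byte to stay in
 * integer arithmetic (50 means 5.0 cycles/byte).
 */
static uint64_t bytes_within_budget(uint64_t budget_us, uint64_t mhz,
                                    uint64_t tenths_cpb)
{
        assert(tenths_cpb > 0);
        return budget_us * mhz * 10 / tenths_cpb;
}

/* Microseconds needed to process len bytes, rounded up. */
static uint64_t usecs_for_len(uint64_t len, uint64_t mhz,
                              uint64_t tenths_cpb)
{
        uint64_t denom = mhz * 10;

        assert(mhz > 0);
        return (len * tenths_cpb + denom - 1) / denom;
}
```

With those assumed numbers, bytes_within_budget(20, 1000, 50) gives 4000
bytes, while usecs_for_len(65535, 1000, 50) gives 328µs, which is more than
an order of magnitude over the 20µs target.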


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Andy Lutomirski



> On Sep 27, 2018, at 8:19 AM, Jason A. Donenfeld wrote:
> 
> Hey again Thomas,
> 
>> On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld wrote:
>> 
>> Hi Thomas,
>> 
>> I'm trying to optimize this for crypto performance while still taking
>> into account preemption concerns. I'm having a bit of trouble figuring
>> out a way to determine numerically what the upper bounds for this
>> stuff look like. I'm sure I could pick a pretty sane number that's
>> arguably okay -- and way under the limit -- but I still am interested
>> in determining what that limit actually is. I was hoping there'd be a
>> debugging option called, "warn if preemption is disabled for too
>> long", or something, but I couldn't find anything like that. I'm also
>> not quite sure what the latency limits are, to just compute this with
>> a formula. Essentially what I'm trying to determine is:
>> 
>> preempt_disable();
>> asm volatile(".fill N, 1, 0x90;");
>> preempt_enable();
>> 
>> What is the maximum value of N for which the above is okay? What
>> technique would you generally use in measuring this?
>> 
>> Thanks,
>> Jason
> 
> From talking to Peter (now CC'd) on IRC, it sounds like what you're
> mostly interested in is clocktime latency on reasonable hardware, with
> a goal of around ~20µs as a maximum upper bound? I don't expect to get
> anywhere near this value at all, but if you can confirm that's a
> decent ballpark, it would make for some interesting calculations.
> 
> 

I would add another consideration: if you can get better latency with 
negligible overhead (0.1%? 0.05%), then that might make sense too. For example, 
it seems plausible that checking need_resched() every few blocks adds basically 
no overhead, and the SIMD helpers could do this themselves or perhaps only ever 
do a block at a time.

need_resched() costs a cacheline access, but it’s usually a hot cacheline, and 
the actual check is just whether a certain bit in memory is set.
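
To make the "check every few blocks" idea concrete, here is a minimal
userspace sketch. It is only a model: need_resched_stub() and relax_stub()
are hypothetical stand-ins for the kernel's need_resched() and for whatever
the SIMD helpers would do to yield, BLOCKS_PER_CHECK is an arbitrary value,
and the XOR in process_block_stub() is a placeholder for real cipher work.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define BLOCK_SIZE       64u /* ChaCha20 block size in bytes */
#define BLOCKS_PER_CHECK 4u  /* hypothetical value for "every few blocks" */

static volatile bool resched_flag; /* models the "certain bit in memory" */
static unsigned int relax_count;   /* how many times we yielded */

/* Stand-in for need_resched(): a single flag read. */
static bool need_resched_stub(void)
{
        return resched_flag;
}

/* Stand-in for yielding; here it just counts and clears the flag. */
static void relax_stub(void)
{
        relax_count++;
        resched_flag = false;
}

/* Placeholder for one block of cipher work. */
static void process_block_stub(unsigned char *dst, const unsigned char *src,
                               size_t n)
{
        for (size_t i = 0; i < n; i++)
                dst[i] = src[i] ^ 0x5a;
}

/* Process len bytes, polling the flag every BLOCKS_PER_CHECK blocks.
 * Returns the number of polls performed. */
static size_t process_with_checks(unsigned char *dst,
                                  const unsigned char *src, size_t len)
{
        size_t blocks = 0, checks = 0;

        assert(dst != NULL && src != NULL);
        while (len > 0) {
                size_t n = len < BLOCK_SIZE ? len : BLOCK_SIZE;

                process_block_stub(dst, src, n);
                dst += n;
                src += n;
                len -= n;
                /* No point polling after the final block. */
                if (++blocks % BLOCKS_PER_CHECK == 0 && len > 0) {
                        checks++;
                        if (need_resched_stub())
                                relax_stub();
                }
        }
        return checks;
}
```

For a 1024-byte buffer (16 blocks) this polls three times, so the added work
per poll is one flag read amortized over four blocks.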

Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
Hey again Thomas,

On Thu, Sep 27, 2018 at 3:26 PM Jason A. Donenfeld wrote:
>
> Hi Thomas,
>
> I'm trying to optimize this for crypto performance while still taking
> into account preemption concerns. I'm having a bit of trouble figuring
> out a way to determine numerically what the upper bounds for this
> stuff look like. I'm sure I could pick a pretty sane number that's
> arguably okay -- and way under the limit -- but I still am interested
> in determining what that limit actually is. I was hoping there'd be a
> debugging option called, "warn if preemption is disabled for too
> long", or something, but I couldn't find anything like that. I'm also
> not quite sure what the latency limits are, to just compute this with
> a formula. Essentially what I'm trying to determine is:
>
> preempt_disable();
> asm volatile(".fill N, 1, 0x90;");
> preempt_enable();
>
> What is the maximum value of N for which the above is okay? What
> technique would you generally use in measuring this?
>
> Thanks,
> Jason

From talking to Peter (now CC'd) on IRC, it sounds like what you're
mostly interested in is clocktime latency on reasonable hardware, with
a goal of around ~20µs as a maximum upper bound? I don't expect to get
anywhere near this value at all, but if you can confirm that's a
decent ballpark, it would make for some interesting calculations.

Regards,
Jason


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-27 Thread Jason A. Donenfeld
Hi Thomas,

I'm trying to optimize this for crypto performance while still taking
into account preemption concerns. I'm having a bit of trouble figuring
out a way to determine numerically what the upper bounds for this
stuff look like. I'm sure I could pick a pretty sane number that's
arguably okay -- and way under the limit -- but I still am interested
in determining what that limit actually is. I was hoping there'd be a
debugging option called, "warn if preemption is disabled for too
long", or something, but I couldn't find anything like that. I'm also
not quite sure what the latency limits are, to just compute this with
a formula. Essentially what I'm trying to determine is:

preempt_disable();
asm volatile(".fill N, 1, 0x90;");
preempt_enable();

What is the maximum value of N for which the above is okay? What
technique would you generally use in measuring this?

Thanks,
Jason


Re: [PATCH net-next v6 07/23] zinc: ChaCha20 ARM and ARM64 implementations

2018-09-26 Thread Ard Biesheuvel
On Wed, 26 Sep 2018 at 17:50, Jason A. Donenfeld wrote:
>
> On Wed, Sep 26, 2018 at 5:45 PM Jason A. Donenfeld wrote:
> > So what you have in mind is something like calling simd_relax() every
> > 4096 bytes or so?
>
> That was actually pretty easy, putting together both of your suggestions:
>
> static inline bool chacha20_arch(struct chacha20_ctx *state, u8 *dst,
>                                  u8 *src, size_t len,
>                                  simd_context_t *simd_context)
> {
>         while (len > PAGE_SIZE) {
>                 chacha20_arch(state, dst, src, PAGE_SIZE, simd_context);
>                 len -= PAGE_SIZE;
>                 src += PAGE_SIZE;
>                 dst += PAGE_SIZE;
>                 simd_relax(simd_context);
>         }
>         if (IS_ENABLED(CONFIG_KERNEL_MODE_NEON) && chacha20_use_neon &&
>             len >= CHACHA20_BLOCK_SIZE * 3 && simd_use(simd_context))
>                 chacha20_neon(dst, src, len, state->key, state->counter);
>         else
>                 chacha20_arm(dst, src, len, state->key, state->counter);
>
>         state->counter[0] += (len + 63) / 64;
>         return true;
> }

Nice one :-)

This works for me (but perhaps add a comment as well)
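
One thing worth checking in the chunked version is that the block counter
still advances by the same amount as a single call over the whole buffer
would. The sketch below is a userspace model of just that bookkeeping, not
the patch itself: it replaces the one-level recursion with an iterative
loop, does no cipher work, and assumes a 4096-byte page (MODEL_PAGE_SIZE and
the function names are made up for the example; the real PAGE_SIZE depends
on the architecture and configuration).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE  4096u /* assumed; stands in for PAGE_SIZE */
#define MODEL_BLOCK_SIZE 64u   /* ChaCha20 block size in bytes */

/* Counter advance for one call on len bytes: (len + 63) / 64. */
static uint32_t blocks_for(size_t len)
{
        return (uint32_t)((len + MODEL_BLOCK_SIZE - 1) / MODEL_BLOCK_SIZE);
}

/*
 * Model of the chunked loop: full pages first, then the tail.
 * Returns the total counter advance; *relaxes receives the number of
 * points at which the SIMD unit would have been yielded.
 */
static uint32_t chunked_counter_advance(size_t len, unsigned int *relaxes)
{
        uint32_t counter = 0;

        assert(relaxes != NULL);
        *relaxes = 0;
        while (len > MODEL_PAGE_SIZE) {
                counter += blocks_for(MODEL_PAGE_SIZE);
                len -= MODEL_PAGE_SIZE;
                (*relaxes)++;
        }
        counter += blocks_for(len); /* tail of 0..MODEL_PAGE_SIZE bytes */
        return counter;
}
```

Because the chunk size is a multiple of 64, every full chunk contributes a
whole number of blocks and only the tail is rounded up. For a 10000-byte
input the model yields twice and advances the counter by 157, the same as
blocks_for(10000).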