Re: Data Independent Timing on arm64

2022-10-02 Thread Theo de Raadt
ok, let's give it a shot then.

And watch for behaviour changes...


Mark Kettenis  wrote:

> > From: "Theo de Raadt" 
> > Date: Sat, 01 Oct 2022 09:37:01 -0600
> > 
> > Mark Kettenis  wrote:
> > 
> > > Armv8.4 introduced a feature that provides data independent timing for
> > > data processing instructions.  This feature is obviously introduced to
> > > mitigate timing side-channel attacks.  Presumably enabling the feature
> > > has some impact on performance as it would disable certain
> > > optimizations to guarantee constant-time execution of instructions.
> > 
> > But what impact does it have on all regular code?  I cannot find that
> > answer in the quoted email thread.  And I don't even see the question
> > being asked, because those people aren't talking about changing the
> > cpu to this mode permanently & completely.
> > 
> > > The only hardware that implements this feature that I've seen so far
> > > is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> > > I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
> > > "eopenssl30 speed" bound to a "performance" core.
> > 
> > That is testing the performance of a program which uses a very narrow
> > set of cpu behaviours.  For example, it has almost no crossings in & out
> > of the kernel: system calls and page faults.  The class of operations
> > being tested are mostly pure compute against the register set rather
> > than memory, and when it does perform memory loads, it does so in quite
> > linear fashion.  It also does relatively few memory writes.
> 
> I also tested kernel builds.  I don't see any evidence of any
> significant impact on those.  
> 
> > It is a program that would be slower if they implimented the feature
> > poorly, but using such a small subset of system behaviours, I don't
> > think it can identify things which might be slower in part, and thus
> > have an effect on whole system performance.
> 
> The ARM ARM is rather explicit on the instructions that might be
> affected by those flags.  That list makes me believe any performance
> impact would show up most prominantly in code that uses the ARMv8
> crypto instructions.
> 
> > > I could not detect a significant slowdown with this feature enabled.
> > 
> > Then why did they make it a chicken bit?  Why did the cpu makers not
> > simply make the cpus always act this way?  There must be a reason,
> > probably undisclosed.  They have been conservative for some reasons.
> 
> I wouldn't call it a chicken bit.  But obviously those in charge of
> the architecture spec anticipated some significant speedup from having
> instructions that have data-dependent timings.  It appears that
> Apple's implementation doesn't though.
> 
> What might be going on here is that Apple is just ticking boxes to
> make their implementation spec compliant.  The feature is required for
> ARMv8.4 and above.  But if none of their instructions have
> data-dependent timing, they could implement the "chicken bit" without
> it actually having an effect.
> 
> > Is there an impact on the performance of building a full snapshot?  That
> > at least has a richer use of all code flows, with lots of kernel crossings,
> > as opposed to the openssl speed command.
> 
> I didn't test full snapshot builds.  I think the kernel builds I did
> should be representative enough.  But I certainly can do more
> benchmarking if you think that would be desirable.
> 
> > > Therefore I think we should enable this feature by default on OpenBSD.
> > 
> > Maybe.  But I am a bit unconvinced.
> > 
> > > The use of this feature is still being discussed in the wider
> > > comminity.  See for example the following thread:
> > > 
> > >   
> > > https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
> > > 
> > 
> > That discussion is talking about providing the ability for programs to
> > request that mode of operation.  They are not talking about switching
> > into that mode for the kernel and all processes.
> > 
> > That seems like a big difference.
> > 
> > >From a security perspective it might make some sense, but this isn't
> > like some x86 catastrophy level speculation. I'm not sure there is
> > enough evidence yet that this is required for all modes of operation.
> 
> In principle this should be only necessary for code that is sensitive
> to timing side-channel attacks.  But the way I see it, the ecosystem
> will need (a lot of) time to figure out how to enable this mode around
> the bits of code where it *does* matter.
> 
> > > On arm64, the feature can be controlled from userland, so even if we
> > > turn it on by default, userland code can still make its own decisions
> > > on whether it wants the feature enabled or disabled.  We may have to
> > > expose the ID_AA64PFR0_EL1 register to userland when code shows uo
> > > that attempts to do that.
> > 
> > I suspect that is this will go.  Programs with specific libraries would
> > then request the featu

Re: Data Independent Timing on arm64

2022-10-01 Thread Mark Kettenis
> From: "Theo de Raadt" 
> Date: Sat, 01 Oct 2022 09:37:01 -0600
> 
> Mark Kettenis  wrote:
> 
> > Armv8.4 introduced a feature that provides data independent timing for
> > data processing instructions.  This feature is obviously introduced to
> > mitigate timing side-channel attacks.  Presumably enabling the feature
> > has some impact on performance as it would disable certain
> > optimizations to guarantee constant-time execution of instructions.
> 
> But what impact does it have on all regular code?  I cannot find that
> answer in the quoted email thread.  And I don't even see the question
> being asked, because those people aren't talking about changing the
> cpu to this mode permanently & completely.
> 
> > The only hardware that implements this feature that I've seen so far
> > is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> > I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
> > "eopenssl30 speed" bound to a "performance" core.
> 
> That is testing the performance of a program which uses a very narrow
> set of cpu behaviours.  For example, it has almost no crossings in & out
> of the kernel: system calls and page faults.  The class of operations
> being tested are mostly pure compute against the register set rather
> than memory, and when it does perform memory loads, it does so in quite
> linear fashion.  It also does relatively few memory writes.

I also tested kernel builds.  I don't see any evidence of any
significant impact on those.  

> It is a program that would be slower if they implimented the feature
> poorly, but using such a small subset of system behaviours, I don't
> think it can identify things which might be slower in part, and thus
> have an effect on whole system performance.

The ARM ARM is rather explicit on the instructions that might be
affected by those flags.  That list makes me believe any performance
impact would show up most prominantly in code that uses the ARMv8
crypto instructions.

> > I could not detect a significant slowdown with this feature enabled.
> 
> Then why did they make it a chicken bit?  Why did the cpu makers not
> simply make the cpus always act this way?  There must be a reason,
> probably undisclosed.  They have been conservative for some reasons.

I wouldn't call it a chicken bit.  But obviously those in charge of
the architecture spec anticipated some significant speedup from having
instructions that have data-dependent timings.  It appears that
Apple's implementation doesn't though.

What might be going on here is that Apple is just ticking boxes to
make their implementation spec compliant.  The feature is required for
ARMv8.4 and above.  But if none of their instructions have
data-dependent timing, they could implement the "chicken bit" without
it actually having an effect.

> Is there an impact on the performance of building a full snapshot?  That
> at least has a richer use of all code flows, with lots of kernel crossings,
> as opposed to the openssl speed command.

I didn't test full snapshot builds.  I think the kernel builds I did
should be representative enough.  But I certainly can do more
benchmarking if you think that would be desirable.

> > Therefore I think we should enable this feature by default on OpenBSD.
> 
> Maybe.  But I am a bit unconvinced.
> 
> > The use of this feature is still being discussed in the wider
> > comminity.  See for example the following thread:
> > 
> >   
> > https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
> > 
> 
> That discussion is talking about providing the ability for programs to
> request that mode of operation.  They are not talking about switching
> into that mode for the kernel and all processes.
> 
> That seems like a big difference.
> 
> >From a security perspective it might make some sense, but this isn't
> like some x86 catastrophy level speculation. I'm not sure there is
> enough evidence yet that this is required for all modes of operation.

In principle this should be only necessary for code that is sensitive
to timing side-channel attacks.  But the way I see it, the ecosystem
will need (a lot of) time to figure out how to enable this mode around
the bits of code where it *does* matter.

> > On arm64, the feature can be controlled from userland, so even if we
> > turn it on by default, userland code can still make its own decisions
> > on whether it wants the feature enabled or disabled.  We may have to
> > expose the ID_AA64PFR0_EL1 register to userland when code shows uo
> > that attempts to do that.
> 
> I suspect that is this will go.  Programs with specific libraries would
> then request the feature on behalf of their entire runtime.  Something
> like a constructor (or startup function) in libcrypto would enable it.
> Meaning this program "might" care about timingsafe behaviour, so let's
> enable it, for the remainder of the life of that program.



Re: Data Independent Timing on arm64

2022-10-01 Thread Theo de Raadt
Mark Kettenis  wrote:

> Armv8.4 introduced a feature that provides data independent timing for
> data processing instructions.  This feature is obviously introduced to
> mitigate timing side-channel attacks.  Presumably enabling the feature
> has some impact on performance as it would disable certain
> optimizations to guarantee constant-time execution of instructions.

But what impact does it have on all regular code?  I cannot find that
answer in the quoted email thread.  And I don't even see the question
being asked, because those people aren't talking about changing the
cpu to this mode permanently & completely.

> The only hardware that implements this feature that I've seen so far
> is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
> I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
> "eopenssl30 speed" bound to a "performance" core.

That is testing the performance of a program which uses a very narrow
set of cpu behaviours.  For example, it has almost no crossings in & out
of the kernel: system calls and page faults.  The class of operations
being tested are mostly pure compute against the register set rather
than memory, and when it does perform memory loads, it does so in quite
linear fashion.  It also does relatively few memory writes.

It is a program that would be slower if they implimented the feature
poorly, but using such a small subset of system behaviours, I don't
think it can identify things which might be slower in part, and thus
have an effect on whole system performance.

> I could not detect a significant slowdown with this feature enabled.

Then why did they make it a chicken bit?  Why did the cpu makers not
simply make the cpus always act this way?  There must be a reason,
probably undisclosed.  They have been conservative for some reasons.

Is there an impact on the performance of building a full snapshot?  That
at least has a richer use of all code flows, with lots of kernel crossings,
as opposed to the openssl speed command.

> Therefore I think we should enable this feature by default on OpenBSD.

Maybe.  But I am a bit unconvinced.

> The use of this feature is still being discussed in the wider
> comminity.  See for example the following thread:
> 
>   
> https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc
> 

That discussion is talking about providing the ability for programs to
request that mode of operation.  They are not talking about switching
into that mode for the kernel and all processes.

That seems like a big difference.

>From a security perspective it might make some sense, but this isn't
like some x86 catastrophy level speculation. I'm not sure there is
enough evidence yet that this is required for all modes of operation.

> On arm64, the feature can be controlled from userland, so even if we
> turn it on by default, userland code can still make its own decisions
> on whether it wants the feature enabled or disabled.  We may have to
> expose the ID_AA64PFR0_EL1 register to userland when code shows uo
> that attempts to do that.

I suspect that is this will go.  Programs with specific libraries would
then request the feature on behalf of their entire runtime.  Something
like a constructor (or startup function) in libcrypto would enable it.
Meaning this program "might" care about timingsafe behaviour, so let's
enable it, for the remainder of the life of that program.




Data Independent Timing on arm64

2022-10-01 Thread Mark Kettenis
Armv8.4 introduced a feature that provides data independent timing for
data processing instructions.  This feature is obviously introduced to
mitigate timing side-channel attacks.  Presumably enabling the feature
has some impact on performance as it would disable certain
optimizations to guarantee constant-time execution of instructions.

The only hardware that implements this feature that I've seen so far
is Apple's Icestorm/Firestorm (M1) and Avalanche/Blizzard (M2) cores.
I ran some benchmarks on an M2 Macbook Air.  In particular, I ran
"eopenssl30 speed" bound to a "performance" core.  I could not detect
a significant slowdown with this feature enabled.  Therefore I think
we should enable this feature by default on OpenBSD.  If at some point
a processor core appears where this feature does have a significant
performance impact, we can reconsider.

The use of this feature is still being discussed in the wider
comminity.  See for example the following thread:

  
https://lore.kernel.org/linux-arm-kernel/ywgcrqutxmx0w...@gmail.com/T/#mfcba14511c69461bd8921fef758baae120d090dc

On arm64, the feature can be controlled from userland, so even if we
turn it on by default, userland code can still make its own decisions
on whether it wants the feature enabled or disabled.  We may have to
expose the ID_AA64PFR0_EL1 register to userland when code shows uo
that attempts to do that.

The diff below enables the feature for both the kernel and for
userland processes.

ok?


Index: arch/arm64/arm64/cpu.c
===
RCS file: /cvs/src/sys/arch/arm64/arm64/cpu.c,v
retrieving revision 1.70
diff -u -p -r1.70 cpu.c
--- arch/arm64/arm64/cpu.c  15 Sep 2022 01:57:52 -  1.70
+++ arch/arm64/arm64/cpu.c  1 Oct 2022 10:05:15 -
@@ -756,6 +756,7 @@ void
 cpu_init(void)
 {
uint64_t id_aa64mmfr1, sctlr;
+   uint64_t id_aa64pfr0;
uint64_t tcr;
 
WRITE_SPECIALREG(ttbr0_el1, pmap_kernel()->pm_pt0pa);
@@ -774,6 +775,11 @@ cpu_init(void)
sctlr &= ~SCTLR_SPAN;
WRITE_SPECIALREG(sctlr_el1, sctlr);
}
+
+   /* Enable DIT. */
+   id_aa64pfr0 = READ_SPECIALREG(id_aa64pfr0_el1);
+   if (ID_AA64PFR0_DIT(id_aa64pfr0) >= ID_AA64PFR0_DIT_IMPL)
+   __asm volatile (".arch armv8.4-a; msr dit, #1");
 
/* Initialize debug registers. */
WRITE_SPECIALREG(mdscr_el1, DBG_MDSCR_TDCC);
Index: arch/arm64/arm64/machdep.c
===
RCS file: /cvs/src/sys/arch/arm64/arm64/machdep.c,v
retrieving revision 1.72
diff -u -p -r1.72 machdep.c
--- arch/arm64/arm64/machdep.c  5 Sep 2022 19:18:56 -   1.72
+++ arch/arm64/arm64/machdep.c  1 Oct 2022 10:05:15 -
@@ -433,7 +433,7 @@ setregs(struct proc *p, struct exec_pack
tf->tf_sp = stack;
tf->tf_lr = pack->ep_entry;
tf->tf_elr = pack->ep_entry; /* ??? */
-   tf->tf_spsr = PSR_M_EL0t;
+   tf->tf_spsr = PSR_M_EL0t | PSR_DIT;
 
retval[1] = 0;
 }
Index: arch/arm64/include/armreg.h
===
RCS file: /cvs/src/sys/arch/arm64/include/armreg.h,v
retrieving revision 1.21
diff -u -p -r1.21 armreg.h
--- arch/arm64/include/armreg.h 29 Aug 2022 02:01:18 -  1.21
+++ arch/arm64/include/armreg.h 1 Oct 2022 10:05:15 -
@@ -602,9 +602,13 @@
 #definePSR_I   0x0080
 #definePSR_A   0x0100
 #definePSR_D   0x0200
+#definePSR_SSBS0x1000
 #definePSR_IL  0x0010
 #definePSR_SS  0x0020
 #definePSR_PAN 0x0040
+#definePSR_UAO 0x0080
+#definePSR_DIT 0x0100
+#definePSR_TCO 0x0200
 #definePSR_V   0x1000
 #definePSR_C   0x2000
 #definePSR_Z   0x4000